High Availability for Panzura Nodes

This guide describes the High Availability (HA) solutions for protecting Panzura nodes. HA enables a standby node to take over for an active node that becomes unavailable.

Panzura Node HA Solutions

The following HA solutions are supported:

HA Local: An active node is protected by a dedicated, passive standby, similar to the approach used by legacy enterprise storage products. When the active node fails, the standby assumes its identity, takes over ownership of the file system, and resumes node operations. The takeover can be automatic or manual. The following HA Local options are supported:

Local: The active and standby nodes have different hostnames and IP addresses.

Local with shared address: The active node and passive standby have an additional shared hostname and IP address, which simplifies the takeover process. This is required for Auto Failover. (Maximum length of the shared hostname is 15 characters.)

HA Global: One or more nodes are protected by one or more shared standbys, which can be separated geographically from the nodes they protect.

Sample HA Deployment

The following figure shows a Panzura deployment with three working sites—Los Angeles, London, and Paris—and two sites provisioned for HA—Phoenix, and Amsterdam. A Panzura node is physically deployed at each site. Users at the three working sites connect to their local node, have a complete view of the shared file system, and experience LAN access speeds to the data in the global file system.

The figure shows HA options deployed as follows:

HA-Global: The node in Amsterdam protects the subordinate nodes in London and Paris, as well as the Master node in Los Angeles.

HA-Local: The node in Phoenix is dedicated to protecting the Master node in Los Angeles.

Auto Failover

HA Local can be configured for Auto Failover. Auto Failover enables either node in an active-standby pair that shares a virtual IP (VIP) address to perform a failover automatically.

In an Auto Failover configuration, the active and standby nodes regularly exchange health and status information in two ways (a sketch of a hypothetical state file follows this list):

  • Directly, over a peer-to-peer SSH connection

  • Indirectly, by posting status information to state files in the cloud
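
The format of these state files is not published in this guide; the following is a minimal sketch, assuming a JSON payload, of the kind of health record each node might post to the cloud. All field names are illustrative assumptions, not the actual on-the-wire format.

```python
import json
import time

# Hypothetical state-file payload; field names are illustrative only.
state = {
    "hostname": "node-la",          # assumed node name
    "role": "active",               # "active" or "standby"
    "timestamp": int(time.time()),  # peers judge staleness against this
    "cloud_ok": True,               # recent uploads/downloads succeeded
    "filesystem_ok": True,          # no excessive metaslab errors
    "scheduled_reboot": False,      # suppresses failover during maintenance
}
print(json.dumps(state, indent=2))
```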

Auto Failover Triggers

In an HA Local pair configured for Auto Failover, either the active or standby can trigger a failover.

Causes for the Active Node to Initiate Failover

When the active node loses connectivity to both the cloud and its peer, it changes its state to standby and stops accepting user connections. This avoids a split-brain condition, because in this situation the standby has most likely already triggered a failover and become active. A minimal sketch of this decision appears below.
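
As a rough illustration, the active node's self-demotion decision reduces to a simple check. This is a sketch under the assumption that cloud and peer reachability are available as booleans; it is not the actual implementation.

```python
def active_should_step_down(cloud_reachable: bool, peer_reachable: bool) -> bool:
    # Sketch only: if the active node can reach neither the cloud nor its
    # peer, the standby has most likely already taken over, so the active
    # demotes itself to standby to avoid a split-brain condition.
    return not cloud_reachable and not peer_reachable
```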

Causes for the Standby Node to Initiate Failover

The standby triggers a failover when any of the following conditions occurs (a minimal sketch of this decision logic follows the list):

  • The peer connection is down while cloud upload and download are healthy, and the timestamp in the active node's state file has not been updated for a configured time.

  • The active node's state file indicates that it changed state. This happens when the active node determines that its health is bad and cannot communicate with the standby over the local path.

  • The active node triggers a takeover.

If the standby cannot reach the cloud, it keeps its current state and does not trigger a failover.
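
The following is a minimal sketch of the standby's decision logic, assuming the inputs summarize the peer link, cloud health, and the contents of the active node's state file. The threshold default comes from the Auto-Configuration Options table later in this guide; everything else is illustrative.

```python
import time

PEER_UPDATE_THRESHOLD = 200  # seconds (default from the options table)

def standby_should_take_over(peer_up: bool, cloud_ok: bool,
                             active_timestamp: float,
                             active_changed_state: bool,
                             takeover_requested: bool) -> bool:
    # Sketch only; not the actual implementation.
    if not cloud_ok:
        return False  # standby cannot reach the cloud: keep current state
    stale = (time.time() - active_timestamp) > PEER_UPDATE_THRESHOLD
    if not peer_up and stale:
        return True   # peer link down and the active's state file is stale
    if active_changed_state or takeover_requested:
        return True   # active reported bad health or requested a takeover
    return False
```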

Status Information Exchanged by the Active and Standby Nodes

The active and standby nodes exchange information that enables each of them to decide whether to trigger a takeover. The following information is exchanged between the two peers over an SSH connection that is established automatically between the two nodes (a hypothetical summary structure follows the list):

  • Cloud status: Determined by a configured number of upload/download failures over a period of time.

  • File system status: Status of the file system (based on number of metaslab errors encountered).

  • Import status: Based on whether the node successfully imports the file system after a reboot (either forced or unscheduled).

  • Critical process failed: Triggers failover if all remaining retries for a process fail.

  • Snapshot sync status: Determines eligibility for a failover based on snapshot sync status. If the active node is 50 or more snapshots behind, failover is not allowed.

  • Scheduled reboot: If the active node is scheduled to reboot, HA takes this into account and does not perform a failover when the rebooting node goes offline during its reboot.

  • State change: Used to determine whether a takeover is impending.
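
To make the exchanged fields concrete, here is a hypothetical container for them; the names are illustrative and do not reflect the actual wire format.

```python
from dataclasses import dataclass

@dataclass
class PeerStatus:
    # Illustrative field names only; not the actual exchanged format.
    cloud_ok: bool                 # upload/download failures within limits
    filesystem_ok: bool            # metaslab error count acceptable
    import_ok: bool                # file system imported cleanly after reboot
    critical_process_failed: bool  # all remaining retries for a process failed
    dirty_snapshots: int           # failover is blocked at 50 or more
    scheduled_reboot: bool         # expected outage; do not fail over
    state_changed: bool            # signals an impending takeover

    def failover_eligible(self) -> bool:
        # Eligibility per the list above: snapshot sync must be close enough,
        # and an expected maintenance reboot must not be in progress.
        return self.dirty_snapshots < 50 and not self.scheduled_reboot
```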

Requirements for Auto Failover

  • Both the active and standby nodes must be in the same subnet.

  • They must use a shared IP address and hostname, registered in DNS. Clients and other devices reach whichever node is active through the shared address, so DNS does not need to be reconfigured after a failover. A simple pre-flight check is sketched below.
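
This sketch of such a pre-flight check uses only the Python standard library; the addresses, hostname, and subnet are placeholder values.

```python
import ipaddress
import socket

def check_auto_failover_prereqs(active_ip: str, standby_ip: str,
                                shared_hostname: str, subnet: str) -> None:
    # Placeholder values are expected; adapt to your environment.
    net = ipaddress.ip_network(subnet)
    for ip in (active_ip, standby_ip):
        assert ipaddress.ip_address(ip) in net, f"{ip} is not in {subnet}"
    # The shared hostname's short name is limited to 15 characters.
    assert len(shared_hostname.split(".")[0]) <= 15, "shared hostname too long"
    # The shared hostname must already be registered in DNS.
    shared_ip = socket.gethostbyname(shared_hostname)
    assert ipaddress.ip_address(shared_ip) in net, "shared IP is not in subnet"

# Example with hypothetical values:
# check_auto_failover_prereqs("10.0.0.11", "10.0.0.12",
#                             "pz-shared.example.com", "10.0.0.0/24")
```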

Auto-Failover Default Setting

Auto Failover is enabled by default in new HA Local configurations created on nodes running 7.1. For HA configurations created in earlier software releases, you can upgrade to the 7.1 release and enable Auto Failover using the WebUI.

 

Note the following about Auto Failover support:

  • Auto Failover is supported in deployments where the standby is configured as an HA-Local node.

  • Auto Failover requires a virtual IP (VIP) address and is supported only for HA configurations that use a VIP.

  • Auto Failover is not supported on Panzura Freedom Virtual Hard Disk (VHD) for Microsoft. This is because VIPs are not supported in the Azure cloud.

Setting Up HA

HA can be configured either during initial setup using the setup wizard, or later using the management WebUI.

Setting Up Local HA with Auto Failover

To set up Local HA with Auto Failover, use the following steps.

During Initial Setup

When configuring the secondary node (the one that initially will be the standby), use the following settings in the Role section of the wizard:

  • Configuration Mode: HA Local

  • Auto Failover: Enable

  • Shared DNS Hostname: DNS hostname shared by the active-standby pair.

  • Shared IP Address: IP address that is mapped to the shared hostname on the DNS server.

  • Peer-to-Peer Authentication Key: Click Upload to load the Master node’s authentication key onto the secondary (HA Local) node.

No specific settings are required on the primary node (the one that initially will be the active node).

Setting Up Auto Failover After Software Upgrade

If the nodes you plan to configure for Auto Failover are already deployed, use the following steps:

  1. Navigate to the Master node and log in to the WebUI.

  2. Navigate to Management > High Availability.

  3. Set the Virtual IP option to enable, if not already enabled.

  4. Enter the shared hostname and IP address for the active-standby pair of nodes.

  5. Set the Auto Failover option to enable.

  6. Click Done.

  7. Click Save to write the changes to the configuration.

Auto-Configuration Options

Primary node: The node that initially will be the active node in the Auto Failover pair. This node remains the active node until there is a failover. Default: none.

Virtual IP: Enables the active-standby pair to use a shared IP address and hostname. The Virtual IP (VIP) option allows clients and other devices to reach whichever node is active, even after a failover has occurred. Default: none.

Shared IP: IP address shared by the active-standby pair. This is the address that is mapped to the pair's shared hostname in DNS. Default: none.

Time To Wait for Maintenance Reboot: If the active node has a scheduled reboot (typically for maintenance), this is the number of minutes the standby node allows for the reboot to complete. This timer prevents unnecessary failovers that would otherwise occur because the standby node assumes the active node is unavailable. Default: 12.

Number of Allowed Dirty Snapshots: Maximum number of un-synced snapshots the active node can have and still remain eligible for failover to the standby. The un-synced snapshots reside in the active node's dirty cache, in the lost+found folder on the failed node. If the failed node can be rebooted, the un-synced snapshots can be recovered from this folder. Default: 50.

Peer Update Threshold: Maximum number of seconds the active and standby nodes wait for updates from one another. These updates are exchanged directly between the nodes over SSH. If the standby node does not receive an update from the active node before this threshold expires, failover may occur. (The other failover criteria must also be met; see Status Information Exchanged by the Active and Standby Nodes.) Default: 200.

Cloud Update Threshold: Maximum number of seconds the active and standby nodes are allowed to take to send status updates to the cloud. These updates are not exchanged directly between the nodes but instead are read by each node from the cloud. Default: 10.

Cloud Failure Count: Maximum number of acceptable cloud failures. Default: 20.
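
For reference, the defaults above can be summarized as follows. This is a hypothetical representation for scripting or documentation purposes; the authoritative settings are managed through the WebUI.

```python
# Hypothetical summary of the Auto Failover tunables and their defaults;
# the authoritative values are configured in the WebUI.
AUTO_FAILOVER_DEFAULTS = {
    "maintenance_reboot_wait_min": 12,  # minutes allowed for a scheduled reboot
    "allowed_dirty_snapshots": 50,      # max un-synced snapshots for failover
    "peer_update_threshold_sec": 200,   # max age of peer updates over SSH
    "cloud_update_threshold_sec": 10,   # max time to post status to the cloud
    "cloud_failure_count": 20,          # max acceptable cloud failures
}
```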

Setting Up HA Local (no Auto Failover)

To set up HA Local (with no Auto Failover), use the following steps.

During Initial Setup

When configuring the secondary node, use the following settings in the Role section of the wizard:

  • Configuration Mode: HA Local

  • Auto Failover: Disable

  • (optional) Shared DNS Hostname: DNS hostname shared by the active-standby pair.

  • (optional) Shared IP Address: IP address that is mapped to the shared hostname on the DNS server.

  • Peer-to-Peer Authentication Key: Click Upload to load the Master node’s authentication key onto the secondary (HA Local) node.

The shared hostname and IP address are optional. If you do not configure them, you must update DNS to point to the standby node after a failover; a minimal scripted sketch of such an update follows.
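
If you script the post-failover DNS change, it might look like the following sketch, which uses the third-party dnspython package. The zone, record name, TTL, addresses, and server are placeholders, and your DNS server must accept dynamic updates (authentication, such as TSIG, is omitted here).

```python
# Requires the third-party dnspython package (pip install dnspython).
# All names and addresses below are placeholders for your environment.
import dns.query
import dns.update

update = dns.update.Update("example.com")              # zone to modify
update.replace("panzura-node", 300, "A", "10.0.0.12")  # point the record at the new active node
dns.query.tcp(update, "10.0.0.1")                      # primary DNS server address
```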

Using the WebUI

After the HA Local active-standby nodes are deployed, you can change HA settings from the WebUI of the Master node.

  1. Navigate to the Master node and log in to the WebUI.

  2. Navigate to Management > High Availability.

  3. (Optional) Set the Virtual IP option to enable if you want the pair to use a shared hostname and IP address.

  4. If you enabled Virtual IP, enter the shared hostname and IP address for the active-standby pair of nodes.

  5. Leave the Auto Failover option set to disable.

  6. Click Done.

  7. Click Save to write the changes to the configuration.

Setting Up HA Global (no Auto Failover)

To set up HA Global (with no Auto Failover), use the following steps.

During Initial Setup

When configuring the node that will be the global standby, use the following settings in the Role section of the wizard:

  • Configuration Mode: HA Global

  • Peer-to-Peer Authentication Key: Click Upload to load the Master node’s authentication key onto the secondary (HA Global) node.

Using the WebUI

There are no separate WebUI steps for HA Global; use the setup wizard to configure the global standby.

Viewing HA Status

To view HA status, log in to the WebUI and navigate to the Cloud File System Dashboard. HA status is shown in the HA Node Status dashlet (located by default under the Active Node Status dashlet).

 

The HA Node Status dashlet shows the following information:

File System: Name of the file system protected by HA. To find the node associated with the file system, see the File System and Node Hostname columns in the Active Node Status dashlet. Note: For Global HA, the file system name is “All”.

Primary / Secondary: Names of the file systems that are being protected.

  • Local HA: Names of the file systems on the nodes in the active-standby pair. The primary node is the one that initially is the active node. An asterisk ( * ) indicates the node that is currently active.

  • Global HA: Name of the node that is configured as the standby for Global HA.

State: The HA state of the file system:

  • Ready: Failover is possible.

  • Not Ready: Failover is not possible because at least one of the nodes currently does not meet the criteria for failover. For example, in an Auto Failover configuration, if the active node’s number of dirty snapshots is too high, an automated failover is not allowed.

  • Down: The file system is down.

Auto Failover: State of the Auto Failover feature.

Auto Failover Details

To display Auto Failover details for an active-standby pair of nodes, click a row in the Primary / Secondary column of the HA Node Status dashlet. The Auto Failover Status dialog appears:

 

In this example, Auto Failover details are shown for the active-standby pair cc8 / cc7. The active node in the pair is cc8.

High Availability Takeover

To perform a takeover, the active node must have failed or been powered down. The HA standby can be activated only if the original active node is no longer online. There is a delay between the time the active node becomes unavailable and the time a takeover becomes possible.

Takeover for HA-Global or HA-Local (no shared address)

Follow this procedure for HA-Global, or for HA-Local without a shared address. For takeover with a shared address, see Takeover for HA-Local with shared address.

  1. Verify that the node that is the target of the takeover is down.

  2. Log in to the standby node if you are not already logged in, and verify that it is the standby: click the "i" in the upper right corner of the WebUI and verify that the Configuration Mode is Standby. If you have configured HA-Local and there are multiple HA-Local pairs in the CloudFS, verify that you are on the correct one.

  3. If this is a planned takeover, check the sync status by opening the Dashboard page and looking at the Active Node Status and Spare Node Status sections. If it is not a planned takeover, the process will take longer if the nodes are not synchronized.

  4. Select Maintenance > High Availability.

  5. Click Takeover.

  6. Click OK to continue.
    The Takeover process log appears and provides process information. IMPORTANT: Do not close the process log window.

  7. When the message “Takeover Complete” appears, click OK to continue.
    IMPORTANT: Do not change any DNS settings until the Takeover Complete message is displayed. Doing so could cause the nodes to become confused over which node is the source.

  8. On the DNS server, locate the active records for the previous active node and the standby node.

  9. Switch the IP addresses so that the standby node is now the active node.

  10. On the new active node, go to Configuration > Active Directory.

  11. Rejoin the node to the Active Directory.

  12. Verify that DNS is pointing to the correct node and that the IP addresses were properly switched.

  13. When the previously active node becomes available, you can bring it up as the new standby node.

The process is now complete. The new active node should be up and running.

Takeover for HA-Local with shared address

As described in Important Information About HA-Local with Shared Address, the shared IP address/hostname that you configured when setting up the HA-Local standby is the one that is used to access the active node during normal operations. You will use the same address/hostname to access the standby node when it comes up as the new active node.

  1. Verify that the active node is down.

  2. Log in to the standby node, and verify that it is the standby. Click the "i" in the upper right corner of the WebUI and verify that the Configuration Mode is Standby. If there are multiple HA-Local pairs in the CloudFS, verify that you are on the correct one.

  3. If this is a planned takeover, check the sync status by opening the Dashboard page and looking at the Active Node Status and Spare Node Status sections. If it is not a planned takeover, the process will take longer if the nodes are not synchronized.

  4. Select Maintenance > High Availability.

  5. Click Takeover.

  6. Click OK to continue.
    The Takeover process log appears and provides process information. IMPORTANT: Do not close the process log window.

  7. When the message “Takeover Complete” appears, click OK to continue.

  8. On the new active node, go to Configuration > Active Directory.

  9. Rejoin the node to Active Directory and proceed with normal operations.

  10. When the previously active node becomes available, you can bring it up as the new standby node.

The process is now complete. The new active node should be up and running.

 

Failback (HA-Local only)

Failback applies only to HA Local (with or without shared address). For HA Global, there is no failback process. Following an HA Global takeover, you must reinitialize and reconfigure the failed node to add it again as either an active or standby node.

  1. On the current active node (the former standby node), check the Dashboard page to verify that snapshots are in sync and that there is no dirty cache.

  2. Shut down the current active node.

  3. Bring the current standby node (the original active node) up and sign in.

  4. Follow the steps in High Availability Takeover to switch back to the original active node.