FortiGate
cgustave
Staff
Article Id 197460

Description

 
This article describes how to restore the Primary role to the 'preferred' Primary cluster unit after a failover has taken place.
The goal is to illustrate the use of the CLI command 'diag sys ha reset-uptime' in a simple scenario.
 
The command 'diag sys ha reset-uptime' is documented in the 'FortiOS Handbook: High Availability' document, available at https://docs.fortinet.com.
 
It is recommended to first read the 'Primary unit selection' chapter from 'FortiOS Handbook: High Availability'.
 
Note:
The use of 'diag sys ha reset-uptime' is only relevant when the cluster is configured with 'set override disable' (under 'config system ha').

 

Scope

 

FortiGate.


Solution

 

Pros and cons of 'set override enable':
 
Pros:
  • In a normal situation, the cluster's Primary is the unit with the highest priority, so the Primary is always the same unit, which makes it easier to identify.

 
Cons:
  • If the Primary fails and recovers, it triggers a double failover: the first one is expected because the other unit takes over, and the second one happens when the original Primary comes back up and reclaims the Primary role because override is enabled. If this behavior is best avoided, it is recommended to configure HA with 'set override disable', as shown in the sketch after this list.
  • If the cluster is set up to monitor a certain link and that link is flapping on one node only, but is stable on the other, the failover will happen repeatedly, possibly cutting off network access entirely.
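 
For reference, a minimal sketch of disabling override on a cluster member is shown below. The rest of the HA configuration (group name, mode, heartbeat interfaces, and so on) is assumed to already be in place and is omitted:
 
config system ha
    set override disable
end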
 
Once the original Primary has failed over in the 'set override disable' context, this is how to force it to take the Primary role again:
 
When the HA cluster is configured with 'set override disable', if the original 'Active' unit fails and re-joins the cluster after recovery, it is expected to join with the 'Backup' role (unless the HA uptime difference between the two units is less than 5 minutes; see the HA guide for more details).
 
It may be desired to have the original Primary become Primary again.
This operation needs to be manually triggered by the administrator at a controlled time (during a maintenance window, for instance).
To achieve this, the administrator has to connect to the current Primary's CLI (console, Telnet/SSH, or the GUI CLI console) and issue the command 'diag sys ha reset-uptime'.
This resets the current Primary's internal HA uptime timer to 0, forcing the Secondary to have the higher HA uptime and therefore be promoted as the new Primary.
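 
Note that the command is run on the unit that currently holds the Primary role (not on the preferred Primary); 'diag' is simply the abbreviated form of 'diagnose':
 
diagnose sys ha reset-uptime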
 
To illustrate this, consider the following example; a matching configuration sketch for both units is shown after their descriptions.
 
Unit A:
  • Is the preferred Primary.
  • Has a priority of 200.
  • Is configured with HA override disabled.
 
Unit B:
  • Is the preferred Secondary.
  • Has a priority of 100.
  • Is configured with HA override disabled.
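 
As an illustration only, the HA configuration relevant to this example would look similar to the following (other HA settings such as group name, heartbeat interfaces, and passwords are omitted and assumed to be identical on both units):
 
On unit A:
 
config system ha
    set priority 200
    set override disable
end
 
On unit B:
 
config system ha
    set priority 100
    set override disable
end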
 
Timeline:
  • t = 0 s: A and B have just booted.
    • The HA uptime difference is less than 5 minutes. As a consequence, it is ignored in the Primary election process.
    • A is promoted to Primary because its priority is higher than B's (200 > 100).

 

  • t = 1 min: A is rebooted.
    • A leaves the cluster but re-joins it as Primary after 2 minutes.
      This is expected because the HA uptime difference between A and B is still less than 5 minutes.
    • As a result, the HA aging condition is ignored in the election algorithm (and A's priority trumps B's priority).

  • t = 15 min: A is rebooted again.
    • This time, A rejoins the cluster as Secondary because the HA uptime difference between A and B is now greater than 5 minutes.

  • The status is now: B=Primary, A=Secondary.

  • t = later, during a maintenance window.
    • The administrator wishes to have the preferred Primary, A, back as the cluster Primary.
    • The administrator connects to B's (the current Primary) CLI and issues the following command:

 

diag sys ha reset-uptime

 

  • This resets B's internal HA uptime, making A the unit with the highest HA uptime.
  • A is promoted to Primary.
  • B is demoted to Secondary (this can be verified as shown below).
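 
After the reset, the new roles can be confirmed from either unit's CLI. The exact layout of the output differs between FortiOS versions, but the Primary and Secondary roles and the HA uptime of each member are shown:
 
get system ha status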
 
Note:
The default 5-minute 'HA uptime difference' margin is configurable.
It is possible to use 'set ha-uptime-diff-margin' (under 'config system ha') for this; the default remains 5 minutes.
 
The 5-minute margin can be lowered, but in case of a failure that causes the units to repeatedly fail over, a lower value may not leave enough time to access the unit and remedy the situation.
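 
For reference, a minimal sketch of adjusting this margin is shown below; the value is expressed in seconds, and 300 seconds corresponds to the 5-minute default:
 
config system ha
    set ha-uptime-diff-margin 300
end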
 
Why there is a 5-minute margin in the Primary election process:
When two units are booted at more or less the same time, the one with the highest priority should be elected Primary.
This is needed to consistently have the same unit elected as Primary.
 
This is also desirable in a virtual cluster context, where virtual cluster 2 is expected to be Primary on the Secondary blade.
For this, a time window is needed during which the HA aging condition is ignored.

 

How to check the HA uptime difference between members:

 

diagnose sys ha dump-by group
'FGVM16TM24000014': ha_prio/o=0/0, link_failure=0, pingsvr_failure=0, flag=0x00000001, mem_failover=0, uptime/reset_cnt=407/0 <- '407' is the uptime difference, measured in seconds.
'FGVM16TM24000037': ha_prio/o=1/1, link_failure=0, pingsvr_failure=0, flag=0x00000000, mem_failover=0, uptime/reset_cnt=0/2 <- '0' is for the device with the lowest HA uptime and '2' is the number of times HA uptime has been reset for this device.

 

The above shows how to identify the HA uptime difference between members. The member with 0 in the uptime column is the device with the lowest HA uptime. The example shows that the device with the serial number ending in 14 has an HA uptime that is 407 seconds higher than that of the other device in the cluster. The reset_cnt column indicates the number of times the HA uptime has been reset for that device.

 

To confirm the HA override setting:

 

sh system ha | grep override
    set override enable

 

Related articles:
Technical Tip: How to use failover flag to change Active unit

Technical Tip: Different options to trigger an HA failover (FGCP)