Technical Tip: Restoring HA Primary role after a failover using 'diagnose sys ha reset-uptime' (HA 'set override disable' context)
Description
Scope
FortiGate.
Solution
With HA override enabled, the cluster's Primary is normally the unit with the highest priority, so the Primary is always the same unit and is easy to identify. This has drawbacks:
- If the Primary fails and then recovers, a double failover occurs. The first failover is expected: the other unit takes over. The second happens later, when the recovered unit reclaims the Primary role because override is enabled. To avoid this, configure HA with 'set override disable'.
- If the cluster monitors a link that is flapping on one node but stable on the other, failovers happen repeatedly, possibly cutting network access entirely.
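The two cases above can be sketched with a small Python model. This is a simplified illustration, not FortiOS logic: it assumes override is enabled, so the highest-priority healthy unit always becomes Primary. The unit names 'A' and 'B', the priorities, and the function names are hypothetical.

```python
# Simplified model of Primary selection with override ENABLED.
# Assumption: among healthy units, the highest priority always wins.
PRIORITIES = {"A": 200, "B": 100}  # hypothetical priorities


def primary_history(healthy_states):
    """Return the Primary chosen for each successive set of healthy units."""
    return [max(healthy, key=PRIORITIES.get) for healthy in healthy_states]


def count_failovers(history):
    """Count how many times the Primary role moves from one unit to another."""
    return sum(1 for prev, cur in zip(history, history[1:]) if prev != cur)


# Case 1: A fails once and then recovers.
single_failure = primary_history([{"A", "B"}, {"B"}, {"A", "B"}])

# Case 2: the monitored link on A flaps twice while B stays stable.
flapping_link = primary_history(
    [{"A", "B"}, {"B"}, {"A", "B"}, {"B"}, {"A", "B"}]
)

print(single_failure, count_failovers(single_failure))
print(flapping_link, count_failovers(flapping_link))
```

In this model, a single failure and recovery of A produces two role changes, and every flap of the monitored link adds two more.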
Example scenario with a cluster of two units, A and B.
FortiGate A:
- Is the preferred Primary.
- Has a priority of 200.
- Is configured with HA override disabled.
FortiGate B:
- Is the preferred Secondary.
- Has a priority of 100.
- Is configured with HA override disabled.
- t = 0 s: A and B have just booted.
- The HA uptime difference is less than 5 minutes, so it is ignored in the Primary election process.
- A is promoted to Primary because its priority is higher than B's (200 > 100).
- t = 1 min: A is rebooted.
- A leaves the cluster but rejoins it as Primary after 2 minutes.
- This is expected because the HA uptime difference between A and B is less than 5 minutes. As a result, the HA aging condition is ignored in the election algorithm, and A's priority trumps B's.
- t = 15 min: A is rebooted again.
- This time, A rejoins the cluster as Secondary because the HA uptime difference between A and B is greater than 5 minutes.
- The status is now: B = Primary, A = Secondary.
- t = later, during a maintenance window:
- The administrator wants the preferred Primary, A, back as the cluster Primary.
- The administrator connects to the CLI of B (the current Primary) and issues the following command:
diagnose sys ha reset-uptime
- This resets B's internal HA uptime, making A the unit with the highest HA uptime.
- A is promoted to Primary.
- B is demoted to Secondary.
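The timeline above can be replayed with a minimal Python sketch. It is a simplified model under stated assumptions: override is disabled, only HA uptime and priority are compared, and the 5-minute margin is taken from the walkthrough. Other election criteria (such as monitored interfaces or serial number) are left out. The names Unit, elect_primary, and AGE_MARGIN_S are hypothetical, and the uptime values after each reboot are illustrative.

```python
from dataclasses import dataclass

AGE_MARGIN_S = 300  # 5-minute margin used in the walkthrough above


@dataclass
class Unit:
    name: str
    priority: int
    ha_uptime_s: float  # seconds since HA uptime last started or was reset


def elect_primary(a: Unit, b: Unit) -> Unit:
    """Simplified two-unit election with override disabled."""
    diff = a.ha_uptime_s - b.ha_uptime_s
    if abs(diff) > AGE_MARGIN_S:
        # HA uptime difference exceeds the margin: the older unit wins.
        return a if diff > 0 else b
    # Difference is within the margin, so it is ignored: priority decides.
    return a if a.priority >= b.priority else b


# t = 0 s: both units just booted.
at_boot = elect_primary(Unit("A", 200, 0), Unit("B", 100, 0))

# t = 3 min: A rejoins after its first reboot; B has been up for 3 minutes.
after_first_reboot = elect_primary(Unit("A", 200, 0), Unit("B", 100, 180))

# t = 17 min: A rejoins after its second reboot; B has been up for 17 minutes.
after_second_reboot = elect_primary(Unit("A", 200, 0), Unit("B", 100, 1020))

# Maintenance window: B's HA uptime is reset to 0 while A has been up for 1 hour.
after_reset = elect_primary(Unit("A", 200, 3600), Unit("B", 100, 0))

for label, winner in [
    ("boot", at_boot),
    ("first reboot", after_first_reboot),
    ("second reboot", after_second_reboot),
    ("after reset-uptime on B", after_reset),
]:
    print(f"{label}: Primary = {winner.name}")
```

The last step only models the effect of the reset: setting B's HA uptime to 0 makes A older by more than the margin, so A wins regardless of priority.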
To check the HA uptime difference between members:
diagnose sys ha dump-by group
'FGVM16TM24000014': ha_prio/o=0/0, link_failure=0, pingsvr_failure=0, flag=0x00000001, mem_failover=0, uptime/reset_cnt=407/0 <- '407' is the HA uptime difference, measured in seconds.
'FGVM16TM24000037': ha_prio/o=1/1, link_failure=0, pingsvr_failure=0, flag=0x00000000, mem_failover=0, uptime/reset_cnt=0/2 <- '0' marks the device with the lowest HA uptime, and '2' is the number of times the HA uptime has been reset on this device.
The output above identifies the HA uptime difference between members. The member with 0 in the uptime field is the device with the lowest HA uptime. In this example, the device with the serial number ending in 14 has an HA uptime that is 407 higher than that of the other device in the cluster. The reset_cnt field shows how many times the HA uptime has been reset on that device.
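When the output needs to be checked across several clusters, the uptime/reset_cnt field can be extracted with a short script. The following Python sketch assumes the line format shown in the sample above. The function name parse_ha_uptime and the regular expression are illustrative, not part of any Fortinet tool.

```python
import re

# Matches the quoted serial number and the uptime/reset_cnt pair on each line.
LINE_RE = re.compile(
    r"'(?P<serial>[^']+)':.*?uptime/reset_cnt=(?P<uptime>\d+)/(?P<reset_cnt>\d+)"
)


def parse_ha_uptime(output: str) -> dict[str, tuple[int, int]]:
    """Map each serial number to its (uptime difference, reset count)."""
    result = {}
    for line in output.splitlines():
        match = LINE_RE.search(line)
        if match:
            result[match.group("serial")] = (
                int(match.group("uptime")),
                int(match.group("reset_cnt")),
            )
    return result


sample = """\
'FGVM16TM24000014': ha_prio/o=0/0, link_failure=0, pingsvr_failure=0, flag=0x00000001, mem_failover=0, uptime/reset_cnt=407/0
'FGVM16TM24000037': ha_prio/o=1/1, link_failure=0, pingsvr_failure=0, flag=0x00000000, mem_failover=0, uptime/reset_cnt=0/2
"""

parsed = parse_ha_uptime(sample)
# The unit with the largest uptime value is the one with the highest HA uptime.
oldest = max(parsed, key=lambda serial: parsed[serial][0])
print(parsed)
print("Highest HA uptime:", oldest)
```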
To confirm the HA override setting:
show system ha | grep override
set override enable
- 'reset-uptime' will reset the HA uptime on the unit where this command is run.
- 'reset-uptime-primary-only' will take effect only if it is applied on the primary unit.
The following article discusses this command in detail: Troubleshooting Tip: FortiGate behavior when executing the command 'diagnose sys ha reset-uptime-primary-only' on HA units.
Related articles:
Technical Tip: How to use failover flag to change Active unit
Technical Tip: Different options to trigger an HA failover (FGCP)
Technical Tip: FortiGate HA Primary unit selection process when override is disabled vs enabled
