Technical Tip: Troubleshooting unexpected High Availability (HA) failover

lcamilo · 11-07-2022

Description

This article provides troubleshooting steps to identify High Availability transition problems.

Scope

FortiGate on High Availability clusters.

Solution

Keeping in mind how the FGCP election process works (see Primary Unit selection with override disabled, which is the default), it is sometimes necessary to collect details to troubleshoot expected or unexpected cluster transitions. This article provides several commands to help with this process.

This article assumes the override flag is disabled. See the handbook for details on the behavior when override is enabled (Primary Unit selection with override enabled).

In a VDOM environment, all of the commands below are run under 'config global'.

Start with the following console command:

get system ha status

Pay attention to the information close to the top, which shows any warnings related to the cluster.

Note the last four HA history events and their timestamps, which give the reasons for the most recent HA transitions (the next command shows more events).

Check if the cluster is "in sync" and when the last synchronization happened.

Check the heartbeat interface counters for errors or status changes such as 'down' interfaces. Close to the bottom, confirm the roles of the Primary and Secondary units by hostname.

Check the history of the election process by running the following command:

diag sys ha history read

The history is limited to 512 entries and persists across reboots. Each unit keeps track of its own history of events. The history can be cleared manually; otherwise, new events overwrite the oldest ones.
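The 512-entry limit behaves like a fixed-size buffer. The following Python sketch only illustrates that behavior, assuming oldest-first eviction as described above; it is not how FortiOS implements the history, and the event strings are made up.

```python
from collections import deque

HISTORY_LIMIT = 512  # entry limit stated in this article

# A deque with maxlen drops the oldest item when a new one is appended
# to a full buffer.
history = deque(maxlen=HISTORY_LIMIT)

# Hypothetical events: append more entries than the buffer can hold.
for n in range(600):
    history.append(f"event {n}")

print(len(history))  # 512
print(history[0])    # event 88  (events 0..87 were overwritten)
print(history[-1])   # event 599
```

The practical consequence is that on a cluster with frequent link changes, older transitions may no longer be present on a unit, so it helps to collect the history from every member soon after an incident.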

Pay attention to 'link status changes' (0=down, 1=up) on monitored interfaces, as they might trigger the election algorithm. LAG (aggregate) interfaces are deemed 'down' if all LAG members go down.

The LAG interface status behavior can be adjusted with the 'min-links' setting, described in the LAG min-links CLI guide (see Related documents).
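The LAG status rule can be sketched in Python. This is an illustration only, not FortiOS code: the function name lag_is_up and its parameters are hypothetical, and it assumes 'min-links' is the minimum number of member links that must be up for the aggregate to count as up (with a value of 1 reproducing the "down only when all members are down" behavior).

```python
def lag_is_up(member_states, min_links=1):
    """Return True if the aggregate interface counts as 'up'.

    member_states: list of per-member link states, 1=up and 0=down
    min_links: hypothetical parameter mirroring the 'min-links' setting
    """
    up_members = sum(1 for state in member_states if state == 1)
    return up_members >= min_links


# With min_links=1 the LAG is down only when every member is down.
print(lag_is_up([1, 0, 0]))               # True: one member is still up
print(lag_is_up([0, 0, 0]))               # False: all members are down
print(lag_is_up([1, 0, 0], min_links=2))  # False: fewer than 2 members up
```

A higher min-links value makes a monitored LAG report 'down' earlier, which in turn can make a cluster transition more likely when individual members fail.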

Check Link monitor, interfaces, and Age by running the following command:

diag sys ha dump-by group

Note:

linkfails=35 shows the total number of 'down' interfaces for that serial number.
The 2(work) line in the output is another easy way to tell whether this unit is master (state=2) or slave (state=3).
mondev: shows brief information for each monitored device and its status (1=up and 0=down).
The link_failure count can also be checked with 'diag sys ha dump-by vcluster'.
The cluster uptime is the last information shown: (uptime cnt=35, reset 3) gives the cluster uptime along with the number of times it was reset. Uptime is reset whenever a mondev status changes.

When the system boots up and any monitored interfaces are down, the link_failure count increments by 50 for each interface in the 'down' state. For instance, if 3 interfaces are currently down, link_failure will equal 150. If both HA nodes boot up at the same time, the election process takes place and the system with the lowest link_failure count is preferred as the master.

While the cluster might select the unit with the fewest failed monitored interfaces while booting up, Age (uptime) is only considered after the 'ha-uptime-diff-margin' (also known as 'grace time') has elapsed.

Age and link_failure will only trigger cluster transitions after the cluster boots up and has been up for more than the ha-uptime-diff-margin (which is 300 seconds, or 5 minutes, by default).

If the list of monitored interfaces is updated during cluster operation, the link_failure count is reset to reflect the current status (up or down) of the monitored interfaces. For instance, if there were 3 down interfaces before (link_failure=150) and 2 of them are removed from the list, then link_failure=50, as there is still one down interface being monitored.
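The link_failure arithmetic described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the FGCP implementation: the function names, the port names, and the 'unit-A'/'unit-B' identifiers are hypothetical, and only the 50-per-down-interface rule and the "lowest link_failure is preferred" rule come from this article.

```python
LINK_FAILURE_PENALTY = 50  # added per monitored interface that is down


def link_failure(monitored, link_state):
    """Compute the link_failure value for one unit.

    monitored: list of monitored interface names
    link_state: dict mapping interface name -> 1 (up) or 0 (down)
    """
    down = [name for name in monitored if link_state[name] == 0]
    return LINK_FAILURE_PENALTY * len(down)


def preferred_at_boot(units):
    """Return the unit with the lowest link_failure.

    units: dict mapping a unit identifier -> link_failure value.
    On a tie this simply returns the first entry; the real election
    falls through to further criteria that are not modeled here.
    """
    return min(units, key=units.get)


# Hypothetical link states: three monitored ports down, one up.
state = {"port1": 0, "port2": 0, "port3": 0, "port4": 1}

print(link_failure(["port1", "port2", "port3"], state))  # 150
# After removing port2 and port3 from the monitored list:
print(link_failure(["port1"], state))                     # 50

print(preferred_at_boot({"unit-A": 150, "unit-B": 50}))   # unit-B
```

Recomputing the value from the current monitored list, rather than adjusting a running counter, mirrors the reset behavior described for interface monitor changes.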

diag sys ha mac

In the output, note which interfaces are currently down (=1) and up (=0) on both cluster members.

In active-passive HA, a subordinate unit has no vmac information until it becomes the master.

To reset the uptime manually, run the following command:

diag sys ha reset-uptime

When resetting the uptime manually, a cluster transition may occur.

After the reset, the output of 'diag sys ha history read' will show the following events:

<timestamp> FG800D3916801158 is elected as the cluster primary of 2 member
<timestamp> user="admin" ui=ssh(10.10.10.1) msg="Reset HA uptime"

Also, the output of 'diag sys ha dump-by group' or 'diag sys ha dump-by vcluster' will show an incremented 'reset_cnt' and an uptime count reset to zero.

'FG800D3916800747': ha_prio/o=1/1, link_failure=50, pingsvr_failure=0, flag=0x00000000, mem_failover=0, uptime/reset_cnt=0/4
'FG800D3916801158': ha_prio/o=0/0, link_failure=50, pingsvr_failure=0, flag=0x00000001, mem_failover=0, uptime/reset_cnt=349084/1
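When comparing these values across members, the per-unit lines can be parsed with a short script. The Python sketch below is an assumption-laden helper, not a Fortinet tool: the function name parse_member_line is hypothetical, and the regular expression is written only against the two sample lines above, so the output format of other FortiOS versions may need a different pattern.

```python
import re

# Pattern written against the sample lines in this article only.
LINE_RE = re.compile(
    r"'(?P<serial>[^']+)':.*?"
    r"link_failure=(?P<link_failure>\d+),.*?"
    r"uptime/reset_cnt=(?P<uptime>\d+)/(?P<reset_cnt>\d+)"
)


def parse_member_line(line):
    """Extract serial, link_failure, uptime and reset_cnt from one line.

    Returns a dict, or None if the line does not match the pattern.
    """
    match = LINE_RE.search(line)
    if match is None:
        return None
    return {
        "serial": match.group("serial"),
        "link_failure": int(match.group("link_failure")),
        "uptime": int(match.group("uptime")),
        "reset_cnt": int(match.group("reset_cnt")),
    }


sample = [
    "'FG800D3916800747': ha_prio/o=1/1, link_failure=50, pingsvr_failure=0, "
    "flag=0x00000000, mem_failover=0, uptime/reset_cnt=0/4",
    "'FG800D3916801158': ha_prio/o=0/0, link_failure=50, pingsvr_failure=0, "
    "flag=0x00000001, mem_failover=0, uptime/reset_cnt=349084/1",
]

for line in sample:
    print(parse_member_line(line))
```

In the sample, the first unit shows uptime 0 with reset_cnt 4, which is consistent with a recent uptime reset on that member.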

To reset health-status manually, run the following command:

diag sys ha reset-health-status

This command clears error statuses related to other cluster members when they are removed or re-added. For more details, see 'diag sys ha reset-health-status' under Related documents.

Note:

If there are no redundant WAN links, internet access can be interrupted during the impact window.
It is possible to see the event logs for WAN flapping if that is the only interface that is set as a monitored interface in the HA configuration.
It is possible to see the events in system event logs as well as in the output of the 'diag sys ha history read'.

(Screenshot: WAN flapping events.)

Conclusion:

Cluster transitions may occur under some operational circumstances or when manual changes are applied to the FortiGate HA settings or on network devices. Always re-run the test booklet after applying changes to ensure the designed topology is still working as expected.

Related documents:

Primary Unit selection with override disabled

Primary Unit selection with override enabled

FOS 7.2.2 - LAG min-links cli guide

diag sys ha reset-health-status

