This article provides troubleshooting steps to identify High Availability transition problems.
It applies to FortiGate units in High Availability (HA) clusters.
Given how the FGCP election process works, it is sometimes necessary to collect details to troubleshoot expected or unexpected cluster transitions. This article provides several commands to help with this process.
Start with the following console command:
# get system ha status
Pay attention to the information close to the top, which shows any warnings related to the cluster.
Note the last four HA history events with timestamps, which give the reasons for the most recent HA transitions (the next command shows more events).
Check if the cluster is "in sync" and when the last synchronization happened.
Next, check the heartbeat interface counters for errors or status changes, such as interfaces reported as "down".
Close to the bottom, confirm the roles of the primary and secondary units by hostname.
Next, check the history of the election process by running the following command:
# diag sys ha history read
The history above is limited to 512 entries and persists across reboots. Each unit keeps its own history of events. The history can be cleared manually; otherwise, once the limit is reached, new events overwrite the oldest ones. Pay attention to 'link status changes', where 0=down and 1=up, because on monitored interfaces they might trigger the election algorithm. LAG and aggregated interfaces are deemed 'down' if all LAG members go down.
Check the link monitor, interfaces and age by running the following command:
# diag sys ha dump-by group
A few things to note:
When the system boots up and any monitored interfaces are down, the link_failure count increments by 50 for each monitored interface in the 'down' state. For instance, if 3 interfaces are currently down, link_failure will equal 150. If both HA nodes boot up at the same time, the election process takes place and the unit with the lowest link_failure count is preferred as the master.
While the cluster might select the unit with the fewest failed monitored interfaces while booting up, Age (uptime) will only be considered after the 'ha-uptime-diff-margin' (AKA 'grace time').
Age and link_failure will only trigger cluster transitions after the cluster boots up and has been up for more than the ha-uptime-diff-margin (which is 300 seconds, or 5 minutes, by default).
If the interface monitor list is updated while the cluster is operating, the link_failure count is reset to reflect the current status (up or down) of the monitored interfaces. For instance, if 3 monitored interfaces were down before (link_failure=150) and 2 of them are removed from the list, then link_failure=50, as one down interface is still being monitored.
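The link_failure arithmetic in the notes above can be sketched in a few lines of Python. This is an illustration only, not FortiOS code: the function names, the dictionary layout and the interface names are hypothetical. The only values taken from this article are the penalty of 50 per down monitored interface and the preference for the unit with the lowest count.

```python
# Illustrative sketch only: names and data layout are hypothetical,
# not FortiOS internals.
LINK_FAILURE_PENALTY = 50  # per monitored interface that is down (from this article)


def link_failure(monitored):
    """Return the link_failure value for a dict of {interface_name: is_up}."""
    down = sum(1 for is_up in monitored.values() if not is_up)
    return down * LINK_FAILURE_PENALTY


def preferred_unit(units):
    """Return the serial with the lowest link_failure, or None on a tie.

    A tie is left undecided here because other election criteria then apply.
    """
    scores = {serial: link_failure(mon) for serial, mon in units.items()}
    lowest = min(scores.values())
    winners = [serial for serial, value in scores.items() if value == lowest]
    return winners[0] if len(winners) == 1 else None


# Three monitored interfaces down.
before = {"port1": False, "port2": False, "port3": False, "port4": True}
# Two of the down interfaces removed from the monitor list.
after = {"port3": False, "port4": True}

print(link_failure(before))  # 150
print(link_failure(after))   # 50
```

The sketch deliberately ignores age, priority and the ha-uptime-diff-margin, so it should not be read as the full election algorithm.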
# diag sys ha mac
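Note which interfaces are currently down (=1) and up (=0) on both cluster members. In this output, 1 indicates a failed link, which is the opposite of the link status values in the HA history.
In active-passive HA, a subordinate unit does not have vmac (virtual MAC) information until it becomes the master.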
To reset the uptime manually, run the following command:
# diag sys ha reset-uptime
When resetting the uptime manually, a cluster transition may occur.
The 'diag sys ha history read' output will then show events such as the following:
<timestamp> FG800D3916801158 is elected as the cluster primary of 2 member
<timestamp> user="admin" ui=ssh(10.10.10.1) msg="Reset HA uptime"
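As a rough illustration, saved output from 'diag sys ha history read' could be filtered for these two event types with a short Python script. The patterns below are derived only from the two sample lines above, so they are assumptions about the format and may need adjusting for other FortiOS versions. The function names are hypothetical.

```python
import re

# Pattern derived from the sample election message above (an assumption
# about the format, not a documented specification).
ELECTED_RE = re.compile(
    r"(?P<serial>\S+) is elected as the cluster primary of (?P<count>\d+) member"
)


def elected_primaries(history_text):
    """Return (serial, member_count) for each election event, in order."""
    return [
        (m.group("serial"), int(m.group("count")))
        for m in ELECTED_RE.finditer(history_text)
    ]


def uptime_resets(history_text):
    """Return the lines that record a manual HA uptime reset."""
    return [
        line for line in history_text.splitlines()
        if 'msg="Reset HA uptime"' in line
    ]


# The timestamps are left as placeholders, exactly as in the article.
SAMPLE = (
    "<timestamp> FG800D3916801158 is elected as the cluster primary of 2 member\n"
    '<timestamp> user="admin" ui=ssh(10.10.10.1) msg="Reset HA uptime"\n'
)

print(elected_primaries(SAMPLE))
print(len(uptime_resets(SAMPLE)))
```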
In addition, the output of 'diag sys ha dump-by group' or 'dump-by vcluster' will show an incremented 'reset_cnt' and an uptime count reset to zero for the affected unit:
'FG800D3916800747': ha_prio/o=1/1, link_failure=50, pingsvr_failure=0, flag=0x00000000, mem_failover=0, uptime/reset_cnt=0/4
'FG800D3916801158': ha_prio/o=0/0, link_failure=50, pingsvr_failure=0, flag=0x00000001, mem_failover=0, uptime/reset_cnt=349084/1
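When comparing units across several captures, it can help to extract these fields programmatically. The following Python sketch parses member lines in the format shown above. The regular expression is built only from those two sample lines, so treat it as an assumption about the output format rather than a guaranteed one. The function name is hypothetical.

```python
import re

# Pattern based on the two sample lines above; other FortiOS versions
# may format this output differently.
MEMBER_RE = re.compile(
    r"'(?P<serial>[^']+)':\s*ha_prio/o=(?P<ha_prio>\d+)/(?P<ha_prio_o>\d+),\s*"
    r"link_failure=(?P<link_failure>\d+),\s*"
    r"pingsvr_failure=(?P<pingsvr_failure>\d+),\s*"
    r"flag=(?P<flag>0x[0-9a-fA-F]+),\s*"
    r"mem_failover=(?P<mem_failover>\d+),\s*"
    r"uptime/reset_cnt=(?P<uptime>\d+)/(?P<reset_cnt>\d+)"
)


def parse_members(text):
    """Return {serial: fields}; numeric fields become int, flag stays a string."""
    members = {}
    for match in MEMBER_RE.finditer(text):
        fields = match.groupdict()
        serial = fields.pop("serial")
        flag = fields.pop("flag")
        parsed = {name: int(value) for name, value in fields.items()}
        parsed["flag"] = flag
        members[serial] = parsed
    return members


SAMPLE = (
    "'FG800D3916800747': ha_prio/o=1/1, link_failure=50, pingsvr_failure=0, "
    "flag=0x00000000, mem_failover=0, uptime/reset_cnt=0/4\n"
    "'FG800D3916801158': ha_prio/o=0/0, link_failure=50, pingsvr_failure=0, "
    "flag=0x00000001, mem_failover=0, uptime/reset_cnt=349084/1\n"
)

for serial, info in parse_members(SAMPLE).items():
    print(serial, info["link_failure"], info["uptime"], info["reset_cnt"])
```

In this sample, the unit whose uptime was just reset shows uptime 0 and a reset_cnt of 4, while the other unit keeps its accumulated uptime.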
To reset health-status manually, run the following command:
# diag sys ha reset-health-status
This command clears error statuses related to other cluster members when they are removed or re-added.
Cluster transitions may occur under some operational circumstances or when manual changes are applied to the FortiGate HA settings or on network devices. Always re-run the test booklet after applying changes to ensure the designed topology is still working as expected.