FortiGate
lcamilo
Staff

Description

 

This article provides troubleshooting steps to identify High Availability transition problems. 

Scope

 

FortiGate on High Availability clusters.

 

Solution

 

Even with an understanding of how the FGCP election process works (described in the Fortinet documentation), it is sometimes necessary to collect details in order to troubleshoot expected or unexpected cluster transitions. This article provides several commands to help with this process.

This article assumes the override flag is disabled. See the handbook for details on when the override is enabled. (Primary Unit selection with override disabled.)

 

Start with the following console command:

 

# get system ha status

 

[Screenshot: example output of 'get system ha status']

 

Pay attention to the information close to the top, which shows any warnings related to the cluster. 

Notice the last four HA events, each with a timestamp and the reason for the transition (the next command shows more events).

Check if the cluster is "in sync" and when the last synchronization happened. 

Next, check the heartbeat interface counters for errors or status changes like "down" interfaces. 

Close to the bottom, confirm by hostname which unit is Primary and which is Secondary.

 

Next, check the history of the election process by running the following command: 

 

# diag sys ha history read

 

[Screenshot: example output of 'diag sys ha history read']

 

The history above is limited to 512 entries and persists across reboots. Each unit keeps its own event history. It can be cleared manually; otherwise, once the limit is reached, new entries overwrite the oldest ones. Pay attention to 'link status changes' entries (0=down, 1=up), which might trigger the election algorithm for monitored interfaces. LAG and aggregated interfaces are deemed 'down' only if all LAG members go down.

The LAG interface status behavior can be adjusted with the "min-links" setting.
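
As a rough illustration of the LAG rule above, the following Python sketch models when an aggregate would be reported as 'down'. The function name lag_is_up and the min_links parameter are hypothetical and are not FortiOS APIs; the sketch assumes that "min-links" acts as the minimum number of members that must be up, which should be confirmed against the FortiOS documentation for the version in use.

```python
def lag_is_up(member_states, min_links=1):
    """Return True if the modelled aggregate counts as 'up'.

    member_states: list with one entry per LAG member, 1 = up and 0 = down
    (the same convention as the HA history output).
    min_links: ASSUMED threshold of members that must be up; a hypothetical
    stand-in for the "min-links" setting, not a FortiOS API.
    """
    members_up = sum(1 for state in member_states if state == 1)
    return members_up >= min_links


if __name__ == "__main__":
    # With the threshold at 1, the aggregate is down only when every member is down.
    print(lag_is_up([1, 0, 0]))
    print(lag_is_up([0, 0, 0]))
    # Raising the assumed threshold marks the aggregate down earlier.
    print(lag_is_up([1, 0, 0], min_links=2))
```

With min_links=1 the model matches the default behavior described above: a single surviving member keeps the aggregate up.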

 

Check Link monitor, interfaces and Age by running the following command:

 

# diag sys ha dump-by group

[Screenshot: example output of 'diag sys ha dump-by group', with highlighted lines]

 

A few things to note:

  • linkfails=35 shows the total number of 'down' interfaces on that serial number.
  • The second highlighted line, 2(work), is another quick way to tell whether this unit is the master (state=2) or the slave (state=3).
  • mondev: shows brief information for each monitored device and its status (1=up and 0=down).
  • The link_failure count can also be checked with "diag sys ha dump-by vcluster".
  • Cluster uptime is the last item shown: (uptime cnt=35, reset 3) gives the cluster uptime along with the number of times it was reset. Uptime is reset whenever a mondev status changes.

 

If any monitored interfaces are down when the system boots up, the link_failure count increases by 50 for each interface in the 'down' state. For instance, if 3 interfaces are currently down, link_failure will equal 150. If both HA nodes boot up at the same time, the election process takes place and the system with the lowest link_failure count is preferred as the master.
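
The arithmetic above can be sketched in a few lines of Python. This is only a model of the rule as stated in this article (50 per monitored interface that is down, lowest total preferred); the function names and the unit labels UNIT-A and UNIT-B are hypothetical, and other election criteria are deliberately left out.

```python
LINK_FAILURE_PENALTY = 50  # added per monitored interface that is down, per this article


def link_failure_count(monitored_states):
    """monitored_states: dict of interface name -> 1 (up) or 0 (down)."""
    down = sum(1 for state in monitored_states.values() if state == 0)
    return down * LINK_FAILURE_PENALTY


def preferred_by_link_failure(units):
    """units: dict of unit label -> monitored_states.

    Returns the sorted labels of the units with the lowest link_failure
    count. More than one label means this criterion alone does not decide.
    """
    counts = {label: link_failure_count(states) for label, states in units.items()}
    lowest = min(counts.values())
    return sorted(label for label, count in counts.items() if count == lowest)


if __name__ == "__main__":
    cluster = {
        "UNIT-A": {"port1": 0, "port2": 0, "port3": 0},  # three interfaces down
        "UNIT-B": {"port1": 1, "port2": 0, "port3": 1},  # one interface down
    }
    for label, states in cluster.items():
        print(label, link_failure_count(states))
    print(preferred_by_link_failure(cluster))
```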

 

While the cluster might select the unit that has the fewest failed monitored interfaces while booting up, Age (uptime) is considered only after the 'ha-uptime-diff-margin' (also known as 'grace time') has elapsed.

Age and link_failure will only trigger cluster transitions after the cluster boots up and has been up for more than the ha-uptime-diff-margin (which is 300 seconds, or 5 minutes, by default).
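
The grace-time rule above can be expressed as a small Python check. This is a minimal sketch that follows the wording of this article only: the function names are hypothetical, and the 300-second default is the value quoted above.

```python
DEFAULT_MARGIN_S = 300  # default ha-uptime-diff-margin in seconds, per this article


def grace_remaining(cluster_uptime_s, margin_s=DEFAULT_MARGIN_S):
    """Seconds left before the modelled grace time has elapsed (never negative)."""
    return max(0, margin_s - cluster_uptime_s)


def transitions_enabled(cluster_uptime_s, margin_s=DEFAULT_MARGIN_S):
    """True once the cluster has been up for MORE than the margin."""
    return cluster_uptime_s > margin_s


if __name__ == "__main__":
    for uptime in (120, 300, 301):
        print(uptime, grace_remaining(uptime), transitions_enabled(uptime))
```

A check like this can help when reading timestamps in the HA history: a transition that did not happen shortly after boot may simply fall inside the grace time.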

 

If the interface monitor list is updated while the cluster is operating, the link_failure count is reset to reflect the current status (up or down) of the monitored interfaces. For instance, if 3 interfaces were down before (link_failure=150) and 2 of them are removed from the list, then link_failure=50, as there is still one down interface being monitored.
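
The example above can be reproduced with a short Python sketch. It only models the recount as described in this article (50 per monitored interface that is down); the function name and the port names are hypothetical.

```python
def recount_link_failure(interface_states, monitored):
    """Recompute the modelled link_failure value for the current monitor list.

    interface_states: dict of interface name -> 1 (up) or 0 (down).
    monitored: iterable of interface names currently in the monitor list.
    Interfaces that are not monitored do not contribute to the count.
    """
    down = sum(1 for name in monitored if interface_states.get(name) == 0)
    return 50 * down


if __name__ == "__main__":
    states = {"port1": 0, "port2": 0, "port3": 0}  # all three interfaces are down
    print(recount_link_failure(states, ["port1", "port2", "port3"]))  # before the change
    print(recount_link_failure(states, ["port1"]))                    # two removed from the list
```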

 

# diag sys ha mac

 

Notice which interfaces are currently down (=1) and up (=0) on both cluster members. 

In HA active-passive mode, a subordinate unit will not show virtual MAC (vmac) information until it becomes the master.

[Screenshot: example output of 'diag sys ha mac']

 

To reset the uptime manually, run the following command:

 

# diag sys ha reset-uptime

 

Resetting the uptime manually may cause a cluster transition.

Afterwards, 'diag sys ha history read' shows events similar to the following:

 

<timestamp> FG800D3916801158 is elected as the cluster primary of 2 member
<timestamp> user="admin" ui=ssh(10.10.10.1) msg="Reset HA uptime"

 

In addition, 'diag sys ha dump-by group' or 'dump-by vcluster' will show that 'reset_cnt' has been incremented and that the uptime count is back to zero on the unit that was reset.

 

'FG800D3916800747': ha_prio/o=1/1, link_failure=50, pingsvr_failure=0, flag=0x00000000, mem_failover=0, uptime/reset_cnt=0/4
'FG800D3916801158': ha_prio/o=0/0, link_failure=50, pingsvr_failure=0, flag=0x00000001, mem_failover=0, uptime/reset_cnt=349084/1
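
When comparing these lines across several captures, it can help to parse them into fields. The Python sketch below assumes the line layout is exactly as in the sample output above (a quoted serial number followed by comma-separated key=value pairs); the function name is hypothetical, and other FortiOS versions may format the line differently.

```python
import re

# Quoted serial number, a colon, then the comma-separated key=value fields.
MEMBER_LINE = re.compile(r"^'(?P<serial>[^']+)':\s*(?P<fields>.*)$")


def parse_ha_member_line(line):
    """Parse one member line of the dump output into a dict.

    Raw values are kept as strings; link_failure, uptime and reset_cnt
    are additionally converted to integers for easy comparison.
    """
    match = MEMBER_LINE.match(line.strip())
    if match is None:
        raise ValueError("unexpected line format: %r" % line)
    result = {"serial": match.group("serial")}
    for field in match.group("fields").split(","):
        key, _, value = field.strip().partition("=")
        result[key] = value
    uptime, _, reset_cnt = result["uptime/reset_cnt"].partition("/")
    result["uptime"] = int(uptime)
    result["reset_cnt"] = int(reset_cnt)
    result["link_failure"] = int(result["link_failure"])
    return result


if __name__ == "__main__":
    sample = ("'FG800D3916800747': ha_prio/o=1/1, link_failure=50, "
              "pingsvr_failure=0, flag=0x00000000, mem_failover=0, "
              "uptime/reset_cnt=0/4")
    member = parse_ha_member_line(sample)
    print(member["serial"], member["link_failure"], member["uptime"], member["reset_cnt"])
```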

 

To reset health-status manually, run the following command:

 

# diag sys ha reset-health-status

 

This command clears error statuses related to other cluster members when they are removed or re-added.

 

Conclusion: 

Cluster transitions may occur under some operational circumstances, or when manual changes are applied to the FortiGate HA settings or to network devices. Always re-run the test plan after applying changes to ensure the designed topology still works as expected.

 
