Created on 06-26-2019 05:41 AM
Edited on 09-19-2023 09:08 AM
By Anthony_E
Description
This article describes how to troubleshoot HA synchronization issues when a cluster is out of sync.
Scope
FortiGate.
Solution
For a multi-VDOM FortiGate, the following commands must be run from 'config global' mode.
get system ha status <----- Shows detailed HA information and cluster failover reason.
Prim-FW (global) # get sys ha status
HA Health Status: OK
Model: FortiGate-VM64-KVM
Mode: HA A-P
Group: 9
Debug: 0
Cluster Uptime: 14 days 5:9:44
Cluster state change time: 2019-06-13 14:21:15
Master selected using:
<date:02> FGVMXXXXXXXXXX44 is selected as the master because it has the largest value of uptime. <--- This is the reason for the last failover.
<date:01> FGVMXXXXXXXXXX46 is selected as the master because it has the largest value of uptime.
<date:00> FGVMXXXXXXXXXX44 is selected as the master because it has the largest value of override priority.
ses_pickup: enable, ses_pickup_delay=disable
override: disable
Configuration Status:
FGVMXXXXXXXXXX44(updated 3 seconds ago): in-sync
FGVMXXXXXXXXXX46(updated 4 seconds ago): in-sync
System Usage stats:
FGVMXXXXXXXXXX44(updated 3 seconds ago):
sessions=42, average-cpu-user/nice/system/idle=0%/0%/0%/100%, memory=64%
FGVMXXXXXXXXXX46(updated 4 seconds ago):
sessions=5, average-cpu-user/nice/system/idle=0%/0%/0%/100%, memory=54%
HBDEV stats:
FGVMXXXXXXXXXX44(updated 3 seconds ago):
port8: physical/10000full, up, rx-bytes/packets/dropped/errors=2233369747/7606667/0/0, tx=3377368072/8036284/0/0
FGVMXXXXXXXXXX46(updated 4 seconds ago):
port8: physical/10000full, up, rx-bytes/packets/dropped/errors=3377712830/8038866/0/0, tx=2233022661/7604078/0/0
MONDEV stats:
FGVMXXXXXXXXXX44(updated 3 seconds ago):
port1: physical/10000full, up, rx-bytes/packets/dropped/errors=1140991879/3582047/0/0, tx=319625288/2631960/0/0
FGVMXXXXXXXXXX46(updated 4 seconds ago):
port1: physical/10000full, up, rx-bytes/packets/dropped/errors=99183156/1638504/0/0, tx=266853/1225/0/0
Master: Prim-FW , FGVMXXXXXXXXXX44, cluster index = 1
Slave : Bkup-Fw , FGVMXXXXXXXXXX46, cluster index = 0
number of vcluster: 1
vcluster 1: work 169.254.0.2
Master: FGVMXXXXXXXXXX44, operating cluster index = 0
Slave : FGVMXXXXXXXXXX46, operating cluster index = 1
Prim-FW(global)# diag sys ha checksum cluster <--- Shows the checksums for each cluster unit and each VDOM in order to determine where the difference lies.
================== FGVMXXXXXXXXXX44 ==================
is_manage_master()=1, is_root_master()=1
debugzone
global: c5 33 93 23 26 9f 4d 79 ed 5f 29 fa 7a 8c c9 10
root: d3 b5 fc 60 f3 f0 f0 d0 ea e4 a1 7f 1d 17 05 fc
Cust-A: 84 af 8f 23 b5 31 ca 32 c1 0b f2 76 d2 57 d1 aa
all: 04 ae 37 7e dc 84 aa a4 42 3d db 3c a2 09 b0 65
checksum
global: c5 33 93 23 26 9f 4d 79 ed 5f 29 fa 7a 8c c9 10
root: d3 b5 fc 60 f3 f0 f0 d0 ea e4 a1 7f 1d 17 05 fc
Cust-A: 84 af 8f 23 b5 31 ca 32 c1 0b f2 76 d2 57 d1 aa
all: 04 ae 37 7e dc 84 aa a4 42 3d db 3c a2 09 b0 65
================== FGVMXXXXXXXXXX46 ==================
is_manage_master()=0, is_root_master()=0
debugzone
global: c5 33 93 23 26 9f 4d 79 ed 5f 29 fa 7a 8c c9 10
root: d3 b5 fc 60 f3 f0 f0 d0 ea e4 a1 7f 1d 17 05 fc
Cust-A: 84 af 8f 23 b5 31 ca 32 c1 0b f2 76 d2 57 d1 bc
all: 04 ae 37 7e dc 84 aa a4 42 3d db 3c a2 09 b0 60
checksum
global: c5 33 93 23 26 9f 4d 79 ed 5f 29 fa 7a 8c c9 10
root: d3 b5 fc 60 f3 f0 f0 d0 ea e4 a1 7f 1d 17 05 fc
Cust-A: 84 af 8f 23 b5 31 ca 32 c1 0b f2 76 d2 57 d1 bc
all: 04 ae 37 7e dc 84 aa a4 42 3d db 3c a2 09 b0 60
These commands must be collected on both firewalls so that the outputs can be compared; collecting them on a single firewall only is not sufficient (see the related article on how to access the secondary firewall).
Look for a checksum mismatch by comparing the cluster checksums between the two units in the output above.
As shown above, the 'global' and 'root' contexts are synchronized, so the problem does not lie there. However, the checksum for VDOM 'Cust-A' differs between the units; this is what needs to be checked.
When even a single checksum differs, the 'all' checksum will also differ.
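Once the 'diag sys ha checksum cluster' output has been captured to text, the comparison between the two units can be automated. The following is a minimal Python sketch; the parsing assumes the output layout shown above, and the function names are illustrative, not part of FortiOS:

```python
import re

def parse_cluster_checksums(text):
    """Parse 'diag sys ha checksum cluster' output into
    {serial: {context: checksum}}, reading only each unit's
    'checksum' section (not 'debugzone')."""
    units, serial, in_checksum = {}, None, False
    for raw in text.splitlines():
        line = raw.strip()
        header = re.match(r"=+\s*(\S+)\s*=+", line)
        if header:
            # "================== <serial> ==================" banner
            serial, in_checksum = header.group(1), False
            units[serial] = {}
        elif line == "checksum":
            in_checksum = True
        elif line == "debugzone":
            in_checksum = False
        elif in_checksum and serial and ":" in line:
            ctx, _, digest = line.partition(":")
            units[serial][ctx.strip()] = digest.strip()
    return units

def mismatched_contexts(units):
    """Return the contexts (global, root, per-VDOM, all) whose
    checksums differ between the two cluster units."""
    first, second = units.values()  # expects exactly two units
    return sorted(ctx for ctx in first if first.get(ctx) != second.get(ctx))
```

For the sample output above, this would report 'Cust-A' and 'all' as mismatched, matching the manual comparison.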
Issue these commands for a more granular view of a mismatched VDOM:
# diag sys ha checksum show <your_vdom_name>
# diag sys ha checksum show global
For the above example, the relevant output is:
# diag sys ha checksum show Cust-A
Once the mismatched object has been identified on both cluster units, run the following command, replacing <object_name> with the actual object name:
# diag sys ha checksum show Cust-A <object_name>
This shows where within the object the differences are; then inspect that specific place in the configuration on both units.
The grep option can also be used to display checksums for only parts of the configuration.
For example, to display system-related configuration checksums in the root VDOM, or log-related checksums in the global configuration:
# diagnose sys ha checksum show root | grep system
# diagnose sys ha checksum show global | grep log
Remember: repeat the above commands on all devices and compare the outputs for mismatches, then check the corresponding area in the configuration file.
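When comparing the per-object listings from 'diag sys ha checksum show <your_vdom_name>' across the two units, a simple line-based diff is usually enough. A minimal Python sketch, assuming each unit's output was saved to text and that each listing line has an 'object: checksum' shape (an assumption based on the command's per-object output):

```python
def diff_object_checksums(primary_text, secondary_text):
    """Compare two per-object checksum listings, one captured from
    each cluster unit, and return the object names whose checksums
    differ or that exist on only one unit."""
    def to_map(text):
        checksums = {}
        for line in text.splitlines():
            obj, sep, digest = line.strip().partition(":")
            if sep and obj:
                checksums[obj.strip()] = digest.strip()
        return checksums

    primary, secondary = to_map(primary_text), to_map(secondary_text)
    return sorted(obj for obj in set(primary) | set(secondary)
                  if primary.get(obj) != secondary.get(obj))
```

Any object reported here is a candidate for the object-level checksum command above and for a direct comparison in the configuration file.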
If no mismatch is found, a simple re-calculation of the checksums can fix the out-of-sync problem.
The re-calculated checksums should match and the out-of-sync error messages should stop appearing.
The following command is to re-calculate all HA checksums (run on both units):
# diagnose sys ha checksum recalculate
Or, more specific:
# diagnose sys ha checksum recalculate [<your_vdom_name> | global]
Entering the command without options recalculates all checksums. A VDOM name can be specified to just recalculate the checksums for that VDOM. Enter global to recalculate the global checksum. It should match on all devices in the cluster.
Run the following commands to debug HA synchronization:
# diag debug app hasync 255
# diag debug enable
# execute ha synchronize start
# diagnose debug application hatalk -1 <----- Checks the heartbeat communication between HA devices.
Run the following commands to check for a mismatch right away:
# diag debug config-error-log read <-- (1)
# diag hardware device disk <-- (2)
# show sys storage <-- (3)
# show wanopt storage <-- (4)
(1): Check the output to identify configuration lines that were not accepted, and try to configure the listed configuration items manually.
(2): Check the disk on both devices; the size and availability should match.
(3): Check the size of the storage disk; it should match on both devices.
(4): Check the size of the wanopt disk; it should match on both devices.
Copyright 2023 Fortinet, Inc. All Rights Reserved.