vprabhu_FTNT
Staff
Article Id 193422

Description


This article describes how to troubleshoot HA synchronization issues when a cluster is out of sync.

 

Scope

 

FortiGate/FortiProxy.

Solution

 

Ensure both HA units are running the same firmware version. To check the firmware version, run the 'get system status' command. To access the secondary unit via CLI, see Technical Tip: How to access secondary unit of HA cluster via CLI.

 

  1. Review the current HA status. Start by reviewing the current HA status of the FortiGate/FortiProxy from the GUI under System -> HA, or from the CLI with the command below.

 

get system ha status      <----- Shows detailed HA information and the cluster failover reason.

 

Note:

For a multi-VDOM FortiGate/FortiProxy, the above command is used in 'config global' mode.

 

An example of the output of this command is illustrated below with explanations of various sections of the output:


get system ha status
HA Health Status: OK
Model: FortiGate-VM64-KVM
Mode: HA A-P
Group: 9
Debug: 0
Cluster Uptime: 14 days 5:9:44
Cluster state change time: 2019-06-13 14:21:15

 

The Primary is selected using the following:

 

<date:02> FGVMXXXXXXXXXX44 is selected as the Primary because it has the largest value of uptime. <- This is the reason for the last failover.
<date:01> FGVMXXXXXXXXXX46 is selected as the Primary because it has the largest value of uptime.
<date:00> FGVMXXXXXXXXXX44 is selected as the Primary because it has the largest value of override priority.
ses_pickup: enable, ses_pickup_delay=disable
override: disable

 

Configuration Status:

 

FGVMXXXXXXXXXX44(updated 3 seconds ago): in-sync
FGVMXXXXXXXXXX46(updated 4 seconds ago): in-sync

 

System Usage stats:

 

FGVMXXXXXXXXXX44(updated 3 seconds ago):
sessions=42, average-cpu-user/nice/system/idle=0%/0%/0%/100%, memory=64%

FGVMXXXXXXXXXX46(updated 4 seconds ago):
sessions=5, average-cpu-user/nice/system/idle=0%/0%/0%/100%, memory=54%

 

HBDEV stats:

 

FGVMXXXXXXXXXX44(updated 3 seconds ago):
port8: physical/10000full, up, rx-bytes/packets/dropped/errors=2233369747/7606667/0/0, tx=3377368072/8036284/0/0

FGVMXXXXXXXXXX46(updated 4 seconds ago):
port8: physical/10000full, up, rx-bytes/packets/dropped/errors=3377712830/8038866/0/0, tx=2233022661/7604078/0/0

 

MONDEV stats:

 

FGVMXXXXXXXXXX44(updated 3 seconds ago):
port1: physical/10000full, up, rx-bytes/packets/dropped/errors=1140991879/3582047/0/0, tx=319625288/2631960/0/0

FGVMXXXXXXXXXX46(updated 4 seconds ago):
port1: physical/10000full, up, rx-bytes/packets/dropped/errors=99183156/1638504/0/0, tx=266853/1225/0/0

Primary : Prim-FW         , FGVMXXXXXXXXXX44, cluster index = 1
Secondary : Bkup-Fw         , FGVMXXXXXXXXXX46, cluster index = 0
number of vcluster: 1
vcluster 1: work 169.254.0.2
Primary : FGVMXXXXXXXXXX44, operating cluster index = 0
Secondary : FGVMXXXXXXXXXX46, operating cluster index = 1
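
When the output is saved to a text file, the 'Configuration Status' lines can be checked with a short script. The following is a minimal Python sketch, not a Fortinet tool: the function names are hypothetical, and it assumes the line format shown in the example above. Any state other than 'in-sync' is treated as out of sync.

```python
import re

# Matches lines such as "FGVMXXXXXXXXXX44(updated 3 seconds ago): in-sync".
# Lines that end right after the colon (usage/HBDEV/MONDEV headers) do not match.
STATUS_LINE = re.compile(
    r"^(?P<serial>\S+?)\(updated [^)]*\):\s*(?P<state>\S+)\s*$"
)


def config_sync_states(ha_status_output: str) -> dict:
    """Return {serial: state} for each 'Configuration Status' line found."""
    states = {}
    for line in ha_status_output.splitlines():
        match = STATUS_LINE.match(line.strip())
        if match:
            states[match.group("serial")] = match.group("state")
    return states


def out_of_sync_units(ha_status_output: str) -> list:
    """List serial numbers whose state is anything other than 'in-sync'."""
    states = config_sync_states(ha_status_output)
    return [serial for serial, state in states.items() if state != "in-sync"]
```

Pass the saved 'get system ha status' output to out_of_sync_units(); an empty list means every listed unit reports in-sync.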

 

Check the HA checksum of the HA cluster. Verify if the checksums of the FortiGate/FortiProxy in the cluster are matching or if there is a mismatch that would indicate possible config differences.

 

diagnose sys ha checksum cluster   <----- Shows the checksums for each cluster unit and the VDOM to determine where there is a difference.

================== FGVMXXXXXXXXXX44 ==================
is_manage_master()=1, is_root_master()=1
debugzone
global: c5 33 93 23 26 9f 4d 79 ed 5f 29 fa 7a 8c c9 10
root: d3 b5 fc 60 f3 f0 f0 d0 ea e4 a1 7f 1d 17 05 fc
Cust-A: 84 af 8f 23 b5 31 ca 32 c1 0b f2 76 d2 57 d1 aa
all: 04 ae 37 7e dc 84 aa a4 42 3d db 3c a2 09 b0 g5

checksum
global: c5 33 93 23 26 9f 4d 79 ed 5f 29 fa 7a 8c c9 10
root: d3 b5 fc 60 f3 f0 f0 d0 ea e4 a1 7f 1d 17 05 fc
Cust-A: 84 af 8f 23 b5 31 ca 32 c1 0b f2 76 d2 57 d1 aa
all: 04 ae 37 7e dc 84 aa a4 42 3d db 3c a2 09 b0 g5

================== FGVMXXXXXXXXXX46 ==================
is_manage_master()=0, is_root_master()=0
debugzone
global: c5 33 93 23 26 9f 4d 79 ed 5f 29 fa 7a 8c c9 10
root: d3 b5 fc 60 f3 f0 f0 d0 ea e4 a1 7f 1d 17 05 fc
Cust-A: 84 af 8f 23 b5 31 ca 32 c1 0b f2 76 d2 57 d1 bc
all: 04 ae 37 7e dc 84 aa a4 42 3d db 3c a2 09 b0 60

checksum
global: c5 33 93 23 26 9f 4d 79 ed 5f 29 fa 7a 8c c9 10
root: d3 b5 fc 60 f3 f0 f0 d0 ea e4 a1 7f 1d 17 05 fc
Cust-A: 84 af 8f 23 b5 31 ca 32 c1 0b f2 76 d2 57 d1 bc
all: 04 ae 37 7e dc 84 aa a4 42 3d db 3c a2 09 b0 60


Run this command on both firewalls and compare the outputs.

Collecting the output on only a single firewall is not sufficient (see Technical Tip: How to access secondary unit of HA cluster via CLI). In the output above, compare each checksum line between the two units to locate the mismatch. As visible above, the 'global' and 'root' contexts are synchronized.

 

The problem is not there. However, the checksum for VDOM 'Cust-A' is different, so this VDOM needs to be checked. Whenever a single checksum is different, the 'all' checksum will also be different.
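
The line-by-line comparison can also be scripted once the output is saved as text. The following is a minimal Python sketch under stated assumptions: the function names are hypothetical, and the parser relies only on the layout shown in the example above (a '=====' header per unit, the 'debugzone' and 'checksum' table names, then 'name: checksum' lines).

```python
import re

# Unit header, e.g. "================== FGVMXXXXXXXXXX44 =================="
HEADER = re.compile(r"^=+\s*([^=\s]+)\s*=+$")
# Checksum entry, e.g. "root: d3 b5 fc ..." (scope name, colon, checksum)
ENTRY = re.compile(r"^(\S+):\s+(.+)$")
TABLES = ("debugzone", "checksum")


def parse_cluster_checksums(output):
    """Return {serial: {table: {scope: checksum}}} from the saved output."""
    units = {}
    serial = table = None
    for raw in output.splitlines():
        line = raw.strip()
        header = HEADER.match(line)
        if header:
            serial, table = header.group(1), None
            units[serial] = {}
        elif serial and line in TABLES:
            table = line
            units[serial][table] = {}
        elif serial and table:
            entry = ENTRY.match(line)
            if entry:
                # Normalize spacing so cosmetic differences are ignored.
                checksum = " ".join(entry.group(2).split())
                units[serial][table][entry.group(1)] = checksum
    return units


def mismatched_scopes(units, table="checksum"):
    """Return the scopes (global, VDOM names, all) that differ between units."""
    scopes = []
    for tables in units.values():
        for scope in tables.get(table, {}):
            if scope not in scopes:
                scopes.append(scope)
    return [
        scope
        for scope in scopes
        if len({t.get(table, {}).get(scope) for t in units.values()}) > 1
    ]
```

For the example output above, mismatched_scopes(parse_cluster_checksums(text)) reports 'Cust-A' and 'all', which matches the manual reading.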

 

Another option is to check the difference directly from the GUI:

  1. Select: System -> HA.
  2. Place the mouse over the Not Synchronized Unit Status, and a pop-up information chart will appear with the out-of-sync objects.


 

Use the following command in the CLI to list the HA checksum table on both firewalls:

 

diagnose sys ha checksum test

 

With the above information, it is possible to know directly where the difference might be in the configuration. In this case, compare the following configuration sections on both FortiGate/FortiProxy units:

 

config system global

config system interface

config system ha

config system console

 

The above configuration output can be taken from both primary and secondary devices and compared in Notepad++. Two files can be compared in Notepad++ using Ctrl+Alt+C.
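
Where Notepad++ is not available, the same comparison can be done with Python's standard difflib module. This is a minimal sketch, assuming the output of each section has been saved as text from both units; the function name and the sample values in the usage note are hypothetical.

```python
import difflib


def diff_config(primary_text: str, secondary_text: str) -> list:
    """Return unified-diff lines between two saved CLI outputs.

    Lines starting with '-' exist only on the primary and lines starting
    with '+' exist only on the secondary. An empty list means no difference.
    """
    return list(
        difflib.unified_diff(
            primary_text.splitlines(),
            secondary_text.splitlines(),
            fromfile="primary",
            tofile="secondary",
            lineterm="",
        )
    )
```

For example, if the saved 'config system global' output contains 'set admintimeout 5' on one unit and 'set admintimeout 30' on the other, the result includes one '-' line and one '+' line for that setting. Settings that are expected to differ between cluster members, such as the hostname, will also appear in the diff.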

 

Issue these commands for a more granular view of mismatched VDOMs:

 

diagnose sys ha checksum show <vdom_name>
diagnose sys ha checksum show global

 

For the above example, the only relevant output will come from the following:

 

diagnose sys ha checksum show Cust-A

 

Once the object that is not matching is determined on both cluster units, run the following command, replacing <object_name> with the actual object name:

 

diagnose sys ha checksum show Cust-A <object_name>

 

This shows where in the object the differences are. Check that specific place in the configuration of both units for differences.

 

The grep option can also be used to display checksums for only parts of the configuration. For example, to display system-related configuration checksums in the root VDOM or log-related checksums in the global configuration:

 

diagnose sys ha checksum show root | grep system
diagnose sys ha checksum show global | grep log

 

Note:

Repeat the above commands on all devices to compare the mismatch, then check the corresponding area in the configuration file.
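
Comparing long per-object checksum listings by eye is error-prone, so the saved output from two units can be compared with a short script. The following Python sketch is illustrative only: the function names are hypothetical, and it assumes each line of the 'diagnose sys ha checksum show' output has the form 'object.name: checksum'. Adjust the parsing if the output on a given firmware version differs.

```python
def parse_object_checksums(output: str) -> dict:
    """Return {object_name: checksum} from lines of the form 'name: checksum'."""
    checksums = {}
    for line in output.splitlines():
        name, sep, value = line.strip().partition(":")
        # Skip lines with no colon, no name, or no checksum value.
        if sep and name.strip() and value.strip():
            checksums[name.strip()] = value.strip()
    return checksums


def differing_objects(primary_output: str, secondary_output: str) -> list:
    """Return sorted object names whose checksum differs or is missing on one unit."""
    primary = parse_object_checksums(primary_output)
    secondary = parse_object_checksums(secondary_output)
    return sorted(
        name
        for name in primary.keys() | secondary.keys()
        if primary.get(name) != secondary.get(name)
    )
```

Each name returned is a candidate for the '<object_name>' argument of the object-level checksum command described earlier.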

 

Checksum recalculation using CLI to fix the out-of-sync issue.

If no mismatch is found from the previous step, a simple re-calculation of the checksums can fix the out-of-sync problem. The recalculated checksums should match, and the out-of-sync error messages should stop appearing.

The following command recalculates all HA checksums (run it on both units):

 

diagnose sys ha checksum recalculate

 

Or, more specifically:

 

diagnose sys ha checksum recalculate [<your_vdom_name> | global]

 

Entering the command without options recalculates all checksums. A VDOM name can be specified to recalculate the checksums for that VDOM only. Enter 'global' to recalculate the global checksum. After recalculation, the checksums should match on all devices in the cluster.

Run the following commands to debug HA synchronization (see Technical Tip: Procedure for HA manual synchronization).


execute ha synchronize stop
diagnose debug reset
diagnose debug enable

diagnose debug application hatalk -1  <----- To check the Heartbeat communication between HA devices.

diagnose debug application hasync -1   <----- To check the HA synchronization process.

diagnose debug console timestamp enable

execute ha synchronize start

 

To stop debugging:

 

diagnose debug disable

diagnose debug reset

 

The message 'peer closing the connection' may appear when running the hasync debugs. In that case, attempt to restart the sync daemon on both firewalls with the following command:

 

fnsysctl killall hasync  

 

Note:

Optionally, an admin can use the command 'diagnose sys kill <signal> <pidofhasync>' as an alternative to killing the process if the 'fnsysctl killall hasync' command does not work (i.e., when an external USB is inserted into the FortiGate).

 

To verify whether the process has been restarted, check the ID associated with the hasync process with the following command:

 

diagnose sys process pidof hasync

 

If the process ID is different before and after executing 'fnsysctl killall hasync' or 'diagnose sys kill <signal> <pidofhasync>', the process has been restarted.
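
The before/after comparison can be expressed as a small check. This Python sketch assumes the 'pidof' output consists of one or more numeric process IDs, which is an assumption about the output format; the function names are hypothetical.

```python
def parse_pids(pidof_output: str) -> set:
    """Extract the numeric process IDs from the saved command output."""
    return {int(token) for token in pidof_output.split() if token.isdigit()}


def was_restarted(before_output: str, after_output: str) -> bool:
    """True if hasync is running again and none of the old PIDs remain."""
    before = parse_pids(before_output)
    after = parse_pids(after_output)
    return bool(after) and before.isdisjoint(after)
```

An empty 'after' output returns False because it means the process is not running at all rather than restarted.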

 

Once completed, repeat the manual sync with the debugs enabled.

 

If config sync fails, turn on hadiff logs:

 

diagnose sys ha hadiff log clear
diagnose sys ha hadiff log enable
execute ha synchronize

 

Wait for a few minutes while the FortiGate attempts to sync, and then run the command 'fnsysctl cat /tmp/hadiff/hadiff.log'.

 

Run the following commands to check mismatches instantly:

 

diagnose debug config-error-log read     <----- (1).
diagnose hardware device disk            <----- (2).
show system storage                      <----- (3).
show wanopt storage                      <----- (4).

 

(1): Check the output to identify configuration lines that were not accepted. Try to manually configure the configuration item listed.
(2): Check the device disk on both devices, as the size and availability should match. If one of the cluster members shows the log disk status as 'Need format' or 'Not Available', the unit needs to be disconnected from the cluster and the disk needs to be formatted.
(3): Check the size of the storage disk, as it should match on both devices.
(4): Check the size of the wanopt disk, as the size should match.

 

Perform a reboot of Secondary and/or Primary:

If the cluster is still not in sync, perform a reboot of the Secondary. If the issue is still there, do a failover to Secondary (which will become primary) and then reboot the new Secondary (which was primary before).

 

Isolate the Secondary FortiGate/FortiProxy and apply configuration backup:

This option can be useful if there are any concerns about a possible failover (as the new unit will already have all policies and routing in place) or to copy over more complex HA settings. In HA VDOM mode, make sure to take a backup in Global mode, which would include all VDOMs.

  1. Take a backup of an existing FortiGate in the cluster using a super_admin account. Ensure that there is a local super_admin account present in the configuration. If there is no local super_admin account present, create one before taking the backup.
  2. Disconnect all data/network cables from the secondary.
  3. Disconnect all HA heartbeat cables from the secondary. Failure to perform this step will cause all cluster members to reboot when the configuration backup is restored (see Technical Tip: How to restore a configuration backup on a FortiGate HA cluster).
  4. Establish both a GUI and console connection to the secondary unit.
  5. Restore the backup taken from the cluster unit. This will cause the GUI connection to be lost since the interface and management settings have changed, but the console connection will be maintained using the configured local super_admin.
  6. Wait until the FortiGate has booted up, then log in with a local admin account in the console (see step 1).
  7. Set the following on the new unit in the console:

     

    config system global
        set hostname <secondary_unit>
    end

     

    config system ha
        set priority <lower than priority on primary unit>
    end

  8. Verify whether a reserved management interface is configured (see Technical Tip: HA Reserved Management Interface). If there is one, update the IP address of the reserved interface. If the existing cluster does not have a reserved management interface configuration, skip this step.

     

    config system interface
        edit <mgmt-interface>
            set ip <dedicated secondary_unit ip> <subnet mask>
    end

     

     

Note: 

If the management interface of the new unit should be in a different subnet, a gateway will also need to be set for the ha-mgmt-interface in 'config system ha'.

Steps 7 and 8 ensure that:

  • The hostname is updated to be correct for the secondary unit.
  • Out-of-band management is maintained if in place.
  • The unit will join the cluster as a secondary.

  9. Reconnect the secondary's HA heartbeat cables.

  10. In the primary unit, check the HA status until the secondary unit shows up and is in sync.

    get system ha status <- Get the HA status on the primary unit.

  11. Reconnect the secondary's data/network cables.

 

Note:

It is possible to check the differences in Certificate Bundles for the 'vpn.certificate.ca' object. Use the following command:

 

diagnose autoupdate versions | grep Certificate -A6

 

Here are articles for information about HA member selection based on different priority override statuses:

Technical Tip: FortiGate HA Primary unit selection process when override is disabled vs enabled

Technical Tip: Correcting an out-of-sync HA cluster by modifying the primary unit configuration file... 

 

Related articles:

Technical Tip: HA cluster out-of-sync issue due to 'vpn.certificate.ca' mismatch

Troubleshooting Tip: HA devices out of sync after a firmware upgrade