Created on 06-26-2019 05:41 AM
Edited on 09-19-2023 09:08 AM
By Anthony_E
Description
This article describes how to troubleshoot HA synchronization issues when a cluster is out of sync.
Scope
FortiGate.
Solution
For a multi-VDOM FortiGate, the following commands must be run from 'config global' mode.
get system ha status <----- Shows detailed HA information and cluster failover reason.
Prim-FW (global) # get sys ha status
HA Health Status: OK
Model: FortiGate-VM64-KVM
Mode: HA A-P
Group: 9
Debug: 0
Cluster Uptime: 14 days 5:9:44
Cluster state change time: 2019-06-13 14:21:15
Master selected using:
<date:02> FGVMXXXXXXXXXX44 is selected as the master because it has the largest value of uptime. <--- This is the reason for the last failover.
<date:01> FGVMXXXXXXXXXX46 is selected as the master because it has the largest value of uptime.
<date:00> FGVMXXXXXXXXXX44 is selected as the master because it has the largest value of override priority.
ses_pickup: enable, ses_pickup_delay=disable
override: disable
Configuration Status:
FGVMXXXXXXXXXX44(updated 3 seconds ago): in-sync
FGVMXXXXXXXXXX46(updated 4 seconds ago): in-sync
System Usage stats:
FGVMXXXXXXXXXX44(updated 3 seconds ago):
sessions=42, average-cpu-user/nice/system/idle=0%/0%/0%/100%, memory=64%
FGVMXXXXXXXXXX46(updated 4 seconds ago):
sessions=5, average-cpu-user/nice/system/idle=0%/0%/0%/100%, memory=54%
HBDEV stats:
FGVMXXXXXXXXXX44(updated 3 seconds ago):
port8: physical/10000full, up, rx-bytes/packets/dropped/errors=2233369747/7606667/0/0, tx=3377368072/8036284/0/0
FGVMXXXXXXXXXX46(updated 4 seconds ago):
port8: physical/10000full, up, rx-bytes/packets/dropped/errors=3377712830/8038866/0/0, tx=2233022661/7604078/0/0
MONDEV stats:
FGVMXXXXXXXXXX44(updated 3 seconds ago):
port1: physical/10000full, up, rx-bytes/packets/dropped/errors=1140991879/3582047/0/0, tx=319625288/2631960/0/0
FGVMXXXXXXXXXX46(updated 4 seconds ago):
port1: physical/10000full, up, rx-bytes/packets/dropped/errors=99183156/1638504/0/0, tx=266853/1225/0/0
Master: Prim-FW , FGVMXXXXXXXXXX44, cluster index = 1
Slave : Bkup-Fw , FGVMXXXXXXXXXX46, cluster index = 0
number of vcluster: 1
vcluster 1: work 169.254.0.2
Master: FGVMXXXXXXXXXX44, operating cluster index = 0
Slave : FGVMXXXXXXXXXX46, operating cluster index = 1
Prim-FW(global)# diag sys ha checksum cluster <--- Shows the checksums for each cluster unit and each VDOM in order to determine where the difference lies.
================== FGVMXXXXXXXXXX44 ==================
is_manage_master()=1, is_root_master()=1
debugzone
global: c5 33 93 23 26 9f 4d 79 ed 5f 29 fa 7a 8c c9 10
root: d3 b5 fc 60 f3 f0 f0 d0 ea e4 a1 7f 1d 17 05 fc
Cust-A: 84 af 8f 23 b5 31 ca 32 c1 0b f2 76 d2 57 d1 aa
all: 04 ae 37 7e dc 84 aa a4 42 3d db 3c a2 09 b0 65
checksum
global: c5 33 93 23 26 9f 4d 79 ed 5f 29 fa 7a 8c c9 10
root: d3 b5 fc 60 f3 f0 f0 d0 ea e4 a1 7f 1d 17 05 fc
Cust-A: 84 af 8f 23 b5 31 ca 32 c1 0b f2 76 d2 57 d1 aa
all: 04 ae 37 7e dc 84 aa a4 42 3d db 3c a2 09 b0 65
================== FGVMXXXXXXXXXX46 ==================
is_manage_master()=0, is_root_master()=0
debugzone
global: c5 33 93 23 26 9f 4d 79 ed 5f 29 fa 7a 8c c9 10
root: d3 b5 fc 60 f3 f0 f0 d0 ea e4 a1 7f 1d 17 05 fc
Cust-A: 84 af 8f 23 b5 31 ca 32 c1 0b f2 76 d2 57 d1 bc
all: 04 ae 37 7e dc 84 aa a4 42 3d db 3c a2 09 b0 60
checksum
global: c5 33 93 23 26 9f 4d 79 ed 5f 29 fa 7a 8c c9 10
root: d3 b5 fc 60 f3 f0 f0 d0 ea e4 a1 7f 1d 17 05 fc
Cust-A: 84 af 8f 23 b5 31 ca 32 c1 0b f2 76 d2 57 d1 bc
all: 04 ae 37 7e dc 84 aa a4 42 3d db 3c a2 09 b0 60
These commands must be collected on both firewalls so that the outputs can be compared; collecting them on a single firewall only is not sufficient (see the related article on how to access the secondary firewall).
Look for a checksum mismatch by comparing the cluster checksums between the two units in the output above.
As shown above, the 'global' and 'root' contexts are synchronized, so the problem does not lie there. However, the checksum for VDOM 'Cust-A' differs between the units; this is what needs to be checked.
When even a single checksum differs, the 'all' checksum will also differ.
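Once the 'diag sys ha checksum cluster' output has been captured to text, the comparison between the two units can be automated. The following is a minimal Python sketch; the parsing assumes the output layout shown above, and the function names are illustrative, not part of FortiOS:

```python
import re

def parse_cluster_checksums(text):
    """Parse 'diag sys ha checksum cluster' output into
    {serial: {context: checksum}}, reading only each unit's
    'checksum' section (not 'debugzone')."""
    units, serial, in_checksum = {}, None, False
    for raw in text.splitlines():
        line = raw.strip()
        header = re.match(r"=+\s*(\S+)\s*=+", line)
        if header:
            # "================== <serial> ==================" banner
            serial, in_checksum = header.group(1), False
            units[serial] = {}
        elif line == "checksum":
            in_checksum = True
        elif line == "debugzone":
            in_checksum = False
        elif in_checksum and serial and ":" in line:
            ctx, _, digest = line.partition(":")
            units[serial][ctx.strip()] = digest.strip()
    return units

def mismatched_contexts(units):
    """Return the contexts (global, root, per-VDOM, all) whose
    checksums differ between the two cluster units."""
    first, second = units.values()  # expects exactly two units
    return sorted(ctx for ctx in first if first.get(ctx) != second.get(ctx))
```

For the sample output above, this would report 'Cust-A' and 'all' as mismatched, matching the manual comparison.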
Issue these commands for a more granular view of a mismatched VDOM:
# diag sys ha checksum show <your_vdom_name>
# diag sys ha checksum show global
For the above example, the relevant output is:
# diag sys ha checksum show Cust-A
Once the mismatched object has been identified on both cluster units, run the following command, replacing <object_name> with the actual object name:
# diag sys ha checksum show Cust-A <object_name>
This shows where within the object the differences are; then inspect that specific place in the configuration on both units.
The grep option can also be used to display checksums for only parts of the configuration.
For example, to display system-related configuration checksums in the root VDOM, or log-related checksums in the global configuration:
# diagnose sys ha checksum show root | grep system
# diagnose sys ha checksum show global | grep log
Remember: repeat the above commands on all devices and compare the outputs for mismatches, then check the corresponding area in the configuration file.
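When comparing the per-object listings from 'diag sys ha checksum show <your_vdom_name>' across the two units, a simple line-based diff is usually enough. A minimal Python sketch, assuming each unit's output was saved to text and that each listing line has an 'object: checksum' shape (an assumption based on the command's per-object output):

```python
def diff_object_checksums(primary_text, secondary_text):
    """Compare two per-object checksum listings, one captured from
    each cluster unit, and return the object names whose checksums
    differ or that exist on only one unit."""
    def to_map(text):
        checksums = {}
        for line in text.splitlines():
            obj, sep, digest = line.strip().partition(":")
            if sep and obj:
                checksums[obj.strip()] = digest.strip()
        return checksums

    primary, secondary = to_map(primary_text), to_map(secondary_text)
    return sorted(obj for obj in set(primary) | set(secondary)
                  if primary.get(obj) != secondary.get(obj))
```

Any object reported here is a candidate for the object-level checksum command above and for a direct comparison in the configuration file.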
If no mismatch is found, a simple re-calculation of the checksums can fix the out-of-sync problem.
The re-calculated checksums should match and the out-of-sync error messages should stop appearing.
The following command is to re-calculate all HA checksums (run on both units):
# diagnose sys ha checksum recalculate
Or, more specific:
# diagnose sys ha checksum recalculate [<your_vdom_name> | global]
Entering the command without options recalculates all checksums. A VDOM name can be specified to just recalculate the checksums for that VDOM. Enter global to recalculate the global checksum. It should match on all devices in the cluster.
Run the following commands to debug HA synchronization:
# diag debug app hasync 255
# diag debug enable
# execute ha synchronize start
# diagnose debug application hatalk -1 <----- Checks the heartbeat communication between HA devices.
Run the following commands to check for a mismatch right away:
# diag debug config-error-log read <-- (1)
# diag hardware device disk <-- (2)
# show sys storage <-- (3)
# show wanopt storage <-- (4)
(1): Check the output to identify configuration lines that were not accepted, and try to configure the listed configuration items manually.
(2): Check the disk on both devices; the size and availability should match.
(3): Check the size of the storage disk; it should match on both devices.
(4): Check the size of the wanopt disk; it should match on both devices.
Copyright 2023 Fortinet, Inc. All Rights Reserved.