Description
This article describes methods to force synchronization of the cluster before proceeding to rebuild the HA cluster (the last resort).
Scope
High Availability synchronization.
Solution
For this procedure, it is recommended to have SSH access to all units (for example, via PuTTY).
Note: It is possible to connect to the other units with 'exec ha manage X', where X is the member ID (available IDs can be listed with 'exec ha manage ?').
To check the FortiGate HA status in CLI:
# get sys ha status
# diagnose sys ha cluster-csum (FortiOS 5.0, 5.2)
# diagnose sys ha checksum cluster (FortiOS 5.4 and newer)
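For reference, a fully synchronized cluster returns identical checksum values for every member. The trimmed output below is only an illustration; the serial numbers and hexadecimal values are placeholders and will differ on every cluster:
# diagnose sys ha checksum cluster
================== FGVMXXXXXXXXXX1 ==================
...
all: 04 ae 37 7e dc 84 aa a4 42 3d db 3c a2 09 b0 60
================== FGVMXXXXXXXXXX2 ==================
...
all: 04 ae 37 7e dc 84 aa a4 42 3d db 3c a2 09 b0 60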
If the checksums do not match, perform the following steps, logging ALL the output in case a Technical Support case with Fortinet needs to be opened later:
1) A simple recalculation of the checksums might help.
On the Master unit:
# diagnose sys ha checksum recalculate <----- for FortiOS 5.4 and newer
# diagnose sys ha csum-recalculate <----- for FortiOS 5.2 and older
(check whether the checksums now match)
On Backup units:
# diagnose sys ha checksum recalculate <----- for FortiOS 5.4 and newer
# diagnose sys ha csum-recalculate <----- for FortiOS 5.2 and older
(check whether the checksums now match)
2) Restart the synchronization process and monitor the debug output for errors (check both units at the same time).
On the Master unit:
# execute ha synchronize stop
# diag debug reset
# diag debug enable
# diag debug console timestamp enable
# diag debug application hasync -1
# diag debug application hatalk -1
# execute ha synchronize start
On Backup units:
# diag debug reset
# diag debug enable
# execute ha synchronize stop
# diag debug console timestamp enable
# diag debug application hasync -1
# diag debug application hatalk -1
# execute ha synchronize start
It is possible to check whether the checksums match while this debug output is being collected.
Disable debugging once the Backup units are in sync with the Master unit, or once the log capture is complete:
# diag debug disable
# diag debug reset
3) Manual synchronization
In certain specific scenarios, the cluster fails to synchronize because of particular elements in the configuration.
To avoid rebuilding the cluster, compare the configurations and perform the changes manually.
a) Obtain the configurations from both units, clearly marked as Master and Backup.
Make sure the console output is set to standard (no '--More--' prompt appears*), log the SSH output, and issue the 'show' command on both units**.
Note*: To remove paginated display:
# config system console
set output standard
end
Note**: Do NOT issue 'show full-configuration' unless necessary.
b) Use a comparison tool to check the two files side by side (for example, Notepad++ with the 'Compare' plugin).
c) Certain fields can be ignored (hostname, SN, interface dedicated to management if configured, password hashes, certificates, HA priorities and override settings, and disk labels).
d) Perform the configuration changes in the CLI on the Backup units so that they reflect the Master configuration (see the sketch after this step); if errors occur and they are self-explanatory, act accordingly. If they are not self-explanatory and the configuration cannot be changed (added/deleted), make sure these errors are logged and presented in a TAC case.
After all the changes identified in the comparison have been applied, check the cluster status once again.
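The following sketch is purely illustrative (the object name and subnet are hypothetical, not taken from a real configuration): if the comparison shows that a firewall address object exists on the Master but not on the Backup, connect to the Backup (for example with 'exec ha manage') and recreate it with the standard configuration commands:
# config firewall address
    edit "Host_A"     <----- hypothetical object name
        set subnet 10.0.0.10 255.255.255.255
    next
end
Once the missing elements have been recreated, recalculate the checksums as in step 1) and verify that they match.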
4) Restart the HA daemons / restart the units, one at a time.
Note: This step requires a maintenance window and might need physical access to both units, as it can affect traffic.
If no output is generated by the hasync or hatalk debug, a restart of these daemons may be needed. This can be done by running the following commands on one unit at a time:
# diag sys top <----- note the process IDs of hasync and hatalk
or
# diag sys top-summary | grep hasync
# diag sys top-summary | grep hatalk
# diag sys kill 11 <pid#> <-----repeat for both noted processes
After these commands, the daemons normally restart with different process IDs (check with 'diag sys top').
Since FortiOS 6.2 there is an easier way to determine the process ID (useful in case it does not show up in the 'diag sys top' output):
# diag sys process pidof hasync
# diag sys process pidof hatalk
# diag sys kill 11 <pid#> <-----repeat for both noted processes
After these commands, the daemons normally restart with different process IDs (check with 'diag sys process pidof').
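A minimal illustration of this sequence on FortiOS 6.2 and newer (the process IDs shown are placeholders and will differ on every unit):
# diag sys process pidof hasync
210
# diag sys process pidof hatalk
198
# diag sys kill 11 210
# diag sys kill 11 198
# diag sys process pidof hasync
1377     <----- a new PID indicates the daemon restarted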
In certain conditions, this does not solve the problem, or the daemons fail to restart.
Be prepared for this situation, as a hard reboot may be necessary (either 'exec reboot' from the console, or unplugging and re-plugging the power supply).
After the reboot, check the disk status on both units (if a disk scan is needed, perform it before anything else), then check the cluster status (checksums) once again.
5) If all the above methods fail, a cluster rebuild may be needed.
Note 1: Master and Slave with different disk status
If the Master and Slave units have different disk statuses, the cluster will fail.
The following error could be seen on the console of the Slave:
'Slave and master have different hdisk status. Cannot work with HA master. Shutdown the box!'
The output of the following commands needs to be collected from both cluster members:
# get sys status
# exec disk list
If one of the cluster members shows the log disk status as 'Need format' or 'Not Available', the unit needs to be disconnected from the cluster and a disk format needs to be performed.
This requires a reboot. It can be done by executing the following command:
# execute formatlogdisk <----- a confirmation for reboot follows
If the problem persists, open a ticket with Technical Support with the output of the following commands from both units in the cluster:
# get sys status
# exec disk list
Note 2: Slave unit not seen in the cluster
When checking the checksums, the second unit may be missing or show incomplete output, as follows:
FortiVM1 # diag sys ha checksum cluster
================== FGVMXXXXXXXXXX1 ==================
is_manage_master()=1, is_root_master()=1
debugzone
global: c5 33 93 23 26 9f 4d 79 ed 5f 29 fa 7a 8c c9 10
root: d3 b5 fc 60 f3 f0 f0 d0 ea e4 a1 7f 1d 17 05 fc
all: 04 ae 37 7e dc 84 aa a4 42 3d db 3c a2 09 b0 60
checksum
global: c5 33 93 23 26 9f 4d 79 ed 5f 29 fa 7a 8c c9 10
root: d3 b5 fc 60 f3 f0 f0 d0 ea e4 a1 7f 1d 17 05 fc
all: 04 ae 37 7e dc 84 aa a4 42 3d db 3c a2 09 b0 60
================== FGVMXXXXXXXXXX2 ==================
FortiVM1#
This happens when hasync cannot communicate properly with the other unit.
What can be done:
- make sure the units are running the same firmware version with 'get system status' (see the example below)
- reboot both units one at a time, starting with the Slave
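As an illustration (the version string below is a placeholder; the actual model, version, and build will differ), the firmware version is shown at the top of the 'get system status' output and must be identical on both members:
# get system status
Version: FortiGate-VM64 v6.2.x,buildXXXX,XXXXXX (GA)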
Related Articles
Technical Note: How to create a log file of a session using PuTTY