Description
This article describes the methods used to force synchronization on the cluster before proceeding to rebuild the HA configuration (a last resort).
Scope
FortiGate, High Availability synchronization.
Solution
For this procedure, it is recommended to have SSH access to all units (e.g., PuTTY).
Note:
It is possible to connect to the other units with 'exec ha manage X <username>', where X is the member ID (available IDs can be listed with 'exec ha manage ?').
To check the FortiGate HA status in the CLI:
get sys ha status
diagnose sys ha checksum cluster
If the checksums do not match, perform the following steps, logging ALL output in case a Technical Support case with Fortinet needs to be opened later:
- Force the Backup unit to synchronize with the Primary unit. On the Backup unit:
execute ha synchronize start
- Simple recalculation of checksums might help. On the Primary unit:
diagnose sys ha checksum recalculate (then check again whether the units are synchronized).
On Backup units:
diagnose sys ha checksum recalculate (then check again whether the units are synchronized).
- Restart the synchronization process and monitor the debug output for errors (check both units at the same time).
Note:
The user may be logged out of the Backup units during this process; this is a good sign.
On the Primary unit:
execute ha synchronize stop
diag debug reset
diag debug enable
diag debug console timestamp enable
diag debug application hasync -1
diag debug application hatalk -1
execute ha synchronize start
On Backup units:
diag debug reset
diag debug enable
execute ha synchronize stop
diag debug console timestamp enable
diag debug application hasync -1
diag debug application hatalk -1
execute ha synchronize start
It is possible to check whether the checksums match while this debug output is running. Disable debugging once the Backup units are in sync with the Primary unit, or after log capture is complete (5-6 minutes):
diag debug disable
diag debug reset
- Manual synchronization. In certain specific scenarios, the cluster fails to synchronize because of some elements in the configuration. To avoid rebuilding the cluster, compare the configurations and apply the changes manually.
- Obtain the configurations from both units, clearly marked as Primary and Secondary/Backup.
Make sure the console output is standard (no '--More--' prompt appears*), log the SSH output, and issue the command 'show' on both units**.
Note*: To remove paginated display:
config system console
set output standard
end
Note**:
Do not issue 'show full-configuration' unless necessary.
- Use any available comparison tool to check the two files side by side (e.g. Notepad++ with the 'Compare' plugin).
- Certain fields can be ignored (hostname, serial number, the dedicated management interface if configured, password hashes, certificates, HA priorities and override settings, and disk labels).
- Perform configuration changes in the CLI on the Backup units to mirror the configuration of the Primary. If errors occur and are self-explanatory, act accordingly; if an error is not self-explanatory and the configuration cannot be changed (added/deleted), make sure these errors are logged and presented in a TAC case.
After all the changes outlined in the comparison are applied, check the cluster status once again.
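The comparison step can also be scripted. The following Python sketch (a hypothetical helper; the keyword list of expected differences is an assumption, adjust it to the ignorable fields listed above) uses difflib to diff the two 'show' outputs while skipping lines that legitimately differ between members:

```python
import difflib

# Fields expected to differ between HA members (illustrative assumption;
# extend with certificates, disk labels, etc. as needed).
IGNORED_KEYWORDS = ("set hostname", "set alias", "set priority",
                    "set override")

def filtered(lines):
    """Drop lines that are expected to differ between the two units."""
    return [l for l in lines if not any(k in l for k in IGNORED_KEYWORDS)]

def config_diff(primary_cfg: str, backup_cfg: str):
    """Unified diff of the two 'show' outputs, skipping expected differences.

    An empty result means the configurations match apart from the
    ignorable fields.
    """
    return list(difflib.unified_diff(
        filtered(primary_cfg.splitlines()),
        filtered(backup_cfg.splitlines()),
        fromfile="primary", tofile="backup", lineterm=""))
```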
- Restart the HA daemons / restart the units, one by one.
Note:
This step requires a maintenance window and might need physical access to both units, as it can affect the traffic.
If there is no output generated in the hasync or hatalk debug, a restart of these daemons may be needed. This can be done by running the following commands on one unit at a time:
diag sys top <- Note the process IDs of hasync and hatalk.
Or:
diag sys top-summary | grep hasync
diag sys top-summary | grep hatalk
diag sys kill 11 <pid#> <- Repeat for both noted processes.
After these commands, the daemons normally restart with different process IDs (check this via 'diag sys top'). Since v6.2, there is an easier way to determine the process ID (in case it does not show up in the 'diag sys top' output):
diag sys process pidof hasync
diag sys process pidof hatalk
diag sys kill 11 <pid#> <----- Repeat for both noted processes.
After these commands, the daemons normally restart with different process IDs (check this via 'diag sys process pidof').
Another way to restart the HA daemons is by running the following commands:
fnsysctl killall hasync
fnsysctl killall hatalk
In certain conditions, this does not solve the problem, or the daemons fail to restart. Be prepared for this situation, as a hard reboot may be necessary (either 'execute reboot' from the console or unplugging/replugging the power supply).
After the reboot, check the disk status for both units (if a disk scan is needed, perform it before anything else), then check the cluster status (checksums) once again.
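When reading 'diag sys top' output by hand is tedious, a small Python sketch can pick out the PIDs to pass to 'diag sys kill 11'. This is a hypothetical helper, not a Fortinet tool, and the column layout (process name first, PID second) is an assumption to verify on the unit:

```python
def find_daemon_pids(top_output: str, names=("hasync", "hatalk")):
    """Map daemon name -> PID from 'diag sys top'-style lines.

    Assumes each process line starts with the process name followed by
    its PID, e.g. 'hasync  123  S  0.0  1.2' (hedged; verify the exact
    column order on your firmware).
    """
    pids = {}
    for line in top_output.splitlines():
        parts = line.split()
        if len(parts) >= 2 and parts[0] in names and parts[1].isdigit():
            pids[parts[0]] = int(parts[1])
    return pids
```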
- If all the above methods fail, a cluster rebuild may be needed.
Note 1:
Primary and Secondary with different disk statuses.
If the Primary and Secondary units have different disk statuses, the cluster will fail. The following error could be seen on the console of the Secondary:
Slave and master have different hdisk status. Cannot work with HA master. Shutdown the box!
The output of the following commands needs to be collected from both cluster members:
get sys status
exec disk list
If one of the cluster members shows log disk status as 'Need format' or 'Not Available', the unit needs to be disconnected from the cluster and a disk format needs to be performed. This requires a reboot. It can be done by executing the following command:
execute formatlogdisk <- A confirmation for a reboot follows.
If the problem persists, open a ticket with Technical Support with the output of the following commands from both units in the cluster:
get sys status
exec disk list
Note 2:
The Secondary unit is not visible in the cluster. When checking the checksums, the second unit may be missing or show incomplete output, as follows:
diag sys ha checksum cluster
================== FGVMXXXXXXXXXX1 ==================
is_manage_master()=1, is_root_master()=1
debugzone
global: c5 33 93 23 26 9f 4d 79 ed 5f 29 fa 7a 8c c9 10
root: d3 b5 fc 60 f3 f0 f0 d0 ea e4 a1 7f 1d 17 05 fc
all: 04 ae 37 7e dc 84 aa a4 42 3d db 3c a2 09 b0 60
checksum
global: c5 33 93 23 26 9f 4d 79 ed 5f 29 fa 7a 8c c9 10
root: d3 b5 fc 60 f3 f0 f0 d0 ea e4 a1 7f 1d 17 05 fc
all: 04 ae 37 7e dc 84 aa a4 42 3d db 3c a2 09 b0 60
================== FGVMXXXXXXXXXX2 ==================
FortiVM1#
This happens in a situation where the hasync daemon cannot communicate properly with the other unit.
What can be done:
- Make sure the units are running the same firmware via 'get system status'.
- Reboot both units one at a time, starting with the Secondary.
An additional method to recover HA synchronization:
Step 1. Obtain the configuration file of the Master unit.
Step 2. Edit the file to be used on the Slave unit by making the following modifications to the text file:
config system global
set hostname XXXX -> Name of Slave Device.
set alias "Slave Serial Number"
Step 3. Go to 'config system ha' -> adjust the configuration to correspond to the Slave unit.
Step 4. Disconnect the cable in the LAN ports of the Slave equipment.
Step 5. Disconnect the cable in the heartbeat interface from the Slave device.
Step 6. Connect via GUI to the Slave device and load the configuration file that was modified.
Step 7. Connect the cable back again in the heartbeat interface of the Slave device.
Step 8. Connect the cable back again in the ports of the Slave device.
Step 9. Run the following commands to check the HA status after the modification:
get hardware status
diagnose system ha checksum show
diagnose system ha checksum show global
After this modification, it should synchronize again.
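Steps 1-2 above can be sketched as a small Python helper (hypothetical, for illustration only; the HA-section settings from Step 3 still need manual adjustment) that rewrites the hostname and alias lines of the Master configuration for the Slave unit:

```python
def adapt_config_for_backup(cfg: str, slave_hostname: str,
                            slave_serial: str) -> str:
    """Rewrite 'set hostname' and 'set alias' lines of a Master config
    so the file can be loaded on the Slave unit.

    Sketch only: 'config system ha' settings (priority, override, etc.)
    are not touched and must still be edited by hand per Step 3.
    """
    out = []
    for line in cfg.splitlines():
        stripped = line.strip()
        indent = line[: len(line) - len(line.lstrip())]
        if stripped.startswith("set hostname "):
            line = f'{indent}set hostname "{slave_hostname}"'
        elif stripped.startswith("set alias "):
            line = f'{indent}set alias "{slave_serial}"'
        out.append(line)
    return "\n".join(out)
```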
Related articles:
Technical Tip: How to create a log file of a session using PuTTY
Technical Tip: Rebuilding an HA cluster