Description
This article describes the methods used to force the synchronization of a High Availability (HA) cluster before proceeding to rebuild the HA cluster.
Scope
FortiGate, High Availability synchronization.
Solution
For this procedure, it is recommended to have access to all units through SSH (e.g., PuTTY). Furthermore, confirm that all devices in the cluster are running the same firmware version:
get system status
execute ha manage <id> <username>
get system status
Note:
It is possible to connect to the other units with 'execute ha manage X <username>' where X is the member ID (available IDs can be found by using the 'execute ha manage ?' command).
Also, HA-related commands in multi-vdom environments must be run from the global VDOM.
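For example, listing the cluster members and then connecting to one of them could look like this (illustrative session; the member index, serial number, and 'admin' username are placeholders):
execute ha manage ?
<id>    please input peer box index.
<1>     Subsidary unit FGVMXXXXXXXXXX2
execute ha manage 1 admin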
To check the FortiGate HA status in the CLI:
get system ha status
diagnose sys ha checksum cluster
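For illustration, an out-of-sync member is flagged in the 'get system ha status' output; abridged, hypothetical example (serial numbers and timings will differ):
get system ha status
...
Configuration Status:
FGVMXXXXXXXXXX1(updated 1 seconds ago): in-sync
FGVMXXXXXXXXXX2(updated 3 seconds ago): out-of-sync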
If the checksums do not match, perform the following steps, logging all the output in case it is needed later to open a Technical Support case with Fortinet:
Force the Backup unit to synchronize with the Primary unit.
On the Backup unit:
execute ha synchronize start
A simple recalculation of checksums might help. On the Primary unit:
diagnose sys ha checksum recalculate (then check again if synchronized).
On Backup units:
diagnose sys ha checksum recalculate (then check again if synchronized).
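A minimal sequence, run on the Primary first and then on each Backup, re-checking the cluster checksums after each recalculation:
diagnose sys ha checksum recalculate
diagnose sys ha checksum cluster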
Restart the synchronization process and monitor the debug output for errors (check both units simultaneously).
On the Primary unit:
execute ha synchronize stop
diagnose debug reset
diagnose debug enable
diagnose debug console timestamp enable
diagnose debug application hasync -1
diagnose debug application hatalk -1
execute ha synchronize start
On Backup units:
diagnose debug reset
diagnose debug enable
execute ha synchronize stop
diagnose debug console timestamp enable
diagnose debug application hasync -1
diagnose debug application hatalk -1
execute ha synchronize start
While the debug is running, it is possible to check whether the checksums match. Disable debugging once the Backup units are in sync with the Primary unit, or after log capture is complete (5-6 minutes):
diagnose debug disable
diagnose debug reset
Restart the HA daemons / restart the units, one by one.
This step requires a maintenance window and might need physical access to both units, as it can affect traffic. Killing HA processes may cause an HA failover.
If there is no output generated in the hasync or hatalk debug, a restart of these daemons may be needed. This can be done by running the following commands on one unit at a time:
diagnose sys top <----- Note the process IDs of hasync and hatalk.
Or:
diagnose sys top-summary | grep hasync <----- 'top-summary' does not exist on v6.4 and above.
diagnose sys top-summary | grep hatalk <----- 'top-summary' does not exist on v6.4 and above.
diagnose sys kill 11 <pid#> <------ Repeat for both noted processes.
After these commands, the daemons normally restart with different process IDs (check this via 'diagnose sys top'). Since v6.2, there is an easier way to determine the process ID (in case it does not show up in the 'diagnose sys top' command):
diagnose sys process pidof hasync
diagnose sys process pidof hatalk
diagnose sys kill 11 <pid#> <----- Repeat for both noted processes.
After these commands, the daemons normally restart with different process IDs (check this via 'diagnose sys process pidof').
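For example, on v6.2 and above (the PIDs shown are hypothetical):
diagnose sys process pidof hasync
208
diagnose sys kill 11 208
diagnose sys process pidof hatalk
210
diagnose sys kill 11 210
diagnose sys process pidof hasync <----- A new PID indicates the daemon restarted.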
Another way to restart the HA daemons is by running the following commands:
fnsysctl killall hasync
fnsysctl killall hatalk
In certain conditions, this does not solve the problem, or the daemons fail to restart. A hard reboot may be necessary (either execute a reboot from the console or power-cycle the unit).
After the reboot, check the disk status for both units (if a disk scan is needed, perform it before anything else), then check the cluster status (checksums) once again.
Manual synchronization. In certain scenarios, the cluster fails to synchronize due to specific elements in the configuration. To avoid rebuilding the cluster, compare the configurations and perform the changes manually:
- Obtain the configurations from both units marked as Primary and Secondary/Backup. Make sure the console output is standard (no '--More--' text appears), log the SSH output, and issue the 'show' command on both units.
Note:
To remove the paginated display:
config system console
set output standard
end
Note:
Do not issue 'show full-configuration' unless necessary.
- Use any comparison tool available to check the two files side by side (e.g., Notepad++ with the 'Compare' plugin).
- Certain fields can be ignored: hostname, serial number, the interface dedicated to management (if configured), password hashes, certificates, HA priorities and override settings, and disk labels (see the example after this list).
- Perform configuration changes in the CLI on the Backup units to reflect the configuration of the Primary. If errors occur and the messages are self-explanatory, act accordingly; if an error is not self-explanatory and the configuration cannot be changed (added/deleted), ensure these errors are logged and presented in a TAC case.
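For example, a diff limited to lines like the following (hypothetical values) is expected between members and can be safely ignored:
config system global
set hostname "FGT-A" <----- Differs per member; ignore.
set alias "FGVMXXXXXXXXXX1" <----- Differs per member; ignore.
end
config system ha
set priority 200 <----- Differs per member; ignore.
end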
After all of the changes outlined in the comparison are corrected, check for cluster status once again.
If all the above methods fail, a cluster rebuild may be needed.
Note 1:
Primary and Secondary with different disk statuses.
If the Primary and Secondary units have different disk statuses, the cluster will fail. The following error could be seen on the console of the Secondary:
primary and secondary have different hdisk status. Cannot work with HA primary. Shutdown the box!
The output of the following commands needs to be collected from both cluster members:
get system status
execute disk list
If one of the cluster members shows log disk status as 'Need format' or 'Not Available', the unit needs to be disconnected from the cluster, and a disk format needs to be performed. This requires a reboot. It can be done by executing the following command:
execute formatlogdisk <- A confirmation for a reboot follows.
If the problem persists, open a ticket with Technical Support with the output of the following commands from both units in the cluster:
get system status
execute disk list
Note 2:
The secondary unit is not visible in the cluster. When checking the checksums, the second unit may be missing, or its output may be incomplete, as follows:
diagnose sys ha checksum cluster
================== FGVMXXXXXXXXXX1 ==================
is_manage_primary()=1, is_root_primary()=1
debugzone
global: c5 33 93 23 26 9f 4d 79 ed 5f 29 fa 7a 8c c9 10
root: d3 b5 fc 60 f3 f0 f0 d0 ea e4 a1 7f 1d 17 05 fc
all: 04 ae 37 7e dc 84 aa a4 42 3d db 3c a2 09 b0 60
checksum
global: c5 33 93 23 26 9f 4d 79 ed 5f 29 fa 7a 8c c9 10
root: d3 b5 fc 60 f3 f0 f0 d0 ea e4 a1 7f 1d 17 05 fc
all: 04 ae 37 7e dc 84 aa a4 42 3d db 3c a2 09 b0 60
================== FGVMXXXXXXXXXX2 ==================
FortiVM1#
This happens when hasync cannot communicate properly with the other unit.
What can be done:
- Make sure the units are running the same firmware via 'get system status'.
- Reboot both units one at a time, starting with the Secondary.
Note 3:
The secondary unit is not visible in the cluster due to a different FIPS-CC mode.
To verify, run the following on both units; an error similar to the one below will be seen:
diagnose debug app hatalk -1
diagnose debug enable
<hatalk> vcluster_0: ha_prio=0(primary), state/chg_time/now=2(work)/1750807157/1750812543
<hatalk:WARN> 'FG-SerialNumber' enc/auth mismatch: hdr_enc/auth=0/0, my_enc/auth=1/1
<hatalk> vcluster_0: ha_prio=0(primary), state/chg_time/now=2(work)/1750807157/1750812553
<hatalk> vcluster_0: ha_prio=0(primary), state/chg_time/now=2(work)/1750807157/1750812563
<hatalk:WARN> 'FG-SerialNumber' enc/auth mismatch: hdr_enc/auth=0/0, my_enc/auth=1/1
To disable the real-time debug, run the following on both units:
diagnose debug disable
diagnose debug reset
Compare the FIPS-CC mode on both units in the 'get system status' output.
If the mode differs on one unit, enable it by following this article: Technical Tip: How to enable FIPS-CC mode.
FIPS-CC mode can only be enabled through the console. The 'admin' account needs to exist, and the configuration will be erased after enabling FIPS-CC mode.
Afterwards, restore the modified backup configuration to the Secondary, and the Secondary unit should become visible in the HA cluster.
An additional method to recover HA synchronization:
Step 1. Obtain the configuration file of the Primary unit.
Step 2. Edit the file for use on the Secondary unit by making the following modifications to the text file:
config system global
set hostname <xxxx> <----- Hostname of the Secondary device.
set alias "Secondary Serial Number"
end
config system ha
set priority <xxx> <----- The priority value must be lower than that of the Primary unit.
end
Step 3. In the edited file, under 'config system ha', apply the configuration corresponding to the Secondary unit.
Step 4. Disconnect the cables from the LAN ports of the Secondary unit.
Step 5. Disconnect the cable from the heartbeat interface of the Secondary unit.
Step 6. Connect to the Secondary unit via the GUI and load the modified configuration file.
Step 7. Reconnect the cable to the heartbeat interface of the Secondary unit.
Step 8. Reconnect the cables to the LAN ports of the Secondary unit.
Step 9. Run the following commands to check the HA status after the modification:
get system ha status
diagnose sys ha checksum show
diagnose sys ha checksum show global
After this modification, the cluster should synchronize again.
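When the cluster is back in sync, the checksum lines reported for each member match; illustrative, abridged output (values reuse the earlier example):
diagnose sys ha checksum cluster
================== FGVMXXXXXXXXXX1 ==================
...
all: 04 ae 37 7e dc 84 aa a4 42 3d db 3c a2 09 b0 60
================== FGVMXXXXXXXXXX2 ==================
...
all: 04 ae 37 7e dc 84 aa a4 42 3d db 3c a2 09 b0 60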
Note:
In some cases, accessing the Secondary FortiGate's CLI through the Primary FortiGate's CLI shows frequent disconnections when checking the configuration on the Secondary, and the HA remains out of sync. The solution is to reboot the Secondary FortiGate, but make sure to follow all the steps above before proceeding with the reboot.
Note:
If the previous steps do not resolve the issue, the configuration file from the primary unit may need to be downloaded and manually edited:
- Modify the HA parameters.
- Update the hostname.
- Adjust the management interface, if applicable.
Once edited, deploy the file to the secondary unit. This approach is straightforward and effective as a last resort.
Note: On the cloud platforms below, interface IP addresses are not synced by HA, since the FortiGates may be in different subnets:
- FGT_ARM64_AZURE
- FGT_ARM64_GCP
- FGT_VM64_ALI
- FGT_VM64_AZURE
- FGT_VM64_GCP
- FGT_VM64_IBM
- FGT_VM64_RAXONDEMAND
For example, an IPsec tunnel interface IP address is not synced (only on the platforms listed above). The workaround is to manually configure the IP addresses on both devices, as in the sketch below.
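A minimal sketch of this workaround, assuming a hypothetical IPsec tunnel interface named 'to-hq' (addresses are placeholders); run the equivalent on each member with its own local/remote pair:
config system interface
edit "to-hq"
set ip 10.10.10.1 255.255.255.255
set remote-ip 10.10.10.2 255.255.255.255
next
end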