Description
This article describes methods to force synchronization of the cluster before proceeding to rebuild the HA cluster (the last resort).
Scope
High Availability synchronization.
Solution
For this procedure, it is recommended to have SSH access to all units (for example, via PuTTY).
Note: It is possible to connect to the other units with 'exec ha manage X', where X is the member ID (available IDs can be listed with 'exec ha manage ?').
To check the FortiGate HA status in CLI:
# get sys ha status
# diagnose sys ha cluster-csum (FortiOS 5.0, 5.2)
# diagnose sys ha checksum cluster (FortiOS 5.4 and newer)
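For reference, a fully synchronized cluster returns identical checksum values for every member. The trimmed output below is only an illustration; the serial numbers and hexadecimal values are placeholders and will differ on every cluster:
# diagnose sys ha checksum cluster
================== FGVMXXXXXXXXXX1 ==================
...
all: 04 ae 37 7e dc 84 aa a4 42 3d db 3c a2 09 b0 60
================== FGVMXXXXXXXXXX2 ==================
...
all: 04 ae 37 7e dc 84 aa a4 42 3d db 3c a2 09 b0 60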
If the checksums do not match, perform the following steps, logging ALL the output in case a Technical Support case with Fortinet needs to be opened later:
1) A simple recalculation of the checksums might help.
On the Master unit:
# diagnose sys ha checksum recalculate <----- for FortiOS 5.4 and newer
# diagnose sys ha csum-recalculate <----- for FortiOS 5.2 and older
(check whether the checksums now match)
On Backup units:
# diagnose sys ha checksum recalculate <----- for FortiOS 5.4 and newer
# diagnose sys ha csum-recalculate <----- for FortiOS 5.2 and older
(check whether the checksums now match)
2) Restart the synchronization process and monitor the debug output for errors (check both units at the same time).
On the Master unit:
# execute ha synchronize stop
# diag debug reset
# diag debug enable
# diag debug console timestamp enable
# diag debug application hasync -1
# diag debug application hatalk -1
# execute ha synchronize start
On Backup units:
# diag debug reset
# diag debug enable
# execute ha synchronize stop
# diag debug console timestamp enable
# diag debug application hasync -1
# diag debug application hatalk -1
# execute ha synchronize start
It is possible to check whether the checksums match while this debug output is being collected.
Disable debugging once the Backup units are in sync with the Master unit, or once the log capture is complete:
# diag debug disable
# diag debug reset
3) Manual synchronization
In certain specific scenarios, the cluster fails to synchronize because of particular elements in the configuration.
To avoid rebuilding the cluster, compare the configurations and perform the changes manually.
a) Obtain the configurations from both units, clearly marked as Master and Backup.
Make sure the console output is set to standard (no '--More--' prompt appears*), log the SSH output, and issue the 'show' command on both units**.
Note*: To remove paginated display:
# config system console
set output standard
end
Note**: Do NOT issue 'show full-configuration' unless necessary.
b) Use a comparison tool to check the two files side by side (for example, Notepad++ with the 'Compare' plugin).
c) Certain fields can be ignored (hostname, SN, interface dedicated to management if configured, password hashes, certificates, HA priorities and override settings, and disk labels).
d) Perform the configuration changes in the CLI on the Backup units so that they reflect the Master configuration (see the sketch after this step); if errors occur and they are self-explanatory, act accordingly. If they are not self-explanatory and the configuration cannot be changed (added/deleted), make sure these errors are logged and presented in a TAC case.
After all the changes identified in the comparison have been applied, check the cluster status once again.
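The following sketch is purely illustrative (the object name and subnet are hypothetical, not taken from a real configuration): if the comparison shows that a firewall address object exists on the Master but not on the Backup, connect to the Backup (for example with 'exec ha manage') and recreate it with the standard configuration commands:
# config firewall address
    edit "Host_A"     <----- hypothetical object name
        set subnet 10.0.0.10 255.255.255.255
    next
end
Once the missing elements have been recreated, recalculate the checksums as in step 1) and verify that they match.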
4) Restart the HA daemons / restart the units, one at a time.
Note: This step requires a maintenance window and might need physical access to both units, as it can affect traffic.
If no output is generated by the hasync or hatalk debug, a restart of these daemons may be needed. This can be done by running the following commands on one unit at a time:
# diag sys top <----- note the process IDs of hasync and hatalk
or
# diag sys top-summary | grep hasync
# diag sys top-summary | grep hatalk
# diag sys kill 11 <pid#> <-----repeat for both noted processes
After these commands, the daemons normally restart with different process IDs (check with 'diag sys top').
Since FortiOS 6.2 there is an easier way to determine the process ID (useful in case it does not show up in the 'diag sys top' output):
# diag sys process pidof hasync
# diag sys process pidof hatalk
# diag sys kill 11 <pid#> <-----repeat for both noted processes
After these commands, the daemons normally restart with different process IDs (check with 'diag sys process pidof').
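A minimal illustration of this sequence on FortiOS 6.2 and newer (the process IDs shown are placeholders and will differ on every unit):
# diag sys process pidof hasync
210
# diag sys process pidof hatalk
198
# diag sys kill 11 210
# diag sys kill 11 198
# diag sys process pidof hasync
1377     <----- a new PID indicates the daemon restarted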
In certain conditions, this does not solve the problem, or the daemons fail to restart.
Be prepared for this situation, as a hard reboot may be necessary (either 'exec reboot' from the console, or unplugging and re-plugging the power supply).
After the reboot, check the disk status on both units (if a disk scan is needed, perform it before anything else), then check the cluster status (checksums) once again.
5) If all the above methods fail, a cluster rebuild may be needed.
Note 1: Master and Slave with different disk status
If the Master and Slave units have different disk statuses, the cluster will fail.
The following error could be seen on the console of the Slave:
'Slave and master have different hdisk status. Cannot work with HA master. Shutdown the box!'
The output of the following commands needs to be collected from both cluster members:
# get sys status
# exec disk list
If one of the cluster members shows the log disk status as 'Need format' or 'Not Available', the unit needs to be disconnected from the cluster and a disk format needs to be performed.
This requires a reboot. It can be done by executing the following command:
# execute formatlogdisk <----- a confirmation for reboot follows
If the problem persists, open a ticket with Technical Support with the output of the following commands from both units in the cluster:
# get sys status
# exec disk list
Note 2: Slave unit not seen in the cluster
When checking the checksums, the second unit may be missing or show incomplete output, as follows:
FortiVM1 # diag sys ha checksum cluster
================== FGVMXXXXXXXXXX1 ==================
is_manage_master()=1, is_root_master()=1
debugzone
global: c5 33 93 23 26 9f 4d 79 ed 5f 29 fa 7a 8c c9 10
root: d3 b5 fc 60 f3 f0 f0 d0 ea e4 a1 7f 1d 17 05 fc
all: 04 ae 37 7e dc 84 aa a4 42 3d db 3c a2 09 b0 60
checksum
global: c5 33 93 23 26 9f 4d 79 ed 5f 29 fa 7a 8c c9 10
root: d3 b5 fc 60 f3 f0 f0 d0 ea e4 a1 7f 1d 17 05 fc
all: 04 ae 37 7e dc 84 aa a4 42 3d db 3c a2 09 b0 60
================== FGVMXXXXXXXXXX2 ==================
FortiVM1#
This happens when hasync cannot communicate properly with the other unit.
What can be done:
- make sure the units are running the same firmware version with 'get system status' (see the example below)
- reboot both units one at a time, starting with the Slave
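As an illustration (the version string below is a placeholder; the actual model, version, and build will differ), the firmware version is shown at the top of the 'get system status' output and must be identical on both members:
# get system status
Version: FortiGate-VM64 v6.2.x,buildXXXX,XXXXXX (GA)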
Related Articles
Technical Note: How to create a log file of a session using PuTTY