jcovarrubias
Staff
November 11, 2024

Troubleshooting Tip: FortiGate High-Availability (HA) Upgrade: Troubleshooting SENT-IMAGE Status Loop

Description

This article describes a scenario (and the root cause) where performing a firmware upgrade on High-Availability (HA) FortiGate clusters can result in the upgrade failing and the Primary FortiGate being stuck in the SENT-IMAGE state (according to debug output from the hatalk daemon).

 

A common symptom is that the Secondary FortiGate(s) are upgraded successfully, but the Primary FortiGate never completes its own upgrade.

Scope

FortiGate

Solution

Before starting:

It is important to understand the general procedure for firmware upgrades on HA FortiGate clusters. Refer to the following KB article for further information on this process: Technical Tip: FortiGate HA upgrade procedure and the status during the upgrade.

 

Issue Description:

When the firmware upgrade is initiated, the current Primary FortiGate sends the target firmware image over to the Secondary FortiGate in the HA cluster. The expected procedure is that the Secondary unit receives the firmware image, reboots to apply the image, then assumes the HA Primary role after completing the reboot/upgrade so that the old Primary can proceed with the firmware upgrade.

 

However, in certain circumstances, the Primary unit may fail to proceed with its own upgrade, even though the Secondary FortiGate has completed the upgrade and is running the new firmware. In this state, the two FortiGates can still form the HA cluster, but they will continually remain out-of-sync with one another due to the firmware version mismatch.

 

In this scenario, the Primary FortiGate never completes its own upgrade, and running the following debug commands on it shows the same messages repeating indefinitely:

 

FortiGate # diagnose debug application hatalk -1
FortiGate # diagnose debug enable

<hatalk> entering hatalk_upgrade_timer_func: uprade_state=3(SENT-IMAGE), daemon_bits=0x00000000

<hatalk> leaving hatalk_upgrade_timer_func: uprade_state=3(SENT-IMAGE), daemon_bits=0x00000000

<hatalk> entering hatalk_upgrade_timer_func: uprade_state=3(SENT-IMAGE), daemon_bits=0x00000000

<hatalk> leaving hatalk_upgrade_timer_func: uprade_state=3(SENT-IMAGE), daemon_bits=0x00000000

<hatalk> entering hatalk_upgrade_timer_func: uprade_state=3(SENT-IMAGE), daemon_bits=0x00000000
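To confirm the loop from a saved debug capture rather than by eye, the output can be scanned for consecutive timer lines that report the same state. The following is a minimal Python sketch, assuming the capture has been saved as plain text; the function name longest_state_run is hypothetical, and the pattern is based only on the line format shown above (including the "uprade_state" spelling as it appears in that output).

```python
import re

# Matches the upgrade-state field as it appears in the debug lines above.
# The "uprade_state" spelling is copied from that output on purpose.
STATE_RE = re.compile(r"uprade_state=(\d+)\(([^)]+)\)")


def longest_state_run(log_text, state="SENT-IMAGE"):
    """Return the longest run of consecutive timer lines reporting `state`."""
    longest = 0
    current = 0
    for line in log_text.splitlines():
        match = STATE_RE.search(line)
        if match is None:
            continue  # skip blank lines and unrelated debug output
        if match.group(2) == state:
            current += 1
            longest = max(longest, current)
        else:
            current = 0  # a different state breaks the run
    return longest


sample = """\
<hatalk> entering hatalk_upgrade_timer_func: uprade_state=3(SENT-IMAGE), daemon_bits=0x00000000
<hatalk> leaving hatalk_upgrade_timer_func: uprade_state=3(SENT-IMAGE), daemon_bits=0x00000000
<hatalk> entering hatalk_upgrade_timer_func: uprade_state=3(SENT-IMAGE), daemon_bits=0x00000000
"""

print(longest_state_run(sample))  # 3
```

A run that keeps growing for as long as the capture lasts, with no other state in between, is consistent with the stuck condition described in this article.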

 

Root Cause:

The most typical cause of this issue is a misconfiguration of the following HA settings under config system ha:

 

config system ha

set hb-interval <1-20, default = 2> <--- Time between sending heartbeats

set hb-interval-in-milliseconds [100ms | 10ms, default = 100ms] <--- Units of interval time between sending heartbeats

set hb-lost-threshold <1-60, default = 6> <--- Number of lost heartbeats to signal a failure

end

 

As an example, consider the following configuration:

 

config system ha

set hb-interval 20

set hb-interval-in-milliseconds 100ms

set hb-lost-threshold 60

end

 

The above configuration sets a heartbeat interval of 2 seconds (hb-interval * hb-interval-in-milliseconds = 20 * 100ms) with a lost threshold of 60 heartbeats. A FortiGate cluster member must therefore be unreachable for 120 seconds in total (60 * 2 seconds) before it is detected as down based on HA heartbeats.
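The arithmetic can be checked for any combination of these three settings. The following is a minimal Python sketch; the function name detection_time_seconds is hypothetical, and the unit is passed as a number of milliseconds (100 or 10) rather than as the CLI keyword.

```python
def detection_time_seconds(hb_interval, unit_ms, hb_lost_threshold):
    """Seconds a peer must be unreachable before it is declared down.

    hb_interval       -- value of "set hb-interval"
    unit_ms           -- 100 or 10, per "set hb-interval-in-milliseconds"
    hb_lost_threshold -- value of "set hb-lost-threshold"
    """
    # Work in whole milliseconds first to avoid floating-point rounding.
    total_ms = hb_interval * unit_ms * hb_lost_threshold
    return total_ms / 1000


print(detection_time_seconds(20, 100, 60))  # 120.0 (the example above)
print(detection_time_seconds(2, 100, 6))    # 1.2   (the listed defaults)
```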

 

This can be critical since the HA firmware upgrade process requires the Primary FortiGate to detect that the Secondary FortiGate has gone offline, as that is used to signal that the Secondary has received the image and is proceeding with the firmware upgrade/reboot. Depending on the FortiGate hardware, the Secondary FortiGate can complete its reboot in less than this 120-second window, in which case the Primary FortiGate never sees the Secondary go offline and so does not know that the Secondary has completed its upgrade. As a result, the Primary FortiGate becomes stuck in the SENT-IMAGE state and fails to continue with the firmware upgrade.
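The condition described above can be expressed as a simple comparison between the detection window and the time the Secondary takes to reboot. The following Python sketch is a simplification under stated assumptions: the function name reboot_may_go_unnoticed is hypothetical, and reboot_seconds is a value the administrator would have to measure for the specific hardware (the 90 seconds used here is purely illustrative, not a figure from this article).

```python
def reboot_may_go_unnoticed(hb_interval, unit_ms, hb_lost_threshold, reboot_seconds):
    """Return True if a peer reboot could finish inside the detection window.

    Assumption: heartbeats from the peer are missing for roughly
    `reboot_seconds`, a value measured by the administrator.
    """
    detection_ms = hb_interval * unit_ms * hb_lost_threshold
    return reboot_seconds * 1000 < detection_ms


# Example configuration from this article (120-second window) with an
# illustrative 90-second reboot: the reboot could be missed.
print(reboot_may_go_unnoticed(20, 100, 60, reboot_seconds=90))  # True

# A 10-second window (5 * 100ms * 20) with the same illustrative reboot time.
print(reboot_may_go_unnoticed(5, 100, 20, reboot_seconds=90))   # False
```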

 

Recovering from the stuck state and completing the firmware upgrade:

Once the Primary is in this stuck state, it cannot be upgraded as long as it believes that the Secondary is still working on its upgrade. The suggested process for recovering from this state and upgrading the Primary is as follows:

  1. Temporarily disconnect the heartbeat connections between the Primary unit and the Secondary unit. The best method for doing this will vary depending on the administrator's options:
    • If physical device access is an option, then consider disconnecting the Primary FortiGate from the rest of the network (disconnect data connections first, then heartbeat connections). This isolates the Primary and allows the upgraded Secondary to handle network traffic.
    • Alternatively, consider shutting down the Secondary FortiGate temporarily so that it cannot cause a split-brain HA cluster issue or communicate with the Primary FortiGate.
  2. Reboot the Primary FortiGate to clear the current state of the hatalk daemon and ensure that it sees itself as the sole member of the current HA cluster.
  3. Upgrade the Primary FortiGate to the same firmware version as the Secondary FortiGate.
  4. Once the upgrade is completed, power on/reconnect the Secondary unit to the Primary unit. Confirm that the cluster reforms and that both units eventually become in-sync.
  5. Once the units are in-sync, modify the configuration under config system ha and change the heartbeat interval/threshold settings to shorten the cluster member detection time. For example, the following settings would result in the FortiGate sending heartbeats every half-second (5 * 100ms) and detecting a cluster peer as down within 10 seconds (20 heartbeats * 0.5 seconds):

 

config system ha

set hb-interval 5

set hb-interval-in-milliseconds 100ms

set hb-lost-threshold 20

end
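When choosing different values, it can help to validate them against the ranges listed under Root Cause before pasting them into the CLI. The following is a minimal Python sketch; the function name render_ha_heartbeat_config is hypothetical, and the ranges checked (1-20 for hb-interval, 1-60 for hb-lost-threshold, 100ms or 10ms for the unit) are the ones shown earlier in this article.

```python
def render_ha_heartbeat_config(hb_interval, unit, hb_lost_threshold):
    """Build a "config system ha" snippet after checking the value ranges."""
    if not 1 <= hb_interval <= 20:
        raise ValueError("hb-interval must be between 1 and 20")
    if unit not in ("100ms", "10ms"):
        raise ValueError("hb-interval-in-milliseconds must be 100ms or 10ms")
    if not 1 <= hb_lost_threshold <= 60:
        raise ValueError("hb-lost-threshold must be between 1 and 60")
    lines = [
        "config system ha",
        f"    set hb-interval {hb_interval}",
        f"    set hb-interval-in-milliseconds {unit}",
        f"    set hb-lost-threshold {hb_lost_threshold}",
        "end",
    ]
    return "\n".join(lines)


# Reproduces the example settings above (10-second detection time).
print(render_ha_heartbeat_config(5, "100ms", 20))
```

The sketch only produces text; the resulting block still has to be applied on the FortiGate by the administrator.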

 

In some cases, it may not be possible to disconnect or shut down the HA Secondary unit to prevent interaction with the HA Primary. In those cases, an alternative method is as follows:

  1. Modify the HA configuration settings as per Step 5 above.
  2. Reboot the Primary FortiGate to clear the current state of the hatalk process.
  3. Apply the firmware upgrade to the Primary FortiGate. The Primary FortiGate will attempt to send the firmware file to the Secondary (this is the same version that the Secondary is already running, which is OK), and the Primary will use the new heartbeat thresholds to determine whether the Secondary FortiGate has rebooted.
  4. As long as the Secondary FortiGate reboots, the Primary should be able to detect this occurring and trigger its own upgrade to the new firmware.
  5. After the upgrade, both units should be running the same firmware version and should synchronize their configurations. Check that the new heartbeat threshold settings are still set on both units.
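The final check can be done by comparing the HA configuration text collected from each unit. The following is a minimal Python sketch, assuming the output of config system ha has been copied from both units into plain text in the "set <name> <value>" form used throughout this article; the function names are hypothetical. A setting that does not appear in the text is reported as None, which may simply mean it is at its default value on that unit.

```python
HEARTBEAT_KEYS = (
    "hb-interval",
    "hb-interval-in-milliseconds",
    "hb-lost-threshold",
)


def parse_heartbeat_settings(config_text):
    """Collect the heartbeat-related "set" lines from HA configuration text."""
    settings = {}
    for line in config_text.splitlines():
        parts = line.split()
        if len(parts) == 3 and parts[0] == "set" and parts[1] in HEARTBEAT_KEYS:
            settings[parts[1]] = parts[2]
    return settings


def mismatched_settings(primary_text, secondary_text):
    """Return {setting: (primary_value, secondary_value)} where they differ."""
    primary = parse_heartbeat_settings(primary_text)
    secondary = parse_heartbeat_settings(secondary_text)
    return {
        key: (primary.get(key), secondary.get(key))
        for key in HEARTBEAT_KEYS
        if primary.get(key) != secondary.get(key)
    }


primary_cfg = """\
config system ha
set hb-interval 5
set hb-interval-in-milliseconds 100ms
set hb-lost-threshold 20
end
"""
secondary_cfg = primary_cfg.replace("set hb-lost-threshold 20", "set hb-lost-threshold 60")

print(mismatched_settings(primary_cfg, primary_cfg))    # {}
print(mismatched_settings(primary_cfg, secondary_cfg))  # {'hb-lost-threshold': ('20', '60')}
```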