Troubleshooting Tip: FortiGate High-Availability (HA) Upgrade: Troubleshooting SENT-IMAGE Status Loop
| Description | This article describes a scenario (and the root cause) where performing a firmware upgrade on High-Availability (HA) FortiGate clusters can result in the upgrade failing and the Primary FortiGate being stuck in the SENT-IMAGE state (according to debug output from the hatalk daemon).
A common symptom is that the Secondary FortiGate(s) are upgraded successfully, but the Primary FortiGate never completes its own upgrade. |
| Scope | FortiGate |
| Solution | Before starting: It is important to understand the general procedure for firmware upgrades on HA FortiGate clusters. Refer to the following KB article for further information on this process: Technical Tip: FortiGate HA upgrade procedure and the status during the upgrade.
Issue Description: When the firmware upgrade is initiated, the current Primary FortiGate sends the target firmware image over to the Secondary FortiGate in the HA cluster. The expected procedure is that the Secondary unit receives the firmware image, reboots to apply the image, then assumes the HA Primary role after completing the reboot/upgrade so that the old Primary can proceed with the firmware upgrade.
However, in certain circumstances, the Primary unit may fail to proceed with its own upgrade, even though the Secondary FortiGate has completed the upgrade and is running the new firmware. In this state, the two FortiGates can still form the HA cluster, but they will continually remain out-of-sync with one another due to the firmware version mismatch.
In this scenario, the Primary FortiGate repeatedly prints the messages below when the following hatalk debug command is run, and it never completes its own upgrade:
FortiGate # diagnose debug application hatalk -1
<hatalk> entering hatalk_upgrade_timer_func: uprade_state=3(SENT-IMAGE), daemon_bits=0x00000000
<hatalk> leaving hatalk_upgrade_timer_func: uprade_state=3(SENT-IMAGE), daemon_bits=0x00000000
<hatalk> entering hatalk_upgrade_timer_func: uprade_state=3(SENT-IMAGE), daemon_bits=0x00000000
<hatalk> leaving hatalk_upgrade_timer_func: uprade_state=3(SENT-IMAGE), daemon_bits=0x00000000
<hatalk> entering hatalk_upgrade_timer_func: uprade_state=3(SENT-IMAGE), daemon_bits=0x00000000
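When the debug output has been captured to a file or a terminal log, the loop can be spotted programmatically. The following is a minimal Python sketch, not a Fortinet tool: the function names and the `min_repeats` threshold are assumptions made for illustration, and the pattern is based only on the message format quoted above.

```python
import re

# Matches the state field as it appears in the quoted hatalk output.
# "uprade_state" is spelled exactly as the debug messages print it.
STATE_RE = re.compile(r"uprade_state=\d+\((?P<name>[^)]+)\)")


def upgrade_states(debug_text):
    """Return the upgrade state names in the order they appear."""
    return [m.group("name") for m in STATE_RE.finditer(debug_text)]


def looks_stuck(debug_text, min_repeats=4):
    """Heuristic: True if the last `min_repeats` states are all SENT-IMAGE.

    `min_repeats` is an arbitrary illustrative threshold, not a documented value.
    """
    tail = upgrade_states(debug_text)[-min_repeats:]
    return len(tail) == min_repeats and all(s == "SENT-IMAGE" for s in tail)


sample = """\
<hatalk> entering hatalk_upgrade_timer_func: uprade_state=3(SENT-IMAGE), daemon_bits=0x00000000
<hatalk> leaving hatalk_upgrade_timer_func: uprade_state=3(SENT-IMAGE), daemon_bits=0x00000000
<hatalk> entering hatalk_upgrade_timer_func: uprade_state=3(SENT-IMAGE), daemon_bits=0x00000000
<hatalk> leaving hatalk_upgrade_timer_func: uprade_state=3(SENT-IMAGE), daemon_bits=0x00000000
"""

print(looks_stuck(sample))
```

A positive result only indicates that the captured lines all report SENT-IMAGE; whether the upgrade is truly stuck still depends on how long the state has persisted.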
Root Cause: The most common cause of this issue is a misconfiguration of the following HA settings under config system ha:
config system ha
    set hb-interval <1-20, default = 2>    <--- Time between sending heartbeats
    set hb-interval-in-milliseconds [100ms** | 10ms]    <--- Units of interval time between sending heartbeats
    set hb-lost-threshold <1-60, default = 6>    <--- Number of lost heartbeats to signal a failure
end
As an example, consider the following configuration:
config system ha
    set hb-interval 20
    set hb-interval-in-milliseconds 100ms
    set hb-lost-threshold 60
end
The above configuration sets a heartbeat interval of 2 seconds (hb-interval * hb-interval-in-milliseconds = 20 * 100 ms) with a lost threshold of 60 heartbeats. This means that for a FortiGate cluster member to be detected as down based on HA heartbeats, it must be unreachable for 60 * 2 = 120 seconds in total.
This is critical because the HA firmware upgrade process requires the Primary FortiGate to detect that the Secondary FortiGate has gone offline; that event signals that the Secondary has received the image and is proceeding with the firmware upgrade/reboot. Depending on the FortiGate hardware, the Secondary FortiGate can complete its reboot in less than this 120-second window, so the Primary FortiGate never sees it as down and does not know that the Secondary has completed its upgrade. As a result, the Primary FortiGate becomes stuck in the SENT-IMAGE state and fails to continue with the firmware upgrade.
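The arithmetic above can be sketched in a few lines of Python. This is an illustrative calculation only: the function names are hypothetical, and the 90-second reboot time is an assumed figure for the sake of the example, not a measured value for any FortiGate model.

```python
# Heartbeat unit choices from hb-interval-in-milliseconds, in milliseconds.
UNIT_MS = {"100ms": 100, "10ms": 10}


def detection_window_seconds(hb_interval, unit, hb_lost_threshold):
    """Time a peer must be unreachable before it is declared down.

    window = hb-interval * unit * hb-lost-threshold
    """
    if unit not in UNIT_MS:
        raise ValueError(f"unknown heartbeat unit: {unit!r}")
    return hb_interval * UNIT_MS[unit] * hb_lost_threshold / 1000


def reboot_may_go_unnoticed(window_seconds, reboot_seconds):
    """True if the peer can reboot and return before being declared down."""
    return reboot_seconds < window_seconds


# Example configuration from this article: 20 * 100 ms * 60 heartbeats.
window = detection_window_seconds(20, "100ms", 60)
print(window)  # 120.0

# Assumed reboot time of 90 seconds (hypothetical value).
print(reboot_may_go_unnoticed(window, 90))  # True
```

If the second result is True for a given cluster, the Primary may miss the Secondary's reboot entirely, which matches the stuck SENT-IMAGE behavior described here.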
Recovering from the stuck state and completing the firmware upgrade: Once the Primary is in this stuck state, it cannot be upgraded as long as it believes that the Secondary is still working on its upgrade. The suggested process for recovering from this state and upgrading the Primary is as follows:
config system ha
    set hb-interval 5
    set hb-interval-in-milliseconds 100ms
    set hb-lost-threshold 20
end
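As a rough check on these values, the sketch below reads the three heartbeat settings from a config system ha block and computes the resulting detection window, using the same interval * unit * threshold relationship described in the Root Cause section. The helper names are hypothetical and the parser is deliberately naive (it only understands simple "set <name> <value>" pairs).

```python
def parse_ha_settings(config_text):
    """Collect 'set <name> <value>' pairs from a config block (naive parser)."""
    tokens = config_text.split()
    settings = {}
    for i, token in enumerate(tokens):
        if token == "set" and i + 2 < len(tokens):
            settings[tokens[i + 1]] = tokens[i + 2]
    return settings


def window_from_settings(settings):
    """Detection window in seconds: interval * unit * lost threshold."""
    interval = int(settings["hb-interval"])
    unit_ms = int(settings["hb-interval-in-milliseconds"][:-2])  # drop "ms"
    threshold = int(settings["hb-lost-threshold"])
    return interval * unit_ms * threshold / 1000


recovery_config = """
config system ha
    set hb-interval 5
    set hb-interval-in-milliseconds 100ms
    set hb-lost-threshold 20
end
"""

settings = parse_ha_settings(recovery_config)
print(window_from_settings(settings))  # 10.0
```

With these settings a peer is declared down after 10 seconds of missed heartbeats, compared with 120 seconds for the problematic example configuration.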
In some cases, it may not be possible to disconnect or shut down the HA Secondary unit to prevent interaction with the HA Primary. In those cases, an alternative method is as follows:
|
