FortiGate
FortiGate Next Generation Firewall utilizes purpose-built security processors and threat intelligence security services from FortiGuard labs to deliver top-rated protection and high performance, including encrypted traffic.
pjang
Staff & Editor
Article Id 418850
Description

This article describes a known issue where the HA Primary FortiGate can become stuck in the SENT-IMAGE state during a firmware upgrade. When this occurs, the Secondary unit is often upgraded successfully, but the Primary unit remains on the pre-upgrade firmware and cannot be upgraded, even after being rebooted. Administrators who encounter this issue may also find that it happens every time an upgrade is performed, regardless of the firmware versions involved or the upgrade method used (direct firmware upload, FortiManager-based upgrade, etc.).

Scope

FortiGate, High Availability.
Solution

As a primer, FortiGate HA clusters typically perform 'uninterruptible upgrades' by default, which proceed in the following sequence:

  • Administrator pushes firmware file to the HA Primary FortiGate.
  • The HA Primary FortiGate forwards the firmware file to the Secondary FortiGate(s).
  • Secondary FortiGate(s) reboot to apply firmware. During this process, the Primary FortiGate monitors for HA heartbeats from the Secondary FortiGate(s) to determine if the upgrade occurred successfully.
  • Once the Secondary FortiGates are back online, a new HA Primary is elected and the old HA Primary performs the upgrade.

For more in-depth information on this process, refer to the following KB article: Technical Tip: FortiGate HA upgrade procedure and the status during the upgrade

 

It is during the third step above that the HA cluster can become stuck in the SENT-IMAGE state, typically because the HA heartbeat timers are set excessively long.

 

This occurs because, during the upgrade process, the HA Primary FortiGate monitors the Secondary FortiGate to confirm that it has gone offline (i.e., to reboot and apply the firmware upgrade). It does this by checking whether the heartbeat connection to the Secondary is lost.

 

Critically, if the heartbeat connection does not go down during the Secondary's upgrade process then the Primary will become stuck in the SENT-IMAGE state and will not proceed with the upgrade for a long period of time. This can occur if the Secondary FortiGate completes a reboot faster than the amount of time it takes for a heartbeat connection to be considered lost. For reference, HA heartbeat settings are controlled by the following options:

 

config system ha

set hb-interval <integer> <--- Time between sending heartbeat packets (1 - 20).

set hb-interval-in-milliseconds {100ms | 10ms} <--- Units of heartbeat interval time between sending heartbeat packets. Default is 100ms.

set hb-lost-threshold <integer> <--- Number of lost heartbeats to signal a failure (1 - 60).

end

 

To determine the amount of time that it takes for one cluster member to detect another member as unavailable/down, use the following formula:

 

Heartbeat Failure Period = (hb-interval * hb-interval-in-milliseconds) * hb-lost-threshold

 

As an example, consider the following settings:

 

FortiGate # show full system ha | grep hb-

set hb-interval 20
set hb-interval-in-milliseconds 100ms
set hb-lost-threshold 60

 

In the above example, it would take (20 * 100ms) * 60 = 120000ms (120 seconds) of continuous heartbeat failures before the HA Primary detects the Secondary as down/unavailable. However, if the Secondary reboots faster than that (for example, a FortiGate-VM that takes only 70-80 seconds to reboot), then the heartbeat is never considered lost/down and the Primary FortiGate becomes stuck in the SENT-IMAGE state.
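
The arithmetic above can be checked with a short script. This is only an illustrative sketch: the function names are hypothetical, the reboot time is an estimate that the administrator must supply, and the comparison is a simplified reading of the behavior described in this article rather than the actual FortiOS logic.

```python
def heartbeat_failure_period_ms(hb_interval: int, unit_ms: int, hb_lost_threshold: int) -> int:
    """Return (hb-interval * hb-interval-in-milliseconds) * hb-lost-threshold, in ms."""
    return hb_interval * unit_ms * hb_lost_threshold


def may_stick_in_sent_image(period_ms: int, secondary_reboot_s: float) -> bool:
    """Return True if the Secondary could finish rebooting before the
    Primary would declare the heartbeat lost (simplified assumption)."""
    return secondary_reboot_s * 1000 < period_ms


# Settings from the example: hb-interval 20, 100ms units, hb-lost-threshold 60.
period = heartbeat_failure_period_ms(20, 100, 60)
print(period)                               # 120000
print(may_stick_in_sent_image(period, 75))  # True: a 75-second reboot is shorter than 120 seconds
```

A result of True suggests the timers should be shortened before the next upgrade attempt.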

 

Recommendation:

Ensure that the hb-interval, hb-interval-in-milliseconds, and hb-lost-threshold settings are tuned downward so that the heartbeat failure period is shorter than the time it takes for the FortiGate to reboot. Different FortiGate models take different amounts of time to complete a reboot, so the exact tuning varies (aiming for a heartbeat failure period below 60 seconds may be a good general target). For reference, the default timing settings result in a heartbeat failure period of only 1.2 seconds, i.e. (2 * 100ms) * 6 = 1200ms:

 

Default_FortiGate # show full system ha | grep hb-

set hb-interval 2
set hb-interval-in-milliseconds 100ms
set hb-lost-threshold 6
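
To check a cluster's current timers, the output of the command above can be pasted into a small script. The following is a minimal sketch, assuming the output has the same 'set <option> <value>' layout shown in this article; the function names are hypothetical and the parsing is not an official Fortinet tool.

```python
import re


def parse_hb_settings(cli_output: str) -> dict:
    """Collect the numeric value of every 'set hb-...' line (unit suffixes such as 'ms' are ignored)."""
    settings = {}
    for line in cli_output.splitlines():
        match = re.match(r"\s*set\s+(hb-[\w-]+)\s+(\d+)", line)
        if match:
            settings[match.group(1)] = int(match.group(2))
    return settings


def failure_period_ms(settings: dict) -> int:
    """Apply the heartbeat failure period formula from this article."""
    return (
        settings["hb-interval"]
        * settings["hb-interval-in-milliseconds"]
        * settings["hb-lost-threshold"]
    )


sample = """\
set hb-interval 2
set hb-interval-in-milliseconds 100ms
set hb-lost-threshold 6
"""
print(failure_period_ms(parse_hb_settings(sample)))  # 1200
```

The result is in milliseconds and can be compared against the observed reboot time of the Secondary FortiGate.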

 

Note: If the Primary FortiGate is already stuck in the SENT-IMAGE state, it will not be able to proceed with its own upgrade for a long period of time. The following are some workaround options for this situation:

 

Option 1: Wait for the number of minutes defined by the uninterruptible-primary-wait setting.

  • This is the maximum number of minutes that the HA Primary FortiGate will wait before it assumes that the HA Secondary has been upgraded and proceeds with its own upgrade:

config system ha

set uninterruptible-primary-wait <15 - 300, default = 30 minutes>

end

 

 

Option 2: Reboot the Primary FortiGate.

  • Rebooting the Primary FortiGate will clear the SENT-IMAGE state and allow the FortiGate to re-attempt the firmware upgrade. As noted above, ensure that the heartbeat failure period is shortened to less than the time it takes for the Secondary FortiGate to complete a reboot; otherwise, the issue will simply happen again.

 

Option 3: Separate the HA FortiGate members and upgrade them individually.

  • If for some reason it is not possible to modify the heartbeat timer settings, then the alternative is to separate the HA cluster members so that they can be upgraded individually. This will prevent the FortiGates from needing to check for a heartbeat connection drop, which will avoid the issue.
  • Ensure that the FortiGate cluster members are not connected to the same data connections while the heartbeat interfaces are disconnected, as this could otherwise cause a split-brain scenario and disrupt the network.

 

Related documents:

Technical Tip: FortiGate HA upgrade procedure and the status during the upgrade

Technical Tip: Interval between HA heartbeats

Troubleshooting Tip: How to troubleshoot HA 'Heartbeat packet lost' issues in a FortiGate HA Cluste...
