Technical Tip: HA Cluster failover still occurs even when monitored interface recovers before failover-hold-time expires

acevik · ‎03-27-2025

Description

This article describes the failover-hold-time function for High Availability (HA) FortiGate clusters as well as an expected behavior regarding HA failovers when the following conditions are met:

The override setting is not enabled for any HA cluster members.
An HA monitored interface goes down and then quickly recovers on the Primary FortiGate.
failover-hold-time is configured to a non-zero value.

Important: The failover-hold-time setting is not meant to prevent HA failovers when monitored interfaces are flapping (rapidly going down and coming back up). Instead, it is meant to reduce how often failovers occur. The exact mechanism is discussed in further detail below.

Scope

FortiGate.

Solution

As a primer, the failover-hold-time setting under config system ha was first added in FortiOS v7.0.0. It specifies how long, in seconds, the Primary FortiGate waits after a monitored interface changes state (either going down OR coming back up) before it evaluates whether to trigger an HA failover. See also: Technical Tip: How to configure HA failover delay for monitored ports.

However, it is important to understand that two separate actions occur whenever an HA monitored interface changes state. Both are triggered once the failover-hold-time has expired following the most recent change in interface state:

The HA cluster compares the number of monitored interfaces that are online between the Primary and Secondary FortiGates and will conduct a failover if the Primary has fewer online interfaces.
The HA cluster will reset the HA cluster uptime of the FortiGate that had the monitored interface change state. Notably, this always occurs after the failover-hold-time expires, even if the monitored interface recovers before the timer expires.
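
The timing rule above can be illustrated with a minimal Python sketch. It is not FortiOS code: the function name evaluation_time and the sample timestamps are hypothetical, and the only behavior modeled is the one described in this article, namely that the two actions run failover-hold-time seconds after the most recent state change.

```python
from datetime import datetime, timedelta

def evaluation_time(state_changes, hold_seconds):
    """Return when the deferred HA actions would run.

    Assumption (taken from the article text, not from FortiOS source):
    the actions run hold_seconds after the MOST RECENT state change,
    so a recovery restarts the wait instead of cancelling it.
    """
    if not state_changes:
        raise ValueError("at least one state change is required")
    return max(state_changes) + timedelta(seconds=hold_seconds)

# Hypothetical flap: interface goes down at 12:00:00 and recovers at 12:00:02.
changes = [
    datetime(2025, 1, 1, 12, 0, 0),  # link down
    datetime(2025, 1, 1, 12, 0, 2),  # link up again
]
fires_at = evaluation_time(changes, hold_seconds=5)
print(fires_at)
```

In this sketch the evaluation lands 5 seconds after the recovery, not 5 seconds after the initial failure, and it still runs even though the interface is already back up.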

Additionally, as per the HA primary unit selection criteria documentation, the default process for electing the HA Primary unit is based on Monitored Interfaces -> Uptime -> Priority -> Serial Number. If override is enabled for one or more of the FortiGate cluster members, the order changes to Monitored Interfaces -> Priority -> Uptime -> Serial Number. As a result, the failover-hold-time setting can produce different outcomes depending on whether or not override has been enabled for one of the cluster members.

Consider the following scenario where a) failover-hold-time is set to 5 seconds, b) a monitored interface on the Primary FortiGate goes down and then recovers within 2 seconds, and c) the Primary has a superior/higher priority to the Secondary:

If override is disabled, the HA cluster uptimes of the Primary and Secondary FortiGates are compared, since the interface has recovered and both units have the same number of monitored interfaces online. Because the uptime of the Primary FortiGate is reset to 0 when failover-hold-time expires, the Secondary FortiGate has the superior uptime, and an HA failover is triggered.
If override is enabled, then the HA priority is compared between the Primary and Secondary FortiGates. The Primary has a superior priority, and so it remains the cluster primary, and an HA failover is not triggered.
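
The two outcomes can be reproduced with a small Python sketch of the selection order described above. This is a simplified model under stated assumptions, not the FortiOS election algorithm: the Member class, the elect_primary function, and the sample uptime, priority, and serial values are all hypothetical. Any finer points of the real election (for example, how small uptime differences are treated, or how serial numbers are compared) are deliberately left out.

```python
from dataclasses import dataclass

@dataclass
class Member:
    name: str
    monitored_up: int   # number of monitored interfaces currently online
    uptime: int         # HA cluster uptime in seconds
    priority: int       # HA priority (higher is better)
    serial: str         # placeholder serial; string comparison is an assumption

def elect_primary(members, override):
    """Pick the member that ranks highest under the documented ordering."""
    def key(m):
        if override:
            # Monitored Interfaces -> Priority -> Uptime -> Serial Number
            return (m.monitored_up, m.priority, m.uptime, m.serial)
        # Monitored Interfaces -> Uptime -> Priority -> Serial Number
        return (m.monitored_up, m.uptime, m.priority, m.serial)
    return max(members, key=key)

# State after the hold timer expires: the interface has recovered (equal counts),
# but the Primary's HA uptime was reset to 0.
primary = Member("primary", monitored_up=2, uptime=0, priority=200, serial="SN-A")
secondary = Member("secondary", monitored_up=2, uptime=86400, priority=100, serial="SN-B")

without_override = elect_primary([primary, secondary], override=False).name
with_override = elect_primary([primary, secondary], override=True).name
print(without_override, with_override)
```

With equal interface counts, the only difference between the two runs is whether uptime or priority is compared first, which is why the same flap causes a failover in one case and not in the other.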

Example Scenario:

The example below further demonstrates how the FortiGate behaves when a monitored interface goes down and then recovers before the failover hold timer expires. For reference, the following HA configuration has override disabled and failover-hold-time set to 60 seconds:

config system ha
    set group-id 2
    set group-name "group1"
    set mode a-p
    set hbdev "ha1" 100 "ha2" 0
    set session-pickup enable
    set override disable
    set priority 200
    set monitor "LACP-1" "LACP-2"
    set failover-hold-time 60
end

In the following System Event logs, the monitored interface can be seen going down and then recovering within 17 seconds:

date="2025-02-06" time="14:03:45" id=7468291105147062624 bid=48863959 dvid=107 itime=1738847025 euid=3 epid=3 dsteuid=3 dstepid=3 logver=700140601 logid="0100020099" type="event" subtype="system" level="warning" action="interface-stat-change" msg="Link monitor: Interface LACP-1 was turned down" logdesc="Interface status changed" status="DOWN" eventtime=1738847025636361195 tz="+0100" devid="FG100XXX" vd="root" devname="FR530"

date="2025-02-06" time="14:04:02" id=7468291178161505338 bid=48863977 dvid=107 itime=1738847042 euid=3 epid=3 dsteuid=3 dstepid=3 logver=700140601 logid="0100020099" type="event" subtype="system" level="warning" action="interface-stat-change" msg="Link monitor: Interface LACP-1 was turned up" logdesc="Interface status changed" status="UP" eventtime=1738847042336333855 tz="+0100" devid="FG100XXX" vd="root" devname="FR530"

Likewise, the output of diagnose sys ha history read shows similar information (newest entries first). Note how a failover was triggered 60 seconds after the monitored interface came back online. This is expected, as the failover-hold-time timer is started by any state change on a monitored interface (going down or coming back online).

<2025-02-06 14:05:03> FG100FTKXXXX is elected as the cluster primary of 2 member <---- Failover happens exactly 1 minute after the monitored interface comes up.
<2025-02-06 14:04:03> port INT-LACP-1 link status changed: 0->1 <---- Monitored interface comes up 17 seconds after.
<2025-02-06 14:04:02> hbdev ha2 link status changed: 0->1
<2025-02-06 14:04:02> port ha2 link status changed: 0->1
<2025-02-06 14:03:46> port INT-LACP-1 link status changed: 1->0 <---- Monitored interface goes down.
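
As a quick check of the timings in this output, the following Python sketch parses the history lines shown above (with the annotations removed) and computes the two intervals. The regular expression and the variable names are illustrative assumptions about this particular output, not a general parser for diagnose sys ha history read.

```python
import re
from datetime import datetime

# History lines copied from the output above, annotations stripped.
HISTORY = """\
<2025-02-06 14:05:03> FG100FTKXXXX is elected as the cluster primary of 2 member
<2025-02-06 14:04:03> port INT-LACP-1 link status changed: 0->1
<2025-02-06 14:04:02> hbdev ha2 link status changed: 0->1
<2025-02-06 14:04:02> port ha2 link status changed: 0->1
<2025-02-06 14:03:46> port INT-LACP-1 link status changed: 1->0
"""

LINE = re.compile(r"^<(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})> (.*)$")

def parse(text):
    """Return a list of (timestamp, message) tuples for each history line."""
    events = []
    for line in text.splitlines():
        match = LINE.match(line.strip())
        if match:
            stamp = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
            events.append((stamp, match.group(2)))
    return events

events = parse(HISTORY)
monitored = [t for t, msg in events if msg.startswith("port INT-LACP-1 ")]
elected = [t for t, msg in events if "is elected as the cluster primary" in msg]

# Time the monitored interface was down, and the delay from recovery to election.
outage_seconds = (max(monitored) - min(monitored)).total_seconds()
delay_seconds = (elected[0] - max(monitored)).total_seconds()
print(outage_seconds, delay_seconds)
```

For this output the sketch reports a 17-second outage and a 60-second gap between the recovery and the election, which matches the configured failover-hold-time.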

Conclusion:

As per the above demonstration, failover-hold-time is not meant to directly prevent failovers when a monitored interface goes down and then recovers quickly (also known as 'interface flapping'). Instead, it only adds a delay so that the HA cluster does not constantly perform a cluster election assessment. To fully prevent HA failovers caused by monitored interface flapping, the override setting must also be enabled and the Primary FortiGate must hold the higher priority, so that it retains the HA primary role after the failover-hold-time expires. Note that the override setting and failover-hold-time are otherwise not directly related to one another.

