Technical Tip: Changing the HA heartbeat timers to prevent false failover

oconnort · ‎01-13-2021

Description

This article explains how to change HA heartbeat timers to prevent false or unwanted failover from occurring.

Scope

FortiGate, HA clusters.

Solution

If the CPU of a unit in an HA cluster becomes very busy, that unit may not be able to send heartbeat packets before the heartbeat timer elapses.
When heartbeat packets are not sent in time, the cluster may experience a failover as other units report that the busy cluster unit did not respond.

The CPU of a cluster unit may become very busy if the cluster is subject to a SYN flood attack, if network traffic is very heavy, or for other similar reasons.

Use the following configuration parameters to tune the timers for HA heartbeat packets:

# hb-lost-threshold <threshold_integer>

The lost heartbeat threshold is the number of consecutive heartbeat packets that must not be received from another cluster unit before the unit is assumed to have failed.
The default value is 6, meaning that if 6 heartbeat packets in a row are not received from a cluster unit, that cluster unit is considered to have failed. The default can differ by model; for example, on VM models it can be 20. The range is 1 to 60 packets.

If the primary cluster unit does not receive a heartbeat packet from a subordinate unit before the heartbeat threshold expires, the primary unit assumes that the subordinate unit has failed.

The same occurs in reverse if the subordinate unit does not receive a heartbeat packet from the primary unit, which causes the subordinate unit to begin negotiating to become the new primary unit.

The lower the lost heartbeat threshold, the faster the cluster responds to a failure. However, the lost heartbeat threshold can be increased if repeated failovers occur because cluster units cannot send heartbeat packets quickly enough.

# hello-holddown <holddown_integer>

The hello state hold-down time is the number of seconds that a cluster unit waits before changing from 'hello' state to 'work' state. A cluster unit changes from hello state to work state when it starts up.
The hello state hold-down time range is 5 to 300 seconds. The hello state hold-down time default is 20 seconds.

# hb-interval <interval_integer>

The heartbeat interval is the time between sending heartbeat packets.
The heartbeat interval range is 1 to 20, in units of 100 ms. The heartbeat interval default is 2 (200 ms).

A heartbeat interval of 2 means the time between heartbeat packets is 200 ms. Changing the heartbeat interval to 5 changes the time between heartbeat packets to 500 ms.

HA heartbeat packets consume more bandwidth if the hb-interval is short. However, if the hb-interval is very long, the cluster is not as sensitive to topology and other network changes.
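
As a rough illustration of this trade-off, the heartbeat rate follows directly from the interval. The short Python sketch below only converts an hb-interval setting (in units of 100 ms) to milliseconds and to packets per second. The function names are hypothetical, the range is the one quoted in this article, and the sketch says nothing about packet size or actual bandwidth use.

```python
HB_INTERVAL_UNIT_MS = 100  # one hb-interval unit is 100 ms, per this article


def interval_ms(hb_interval: int) -> int:
    """Convert an hb-interval setting (1 to 20) to milliseconds."""
    if not 1 <= hb_interval <= 20:
        raise ValueError("hb-interval must be between 1 and 20")
    return hb_interval * HB_INTERVAL_UNIT_MS


def heartbeats_per_second(hb_interval: int) -> float:
    """Heartbeat packets per second implied by the given setting."""
    return 1000 / interval_ms(hb_interval)


for setting in (1, 2, 5, 20):
    print(f"hb-interval {setting}: {interval_ms(setting)} ms, "
          f"{heartbeats_per_second(setting):g} packets/s")
```

With the default of 2, this works out to 5 heartbeat packets per second; raising the interval to 5 reduces that to 2 per second.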

Example configuration:

config system ha
   set hb-lost-threshold 6
   set hello-holddown 20
   set hb-interval 2
end

In this configuration example, if a unit does not receive 6 consecutive heartbeat packets (6 * 200 ms = 1.2 seconds) from another cluster unit, that cluster unit is considered to have failed.
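
The same arithmetic applies to any combination of the two settings. The Python sketch below computes the resulting detection window and, as a sizing aid, the smallest threshold that tolerates a given worst-case heartbeat delay. Both function names are hypothetical, the ranges are the ones quoted in this article, and the sizing helper assumes a peer is declared failed as soon as the window elapses without a heartbeat.

```python
def detection_window_ms(hb_lost_threshold: int, hb_interval: int) -> int:
    """Time without heartbeats, in ms, before a peer is assumed failed."""
    if not 1 <= hb_lost_threshold <= 60:
        raise ValueError("hb-lost-threshold must be between 1 and 60")
    if not 1 <= hb_interval <= 20:
        raise ValueError("hb-interval must be between 1 and 20")
    return hb_lost_threshold * hb_interval * 100


def min_threshold_for_delay(max_delay_ms: int, hb_interval: int) -> int:
    """Smallest threshold whose window is strictly longer than max_delay_ms.

    Assumption: a heartbeat delayed by exactly the window length already
    counts as a failure, so the window must exceed the delay.
    """
    threshold = max_delay_ms // (hb_interval * 100) + 1
    if threshold > 60:
        raise ValueError("delay cannot be covered at this hb-interval")
    return threshold


print(detection_window_ms(6, 2))          # 1200 ms, the example above
print(detection_window_ms(20, 2))         # 4000 ms, with a threshold of 20
print(min_threshold_for_delay(1500, 2))   # 8, giving a 1600 ms window
```

For instance, if heartbeats are occasionally delayed by up to 1.5 seconds at the default interval, a threshold of 8 would be the minimum under these assumptions; the right value for a real cluster depends on its traffic and failover requirements.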

Related article:

Troubleshooting Tip: Fortigate HA message 'HA master heartbeat interface intf_name lost neighbor inf...

Ahmed_M · ‎09-23-2022

Regarding hb-lost-threshold, this article is not accurate. For example:

config system ha
set hb-lost-threshold 6
set hb-interval 2
end

hb-lost-threshold = 6 does not mean that the HA peer is considered lost once 6 consecutive heartbeat packets have been lost. hb-lost-threshold is not a packet counter; it is a timer. First calculate the lost time: 6 * 2 * 100 = 1200 ms. According to the lost threshold, a heartbeat packet must arrive within 1.2 seconds for the peer not to be considered lost.

Even if no heartbeat packets are lost, a single packet that is delayed and takes more than 1200 ms to arrive (e.g. 1.5 seconds) is enough to declare the peer lost. There is a huge difference.

oconnort · ‎11-28-2022

The article does not describe it as a counter, but as a value within which 6 heartbeat packets should have been received. As you have correctly stated, that corresponds to a time of hb-lost-threshold * hb-interval * 100 ms, after which a unit assumes that the other cluster unit has failed.