Troubleshooting Tip: How to troubleshoot HA 'Heartbeat packet lost' issues in a FortiGate HA Cluster

sthampi_FTNT · ‎08-07-2023

Description

This article describes 'Heartbeat packet lost' errors in HA log messages and offers ways to identify the root cause and fix it.

Scope

FortiGate running in HA mode (FGCP HA Active-Passive or Active-Active).

Solution

When a FortiGate is running in HA FGCP mode, Active-Passive or Active-Active, HA Log messages may appear indicating there was a 'Heartbeat packet lost':

date=2023-05-30 time=18:23:19 eventtime=1685463799973776644 tz="+0200" logid="0108037910" type="event" subtype="ha" level="critical" vd="root" logdesc="Heartbeat packet lost" msg="Heartbeat packet lost" ha_role="primary" devintfname="port9"

This log message means that the HA peer did not receive an HA heartbeat packet within the heartbeat timeout (the heartbeat interval multiplied by the lost heartbeat threshold).
With the default values, this timeout is 1.2 seconds, calculated as shown below. See the HA heartbeat section of the FortiGate cookbook for more information.

hb-interval * hb-interval-in-milliseconds * hb-lost-threshold = 2 * 100 millisecond * 6 = 1.2 seconds (default values)
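
The calculation above can be reproduced with a few lines of Python. This is only a sketch of the arithmetic. The function name is hypothetical, and the second call uses arbitrary example values rather than recommended settings.

```python
def heartbeat_timeout_seconds(hb_interval: int = 2, hb_lost_threshold: int = 6) -> float:
    """Return the time, in seconds, after which a heartbeat is considered lost."""
    unit_ms = 100  # hb-interval is expressed in units of 100 milliseconds
    return hb_interval * unit_ms * hb_lost_threshold / 1000


# Default values from the formula above: 2 * 100 ms * 6
print(heartbeat_timeout_seconds())

# Arbitrary larger values, only to show how the timeout scales
print(heartbeat_timeout_seconds(hb_interval=4, hb_lost_threshold=12))
```

Doubling both settings multiplies the timeout by four, so timer changes are worth calculating before they are applied.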

This article provides the troubleshooting tips required to narrow down the cause of this issue.

Troubleshooting steps:

Run 'get system ha status' to see if there are dropped packets or errors on the heartbeat interface:

get system ha status | grep "ha1"
ha1: physical/10000auto, up, rx-bytes/packets/dropped/errors=8013531615/5586193/486/0, tx=256040866/180012/0/0
ha1: physical/10000auto, up, rx-bytes/packets/dropped/errors=8005393949/5575798/549/0, tx=261722674/182376/0/0
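
When this output is collected from many units, the counters can be pulled out with a small script. The sketch below parses one interface line in the format shown above. The regular expression and function name are hypothetical, and it assumes the tx counters follow the same bytes/packets/dropped/errors order as the rx counters.

```python
import re

# Matches lines such as:
# ha1: physical/10000auto, up, rx-bytes/packets/dropped/errors=1/2/3/4, tx=5/6/7/8
HA_IF_RE = re.compile(
    r"^(?P<name>\S+): .*?rx-bytes/packets/dropped/errors="
    r"(?P<rx_bytes>\d+)/(?P<rx_packets>\d+)/(?P<rx_dropped>\d+)/(?P<rx_errors>\d+), "
    r"tx=(?P<tx_bytes>\d+)/(?P<tx_packets>\d+)/(?P<tx_dropped>\d+)/(?P<tx_errors>\d+)"
)


def parse_ha_interface_line(line):
    """Return (interface_name, counters) or None if the line does not match."""
    match = HA_IF_RE.match(line.strip())
    if match is None:
        return None
    fields = match.groupdict()
    name = fields.pop("name")
    return name, {key: int(value) for key, value in fields.items()}


line = ("ha1: physical/10000auto, up, "
        "rx-bytes/packets/dropped/errors=8013531615/5586193/486/0, "
        "tx=256040866/180012/0/0")
name, counters = parse_ha_interface_line(line)
print(name, counters["rx_dropped"], counters["rx_errors"])
```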

Run 'diagnose hardware deviceinfo nic <heartbeat port name>' to see if there are dropped packets in the physical interface counters:

diagnose hardware deviceinfo nic ha1 | grep drop
rx_dropped 4333
tx_dropped 0
port.rx_dropped 0
port.tx_dropped_link_down 0
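
These counters are cumulative, so a single reading does not show whether drops are still occurring. One approach is to take two readings some time apart and compare them. The sketch below does that. The helper names are hypothetical, and the second snapshot contains made-up values used only for illustration.

```python
def parse_counters(text):
    """Parse 'name value' lines into a dict of integer counters."""
    counters = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            counters[parts[0]] = int(parts[1])
    return counters


def counter_deltas(before, after):
    """Return only the counters whose value changed between two snapshots."""
    return {
        name: after[name] - before[name]
        for name in before
        if name in after and after[name] != before[name]
    }


first = parse_counters("""rx_dropped 4333
tx_dropped 0
port.rx_dropped 0
port.tx_dropped_link_down 0""")

# Hypothetical second reading, taken later on the same interface
second = parse_counters("""rx_dropped 4383
tx_dropped 0
port.rx_dropped 0
port.tx_dropped_link_down 0""")

print(counter_deltas(first, second))
```

A counter that keeps growing between readings points to an ongoing problem, while a static value may be left over from an earlier event.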

Check for high CPU usage due to user-level processes. The HA heartbeat packet is generated by the 'hatalk' process. To observe its activity, run the hatalk debug:

diag debug application hatalk -1

diag debug application hasync -1
diag debug enable
<hatalk> cfg_changed is set to 0: hatalk_packet_setup_heartbeat
<hatalk> setup new heartbeat packet: hbdev='port22', packet_version=473
<hatalk> options buf is small: opt_type=41(DEVINFO), opt_sz=13806, buf_sz=8726
<hatalk> pack compressed dev_info: dev_nr=37, orig_sz=13800, z_len=244

The output of 'diag sys top' can be used to verify whether there is high CPU usage with the 'hatalk' process.

Having high CPU usage in the hasync process should not directly affect the heartbeat packet processing. However, if it is high and if both processes are running on the same CPU core, it could have an indirect effect.
Similarly, if there is constant high CPU usage due to other processes, that needs to be addressed as well.

Check for high CPU usage due to a system or kernel process such as irq or softirq. If there is high CPU usage on a specific CPU core or a group of CPU cores, it will have a direct effect on the HA heartbeat packet processing. This will make it necessary to identify and fix such high CPU issues.

Verify these issues with the following commands:

get system performance status

CPU states: 0% user 99% system 0% nice 1% idle 0% iowait 0% irq 0% softirq
CPU0 states: 0% user 99% system 0% nice 1% idle 0% iowait 0% irq 0% softirq
CPU1 states: 0% user 98% system 0% nice 2% idle 0% iowait 0% irq 0% softirq

diag sys mpstat 1

diag sys profile
start start kernel profiling data
stop copy kernel profiling data
show show kernel profiling result
sysmap show kernel sysmap
cpumask profile which CPUs [Take 0-10 arg(s)]
step set profile step
module show kernel module

fnsysctl cat /proc/interrupts
fnsysctl cat /proc/softirqs
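
The per-core lines printed by 'get system performance status' can be scanned automatically to spot cores that spend most of their time in system, irq, or softirq context. The following is a minimal sketch. The function names and the 90% threshold are assumptions chosen for illustration, not values from Fortinet.

```python
import re

STATE_RE = re.compile(r"(\d+)% (\w+)")


def parse_cpu_states(text):
    """Map each 'CPU... states:' line to a dict such as {'user': 0, 'system': 99}."""
    cores = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(" states:")
        if not sep:
            continue  # not a CPU states line
        cores[name.strip()] = {state: int(pct) for pct, state in STATE_RE.findall(rest)}
    return cores


def busy_cores(cores, threshold=90):
    """Return the names whose system + irq + softirq share reaches the threshold."""
    return [
        name
        for name, states in cores.items()
        if states.get("system", 0) + states.get("irq", 0) + states.get("softirq", 0) >= threshold
    ]


sample = """CPU states: 0% user 99% system 0% nice 1% idle 0% iowait 0% irq 0% softirq
CPU0 states: 0% user 99% system 0% nice 1% idle 0% iowait 0% irq 0% softirq
CPU1 states: 0% user 98% system 0% nice 2% idle 0% iowait 0% irq 0% softirq"""

cores = parse_cpu_states(sample)
print(busy_cores(cores))
```

The entry named 'CPU' is the overall average, and the numbered entries are individual cores.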

From the interrupt counters, find the CPU core numbers that receive the heartbeat packets and check whether there are short CPU spikes on those cores.

For example, the output of the command 'fnsysctl cat /proc/interrupts' shows that the TxRx queues of the ha1 heartbeat interface [i40e-ha1-TxRx-31, i40e-ha1-TxRx-33] are mapped to CPU cores CPU75 and CPU77 respectively.

Similarly, it is necessary to find the core numbers for other host queues of the heartbeat interface and verify if there are CPU spikes on these cores. Take actions as necessary if these are present.
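
Mapping queues to cores by eye is tedious on units with many cores, so the lookup can be scripted. The sketch below assumes the output follows the usual Linux /proc/interrupts layout: a header row of CPU names, then one row per interrupt with one counter per CPU and the queue name last. The sample text is hypothetical and shortened to four cores. The function name is also hypothetical.

```python
def busiest_cpu_per_queue(interrupts_text, interface):
    """Return {queue_name: cpu_name} for queues belonging to the given interface.

    The CPU reported for a queue is the one with the highest interrupt count.
    """
    lines = [line for line in interrupts_text.splitlines() if line.strip()]
    cpus = lines[0].split()  # header row: CPU0 CPU1 ...
    result = {}
    for line in lines[1:]:
        fields = line.split()
        counts = fields[1:1 + len(cpus)]
        # Skip rows that do not carry one numeric counter per CPU plus a name
        if len(fields) < len(cpus) + 2 or not all(c.isdigit() for c in counts):
            continue
        queue = fields[-1]
        if f"-{interface}-" not in queue:
            continue
        values = [int(c) for c in counts]
        result[queue] = cpus[values.index(max(values))]
    return result


# Hypothetical, shortened sample; real output has many more rows and columns
sample = """           CPU0       CPU1       CPU2       CPU3
 98:          0     120034          0          0   PCI-MSI-edge      i40e-ha1-TxRx-0
 99:          0          0     118771          0   PCI-MSI-edge      i40e-ha1-TxRx-1
100:       5012          0          0          0   PCI-MSI-edge      i40e-port1-TxRx-0
"""

print(busiest_cpu_per_queue(sample, "ha1"))
```

The cores returned for the heartbeat interface are the ones to watch for spikes in the CPU usage output.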

Collect heartbeat packet captures during the 'heartbeat packet loss' issue from both the primary and secondary units, then use them to verify whether the heartbeat packets sent from the primary are received on the secondary and vice versa.

Packet capture commands:

HA primary:

diag sniffer packet any 'ether proto 0x8890' 4 0 l | grep ha1
2023-06-05 16:52:15.630003 ha1 out Ether type 0x8890 printer hasn't been added to sniffer.
2023-06-05 16:52:15.698791 ha1 in Ether type 0x8890 printer hasn't been added to sniffer.
2023-06-05 16:52:15.740012 ha1 out Ether type 0x8890 printer hasn't been added to sniffer.
2023-06-05 16:52:15.798792 ha1 in Ether type 0x8890 printer hasn't been added to sniffer.
2023-06-05 16:52:15.840003 ha1 out Ether type 0x8890 printer hasn't been added to sniffer.

HA secondary:

diag sniffer packet any 'ether proto 0x8890' 4 0 l | grep ha1
2023-06-05 16:52:15.822283 ha1 out Ether type 0x8890 printer hasn't been added to sniffer.
2023-06-05 16:52:15.863515 ha1 in Ether type 0x8890 printer hasn't been added to sniffer.
2023-06-05 16:52:15.932283 ha1 out Ether type 0x8890 printer hasn't been added to sniffer.
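
Long captures are easier to review with a script that measures the time between consecutive heartbeat packets in each direction. The sketch below parses sniffer lines in the format shown above and reports the largest gap. The function names are hypothetical, and it assumes every line starts with a full 'YYYY-MM-DD HH:MM:SS.ffffff' timestamp followed by the interface name and direction.

```python
from datetime import datetime


def parse_sniffer_line(line):
    """Return (timestamp, interface, direction) or None for unparseable lines."""
    parts = line.split()
    if len(parts) < 4:
        return None
    try:
        stamp = datetime.strptime(f"{parts[0]} {parts[1]}", "%Y-%m-%d %H:%M:%S.%f")
    except ValueError:
        return None
    return stamp, parts[2], parts[3]


def max_gap_seconds(lines, interface, direction):
    """Return the largest gap between consecutive packets for one direction."""
    parsed = (parse_sniffer_line(line) for line in lines)
    times = [p[0] for p in parsed if p and p[1] == interface and p[2] == direction]
    gaps = [(later - earlier).total_seconds() for earlier, later in zip(times, times[1:])]
    return max(gaps, default=0.0)


# Lines taken from the primary capture above
capture = """2023-06-05 16:52:15.630003 ha1 out Ether type 0x8890 printer hasn't been added to sniffer.
2023-06-05 16:52:15.698791 ha1 in Ether type 0x8890 printer hasn't been added to sniffer.
2023-06-05 16:52:15.740012 ha1 out Ether type 0x8890 printer hasn't been added to sniffer.
2023-06-05 16:52:15.798792 ha1 in Ether type 0x8890 printer hasn't been added to sniffer.
2023-06-05 16:52:15.840003 ha1 out Ether type 0x8890 printer hasn't been added to sniffer.""".splitlines()

print(max_gap_seconds(capture, "ha1", "out"))
print(max_gap_seconds(capture, "ha1", "in"))
```

A gap on the 'in' side of one unit with no matching gap on the 'out' side of its peer suggests the loss happened on the path between the units.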

If there is a difference in the number of packets sent and received in the previous step, it is important to double-check the connectivity between the FortiGate HA Members. If the connection goes through a switch and VLANs are used, use dedicated VLANs for heartbeat traffic so that it doesn't interfere with other data traffic.
In units that have a dedicated HA interface (for example, HA1 and HA2 ports), it is recommended to use these ports because they are directly connected to the CPU of the FortiGate and not connected through an NPU or ISF.

If a network port is used for exchanging heartbeat packets (for example, port1-port32), it should be chosen carefully by looking at the internal hardware schematics of that particular FortiGate model. For example, the internal schematics of the FortiGate 3600E differ from those of the FortiGate 3700D.
If the chosen heartbeat port shares the same internal path as a heavily used network interface, it could lead to sub-optimal packet processing.
Additionally, it is possible to increase the heartbeat timers to increase the fault tolerance. See this article for more information.
It is also possible to reduce the load on the heartbeat link and improve the performance of the HA cluster by selecting one or more separate ports to use for session synchronization, as explained in this article.
By default, session synchronization activity takes place over the HA heartbeat link using TCP/703 and UDP/703.
If there is a large number of session synchronizations, this can cause network congestion and impact the HA cluster communication.
Collect the HA Event logs, System Event logs, and Traffic logs during the time of the issue and share them with TAC. See this article for instructions on how to gather them.
