Sepideh
Staff
March 18, 2026

Troubleshooting Tip: LACP Interface Flaps and HA Heartbeat Packet Loss

Description

This article describes how to troubleshoot LACP interface flaps and HA heartbeat packet loss on FortiGate 4401F appliances; the steps can also apply to other FortiGate models. It addresses scenarios in which logical aggregate interfaces experience intermittent LACP flaps and in which, after these flaps are resolved, the HA heartbeat interfaces may start to lose heartbeat packets.

Scope

FortiGate.

Solution

Scenario 1: Aggregate interface flaps.

In this section, the focus is on the intermittent flapping of logical aggregate interfaces. The key components addressed include symptoms, troubleshooting, root cause analysis, and resolution steps.

 

Symptoms:

The following symptoms were the initial indicators that led to the start of the troubleshooting process:

  • Logical aggregate interfaces intermittently go up and down. In a specific case, this was observed during a particular time of day when peak traffic bursts occurred.

 

type="event" subtype="system" level="warning" vd="root" logdesc="Interface status changed" action="interface-stat-change" status="UP" msg="Link monitor: Interface AEx was turned up"

type="event" subtype="system" level="warning" vd="root" logdesc="Interface status changed" action="interface-stat-change" status="DOWN" msg="Link monitor: Interface AEx was turned down"

 

  • When the physical switches connected to the FortiGate interface experiencing LACP flaps were examined, it was observed that the physical interfaces on the switch were going down. However, on FortiGate, the physical interfaces underlying the logical ones were not affected, and only the logical interfaces themselves were seen flapping.
  • As a result of these interface flaps, FortiGuard sessions were frequently disrupted, and BGP sessions were also affected.

 

type="event" subtype="system" level="warning" vd="root" logdesc="FortiGuard webfilter reachable" msg="Fortiguard webfilter services are reachable"

type="event" subtype="system" level="error" vd="root" logdesc="URL filter packet send failure" process="urlfilter" reason="Connection refused" msg="failed to send urlfilter packet"

type=event subtype=router level=warning msg=BGP: %BGP-5-ADJCHANGE: VRF 0 neighbor x.x.x.x Down User reset logdesc=BGP neighbor status changed

type=event subtype=router level=warning msg=BGP: %BGP-5-ADJCHANGE: VRF 0 neighbor x.x.x.x Up logdesc=BGP neighbor status changed

 

  • A review of bandwidth utilization graphs revealed traffic spikes during these periods, indicating bursts in traffic load.
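
When many event logs have been exported, it can help to count how often each aggregate interface went down before correlating the flaps with the traffic graphs. The following is a minimal Python sketch that assumes the logs use the quoted key="value" layout of the interface samples above; the function name count_flaps is hypothetical and is not part of any Fortinet tool.

```python
import re
from collections import Counter

# Matches key="value" pairs as they appear in the quoted event-log samples.
PAIR_RE = re.compile(r'(\w+)="([^"]*)"')
# Pulls the interface name and direction out of the msg field.
IFACE_RE = re.compile(r"Interface (\S+) was turned (up|down)")


def count_flaps(log_lines):
    """Return a Counter of 'turned down' events per interface."""
    downs = Counter()
    for line in log_lines:
        fields = dict(PAIR_RE.findall(line))
        if fields.get("action") != "interface-stat-change":
            continue  # not an interface status log
        match = IFACE_RE.search(fields.get("msg", ""))
        if match and match.group(2) == "down":
            downs[match.group(1)] += 1
    return downs


sample = [
    'type="event" subtype="system" level="warning" vd="root" '
    'logdesc="Interface status changed" action="interface-stat-change" '
    'status="UP" msg="Link monitor: Interface AEx was turned up"',
    'type="event" subtype="system" level="warning" vd="root" '
    'logdesc="Interface status changed" action="interface-stat-change" '
    'status="DOWN" msg="Link monitor: Interface AEx was turned down"',
]
print(dict(count_flaps(sample)))
```

Only the down transitions are counted, so one flap (down followed by up) is counted once.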

 

Troubleshooting:

In this section, the steps taken to identify the root cause of the issue are described:

  • Both the complete event logs and the forward traffic logs were collected for further analysis.
  • A monitoring script was created, which included the commands below, to assess the CPU, NP7, memory, sessions and network interface card:

 

The attached Tera Term macro 'Script_LACP_NP7.ttl' is an example in which an LACP interface is configured using the physical interfaces port30 and port31. The script must be adapted to the interface names, physical ports, and NP7_ID of the affected FortiGate platform.


Note: If the issue is intermittent, it is important to collect this information while the issue is occurring, because in some cases customers reboot the devices as a workaround to recover services as quickly as possible.

 

The logs obtained after the reboot (e.g., crashlog, top-mem, tac-report) are helpful but often insufficient for root cause analysis, particularly in cases involving NP7 leaks, abnormal packet drops, or sustained high CPU utilization.
Without debug or NP7 counter information captured during the issue, the TAC/DEV teams are not able to determine the root cause. As a result, technical investigations frequently require multiple follow-up cycles with the customer and additional data collection after subsequent occurrences of the issue.

Refer to the following articles about Script monitoring:


SSH session 1:

Run the following commands every minute:

 

get system status

sudo global get system performance status

diagnose sys session full-stat

diagnose sys session stat

diagnose sys session exp-stat

diagnose hardware sysinfo slab

diagnose sys top 2 30 3

diagnose sys profile report

fnsysctl cat /proc/net/snmp

sudo root diagnose snmp ip frags

sudo root diagnose netlink device list

sudo global diagnose hardware deviceinfo nic port17

sudo global diagnose hardware deviceinfo nic port18

sudo global diagnose hardware deviceinfo nic port19

sudo global diagnose hardware deviceinfo nic port20

diagnose sys vd list

sudo root d netlink interface packet-rate

sudo root d snmp ip frags

sudo root d snmp ip ip

diagnose har sys mem

diagnose har sys inter

sudo root diagnose ips raw st

diagnose sys sip status

fnsysctl date

 

SSH session 2:

 

diagnose sys mpstat 2

 

SSH session 3:

 

diagnose npu np7 dce-drop-all all

diagnose npu np7 cgmac-stats all

diagnose npu np7 hif-stats all

diagnose npu np7 pba all

diagnose npu np7 pdq all

diagnose npu np7 pmon all

diagnose npu np7 sse-stats all

diagnose npu np7 session-offload-stats all

diagnose npu np7 dsw-ingress-stats all

diagnose npu np7 dsw-egress-stats all

diagnose npu np7 msg htab-rate all

diagnose npu np7 msg sse-rate all

fnsysctl cat /proc/net/np7/np7_0/tbl/cdb_spv_htab_csr_info

fnsysctl cat /proc/net/np7/np7_0/tbl/cdb_tpv_htab_csr_info

 

For more information about NP7 troubleshooting, refer to Troubleshooting Tip: NP7 troubleshooting.
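
Because the port names and the NP7 ID differ between platforms, the per-port and per-NP7 lines of the monitoring script have to be regenerated for each device. The following Python sketch shows one possible way to do that. The helper names are hypothetical, and the assumption that other NP7 processors appear as np7_1, np7_2, and so on under /proc/net/np7 is extrapolated from the np7_0 path above and should be verified on the device.

```python
def build_nic_commands(ports):
    """Per-port NIC counter commands, mirroring the 'deviceinfo nic' lines of SSH session 1."""
    return [f"sudo global diagnose hardware deviceinfo nic {port}" for port in ports]


def build_np7_table_commands(np7_ids):
    """Per-NP7 table reads, mirroring the last two lines of SSH session 3.

    Assumption: every NP7 exposes its tables under /proc/net/np7/np7_<id>/tbl/.
    """
    tables = ("cdb_spv_htab_csr_info", "cdb_tpv_htab_csr_info")
    return [
        f"fnsysctl cat /proc/net/np7/np7_{np7_id}/tbl/{table}"
        for np7_id in np7_ids
        for table in tables
    ]


# Example for an aggregate built on port30 and port31, handled by NP7 number 0.
for command in build_nic_commands(["port30", "port31"]) + build_np7_table_commands([0]):
    print(command)
```

The generated lines can then be pasted into the macro in place of the port17 to port20 and np7_0 entries.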

 

  • Moreover, during the traffic burst period, the following sniffer was run to capture LACP PDU details, check for dropped LACP PDUs, and provide deeper insight into the problem:

 

diagnose sniffer packet any "ether proto 0x8809" 6 0 a
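
Once the capture is available, missing LACPDUs show up as gaps between consecutive frames on the same member port and in the same direction. The Python sketch below flags gaps longer than the LACP receive timeout (3 seconds for fast, 90 seconds for slow, as defined by the LACP standard). It assumes the timestamps have already been extracted from the sniffer output as seconds; the parsing step is not shown, and the arrival times used here are hypothetical.

```python
# LACP receive timeouts: three missed LACPDUs at the fast (1 s) or slow (30 s) rate.
LACP_TIMEOUT_SECONDS = {"fast": 3.0, "slow": 90.0}


def find_lacpdu_gaps(timestamps, lacp_speed="fast"):
    """Return (previous, current, gap) tuples where the gap exceeds the LACP timeout."""
    timeout = LACP_TIMEOUT_SECONDS[lacp_speed]
    ordered = sorted(timestamps)
    gaps = []
    for previous, current in zip(ordered, ordered[1:]):
        gap = current - previous
        if gap > timeout:
            gaps.append((previous, current, gap))
    return gaps


# Hypothetical LACPDU arrival times (seconds) for one member port, one direction.
arrivals = [0.0, 1.0, 2.0, 3.0, 7.5, 8.5]
print(find_lacpdu_gaps(arrivals))
```

A gap reported here only shows that no LACPDU was seen for longer than the timeout; whether the frame was dropped or never sent still has to be confirmed against the NP7 counters and the switch side.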

 

Root cause analysis:

After a thorough and detailed review of the collected logs, it was found that the NP queue became heavily congested while handling traffic during the burst. As the bandwidth spike occurred, LACP packets were not transmitted, which resulted in LACP flapping. The traffic burst caused NP congestion, preventing the timely handling of LACP packets mainly because the LACP packets were processed with lower priority.

 

This problem is not directly tied to the overall traffic load, but rather to a sudden surge in packet volume. During these bursts, a large number of packets must be processed within a very short time, which appears as spikes in the bandwidth graphs. This is why, even with 100 Gbps LACP member interfaces, traffic as low as, for example, 4 Gbps could trigger NP congestion.
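
The difference between the averaged rate shown on a graph and the rate during a burst can be illustrated with simple arithmetic. The Python sketch below assumes, purely for illustration, that all traffic counted in one averaging window arrives within a much shorter burst; the 40 ms burst length is a hypothetical value, not a measurement from this case.

```python
def burst_rate_gbps(average_gbps, window_ms, burst_ms):
    """Rate during the burst if all traffic of one averaging window arrives within burst_ms."""
    if burst_ms <= 0 or burst_ms > window_ms:
        raise ValueError("burst_ms must be positive and no longer than window_ms")
    return average_gbps * window_ms / burst_ms


# A graph averaging over 1 second shows 4 Gbps, but the traffic arrived in 40 ms.
print(burst_rate_gbps(4, 1000, 40))
```

Under these assumptions the NP briefly has to handle traffic at 25 times the graphed rate, which is consistent with congestion appearing at a seemingly low average load.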

 

Resolution steps:

The following resolution steps can be used to address the issue where LACP packets are processed with lower priority, leading to LACP interface flaps.

 

The primary solution is to upgrade the device to FortiOS 7.2.12 GA or later, where a dedicated high-priority LACP queue can be configured.

 

Additionally, the following configuration optimizations are recommended to improve conditions before and after upgrading to FortiOS 7.2.12:

 

  1. Configuring HA1/HA2 to use NP7 for session sync:

 

This step should only be performed if the traffic volume is below 20 Gbps. To assess this, it is recommended to review the bandwidth utilization graphs for the past six months and confirm that the traffic load does not exceed 20 Gbps. Additionally, this step must be performed prior to upgrading to FortiOS 7.2.12.
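
The six-month review can be reduced to a simple check on the peak values exported from the bandwidth graphs. The Python sketch below is one way to express it; the function name and the sample values are hypothetical, and the 20 Gbps limit is the one stated in this article.

```python
NPU_SESSION_SYNC_LIMIT_GBPS = 20  # limit stated in this article


def npu_session_sync_is_safe(peak_samples_gbps, limit_gbps=NPU_SESSION_SYNC_LIMIT_GBPS):
    """Return True only if every peak sample stays below the limit."""
    if not peak_samples_gbps:
        raise ValueError("at least one bandwidth sample is required")
    return max(peak_samples_gbps) < limit_gbps


# Hypothetical monthly peaks in Gbps.
print(npu_session_sync_is_safe([3.8, 4.1, 6.0]))
print(npu_session_sync_is_safe([3.8, 4.1, 25.0]))
```

The check uses the maximum rather than the mean, because a single peak above the limit is enough to make this optimization unsuitable.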

 

config system npu

   config port-path-option

      set ports-using-npu "ha1" "ha2"

   end

end

 

config system ha

   set group-id x

   set group-name <name>

   set mode a-p

   set sync-packet-balance enable <----- The secondary unit uses multiple CPUs to receive session sync.

   set session-sync-dev "ha1" "ha2" <----- The primary unit uses multiple CPUs to send session sync.

   set hbdev "ha1" 100 "ha2" 50

   set session-pickup enable

   set session-pickup-connectionless enable

   set override disable

   set priority 200

end

 

  2. SIP handling.

 

Unset default-voip-alg-mode so that SIP bursts are handled by a daemon instead of the kernel:

 

config vdom

    edit <vdom name>

        config system settings

            unset default-voip-alg-mode <----- Removes kernel-helper-based and reverts to the default mode.

        end

    end

 

  3. LACP speed adjustment.

 

Change the LACP speed from fast (LACPDUs every 1 second) to slow (LACPDUs every 30 seconds); 'unset lacp-speed' restores the slow default. (Note: This needs to be changed on the connected switch as well.)

 

config system interface

   edit <aggregate interface name>

      unset lacp-speed

   next

end

 

  4. High-priority LACP queue.

 

Enable a dedicated high-priority LACP queue (after upgrading to FortiOS 7.2.12):

 

config system npu

    set dedicated-lacp-queue enable

    config np-queues

        config ethernet-type

            edit "LACP"

                set type 8809

                set queue 11

            next

        end

    end

end

 

Note: After upgrading to FortiOS 7.2.12, it may be observed that traffic is being blocked by UTM application control, with the action in the logs showing as 'server-rst'. This is due to a change of behavior introduced in FortiOS 7.2.12 GA. Details are provided in the following documents:

 

 

To resolve this issue, the 'cert-probe-failure' setting can be changed to 'allow', which reverts the behavior to that seen in earlier FortiOS versions:

 

config firewall ssl-ssh-profile

    edit <profile_name>

        config https

            set cert-probe-failure allow

        end

    end

 

Scenario 2: High CPU / packet loss impacts.

 

After the recommended configuration optimizations are applied and the FortiGate is upgraded to FortiOS 7.2.12, which introduces the ability to create a priority queue for LACP packets, the LACP interface flap issue is resolved. However, a new issue with HA heartbeat packet loss may be encountered.

 

In the following sections, the symptoms will be detailed, the troubleshooting steps taken will be explained, the root cause will be analyzed, and the resolution steps that resolved the issue will be provided.

 

Symptoms:

The initial signs of the issue were HA problems on the FortiGate cluster a few hours after the upgrade. Heartbeat packet loss was consistently seen on the HA1 and HA2 interfaces of the primary FortiGate, and even after a failover, the same issue appeared on the secondary FortiGate as well.

 

type=event subtype=ha level=critical msg=Heartbeat packet lost logdesc=Heartbeat packet lost ha_role=primary devintfname=ha2 eventtime=1768221218 tz=-0500 devname="FG441F"

type=event subtype=ha level=critical msg=Heartbeat packet lost logdesc=Heartbeat packet lost ha_role=secondary devintfname=ha2 eventtime=1768221158 tz=-0500 devname="FG441F"

type=event subtype=ha level=critical msg=Heartbeat packet lost logdesc=Heartbeat packet lost ha_role=primary devintfname=ha1 eventtime=1768221065 tz=-0500 devname="FG441F"

type=event subtype=ha level=critical msg=Heartbeat packet lost logdesc=Heartbeat packet lost ha_role=secondary devintfname=ha1 eventtime=1768221063 tz=-0500 devname="FG441F"
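
To see whether the loss is tied to one heartbeat interface or one cluster role, the events can be grouped by ha_role and devintfname. The Python sketch below assumes the unquoted key=value layout of the samples above and extracts only those two fields, because values such as msg contain spaces and cannot be split generically. The function name is hypothetical.

```python
import re
from collections import Counter

ROLE_RE = re.compile(r"\bha_role=(\w+)")
HB_IFACE_RE = re.compile(r"\bdevintfname=(\w+)")


def count_heartbeat_losses(log_lines):
    """Return a Counter keyed by (ha_role, devintfname) for 'Heartbeat packet lost' events."""
    losses = Counter()
    for line in log_lines:
        if "msg=Heartbeat packet lost" not in line:
            continue  # some other HA event
        role = ROLE_RE.search(line)
        iface = HB_IFACE_RE.search(line)
        if role and iface:
            losses[(role.group(1), iface.group(1))] += 1
    return losses


sample = [
    'type=event subtype=ha level=critical msg=Heartbeat packet lost '
    'logdesc=Heartbeat packet lost ha_role=primary devintfname=ha2 '
    'eventtime=1768221218 tz=-0500 devname="FG441F"',
    'type=event subtype=ha level=critical msg=Heartbeat packet lost '
    'logdesc=Heartbeat packet lost ha_role=secondary devintfname=ha1 '
    'eventtime=1768221063 tz=-0500 devname="FG441F"',
]
for key, count in sorted(count_heartbeat_losses(sample).items()):
    print(key, count)
```

Losses spread across both roles and both interfaces, as in the samples above, point away from a single faulty link or port.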

 

Another notable indicator in this case was that the observed bandwidth reached approximately 25 Gbps.

 

Troubleshooting:

For a more in-depth understanding of the issue, the output from the monitoring script (described in the previous section) was collected along with the complete event logs and the FortiGate configuration file, and all were analyzed in detail.

 

Root cause analysis:

After the upgrade to FortiOS 7.2.12 GA, once the traffic began to be processed by the FortiGate, HA heartbeat packet loss started to occur a few hours later.

  • CPU spikes were observed during the issue.
  • The number of concurrent sessions reached several million.
  • A traffic peak of approximately 25 Gbps was noted.
  • Both ingress and egress traffic volumes were similar.
  • Multiple firewall policies had deep inspection enabled, along with UTM profiles such as Antivirus, IPS, and web filtering.

 

The upgrade to FortiOS 7.2.12 GA resolved the LACP interface flap issue; however, the HA configuration change (configuring HA1/HA2 to use NP7 for session sync) that was recommended as an optimization worsened the situation and resulted in multiple split-brain conditions. Each time, HA recovered on its own within one second.

 

The overall observed bandwidth of 25 Gbps was the reason that this optimization did not work as expected. The traffic level originally observed in this case was around 4 Gbps, but a closer examination of the bandwidth graphs from the past six months showed that it was much higher. This is why the workaround, which is NOT recommended for traffic above 20 Gbps, led to this unintended situation when it was applied in this case.

 

Resolution steps:

To resolve the HA heartbeat packet loss issue, the following solutions can be implemented:

 

  1. Revert the config for HA1/HA2 to use NP7 for session sync:

 

config system npu

    set dedicated-lacp-queue enable

    config port-path-option

        unset ports-using-npu

    end

end

 

  2. Optimize HA heartbeat and session-sync performance.

 

Set ha1/ha2/aux1/aux2 to 25G (in FortiGate 4401F), and connect them back-to-back:

 

config system interface

    edit "ha1"

        set vdom "root"

        set forward-error-correction cl91-rs-fec

        set speed 25000full

    next

    edit "ha2"

        set vdom "root"

        set forward-error-correction cl91-rs-fec

        set speed 25000full

    next

    edit "aux1"

        set vdom "root"

        set forward-error-correction cl91-rs-fec

        set speed 25000full

        set mtu-override enable

        set mtu 9000

    next

    edit "aux2"

        set vdom "root"

        set forward-error-correction cl91-rs-fec

        set speed 25000full

        set mtu-override enable

        set mtu 9000

    next

end

 

Note: Change MTU of aux1/aux2 to 9k for better session-sync performance.

 

Separate HA heartbeat interfaces (ha1/ha2) from session-sync interfaces (aux1/aux2) and increase HA heartbeat loss threshold:

 

config system ha

    set group-id 45

    set group-name <name>

    set mode a-p

    set sync-packet-balance enable

    set hbdev "ha1" 20 "ha2" 10

    set session-sync-dev "aux1" "aux2"

    set hb-interval 4

    set hb-lost-threshold 12

    set session-pickup enable

    set session-pickup-connectionless enable

    set override disable

end
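
The effect of the two heartbeat settings can be estimated by multiplying them. The Python sketch below assumes that hb-interval is expressed in units of 100 ms, which should be confirmed in the CLI reference for the running FortiOS version; the function name is hypothetical.

```python
def heartbeat_loss_window_seconds(hb_interval, hb_lost_threshold, unit_ms=100):
    """Time without heartbeats before the peer is declared lost.

    Assumption: hb_interval is expressed in units of unit_ms milliseconds.
    """
    return hb_interval * unit_ms * hb_lost_threshold / 1000


# Values from the configuration above: hb-interval 4, hb-lost-threshold 12.
print(heartbeat_loss_window_seconds(4, 12))
```

Under that assumption, the cluster tolerates 4.8 seconds of missing heartbeats, which gives short CPU spikes time to pass but also delays detection of a real failure by the same amount.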

 

  3. Move HA heartbeat ha1/ha2 tx/rx to appropriate CPU cores.

 

To implement this adjustment, it is advised to engage Fortinet support for assistance through the Fortinet Support Portal.

 

Conclusion:

In summary, the initial LACP interface flap was caused by a traffic burst that congested the NP and prevented the timely handling of LACP packets, mainly because the LACP packets were processed with lower priority. After upgrading to FortiOS 7.2.12 GA and implementing the recommended configuration optimizations, a split-brain scenario emerged due to unexpectedly high bandwidth (over 20 Gbps) and CPU overload. By carefully applying the recommended configuration changes and ensuring UTM features are balanced, stable HA behavior and efficient traffic handling can be achieved.