Troubleshooting Tip: LACP Interface Flaps and HA Heartbeat Packet Loss
| Description | This article describes how to troubleshoot LACP interface flaps and HA heartbeat packet loss on FortiGate 4401F appliances; the steps can also apply to other FortiGate models. It addresses scenarios in which logical aggregate interfaces experience intermittent LACP flaps and in which the changes made to resolve those flaps may in turn cause HA heartbeat packet loss. |
| Scope | FortiGate. |
| Solution | Scenario 1: Aggregate interface flaps. In this section, the focus is on the intermittent flapping of logical aggregate interfaces. The key components addressed include symptoms, troubleshooting, root cause analysis, and resolution steps.
Symptoms: The following symptoms were the initial indicators that led to the start of the troubleshooting process:
type="event" subtype="system" level="warning" vd="root" logdesc="Interface status changed" action="interface-stat-change" status="UP" msg="Link monitor: Interface AEx was turned up"
type="event" subtype="system" level="warning" vd="root" logdesc="Interface status changed" action="interface-stat-change" status="DOWN" msg="Link monitor: Interface AEx was turned down"
type="event" subtype="system" level="warning" vd="root" logdesc="FortiGuard webfilter reachable" msg="Fortiguard webfilter services are reachable"
type="event" subtype="system" level="error" vd="root" logdesc="URL filter packet send failure" process="urlfilter" reason="Connection refused" msg="failed to send urlfilter packet"
type=event subtype=router level=warning msg=BGP: %BGP-5-ADJCHANGE: VRF 0 neighbor x.x.x.x Down User reset logdesc=BGP neighbor status changed
type=event subtype=router level=warning msg=BGP: %BGP-5-ADJCHANGE: VRF 0 neighbor x.x.x.x Up logdesc=BGP neighbor status changed
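When many of these events are present, it can help to count how often each aggregate interface went up and down. The following is a minimal sketch, assuming the event log has been exported as plain text in the format shown above; the function name and the sample variable are illustrative and not part of any Fortinet tool.

```python
import re
from collections import Counter

# Pulls the interface name and new state out of the "Link monitor" message.
LINK_EVENT = re.compile(r'Link monitor: Interface (\S+) was turned (up|down)')


def count_link_transitions(log_text):
    """Return a Counter keyed by (interface, state) for every link event found."""
    return Counter(LINK_EVENT.findall(log_text))


# Sample input: the two interface events quoted in this article.
sample_log = (
    'type="event" subtype="system" level="warning" vd="root" '
    'logdesc="Interface status changed" action="interface-stat-change" '
    'status="UP" msg="Link monitor: Interface AEx was turned up"\n'
    'type="event" subtype="system" level="warning" vd="root" '
    'logdesc="Interface status changed" action="interface-stat-change" '
    'status="DOWN" msg="Link monitor: Interface AEx was turned down"\n'
)

transitions = count_link_transitions(sample_log)
for (interface, state), count in sorted(transitions.items()):
    print(f"{interface}: turned {state} {count} time(s)")
```

A high and roughly equal number of up and down transitions on the same interface is consistent with flapping rather than a single outage.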
Troubleshooting: In this section, the steps taken to identify the root cause of the issue are described:
The attached Tera Term macro 'Script_LACP_NP7.ttl' is an example for an LACP interface configured on the physical interfaces port30 and port31. The script must be adapted to the interface names, physical ports, and NP7_ID of the affected FortiGate platform.
The logs obtained after a reboot (e.g., crashlog, top-mem, tac-report) are helpful but often insufficient for proactive root cause analysis, particularly in cases involving NP7 leaks, abnormal packet drops, or sustained high CPU utilization. Refer to the following articles about script monitoring:
SSH session 1:
get system status
sudo global get system performance status
diagnose sys session full-stat
diagnose sys session stat
diagnose sys session exp-stat
diagnose hardware sysinfo slab
diagnose sys top 2 30 3
diagnose sys profile report
fnsysctl cat /proc/net/snmp
sudo root diagnose snmp ip frags
sudo root diagnose netlink device list
sudo global diagnose hardware deviceinfo nic port17
sudo global diagnose hardware deviceinfo nic port18
sudo global diagnose hardware deviceinfo nic port19
sudo global diagnose hardware deviceinfo nic port20
diagnose sys vd list
sudo root d netlink interface packet-rate
sudo root d snmp ip frags
sudo root d snmp ip ip
diagnose har sys mem
diagnose har sys inter
sudo root diagnose ips raw st
diagnose sys sip status
fnsysctl date
SSH session 2:
diagnose sys mpstat 2
SSH session 3:
diagnose npu np7 dce-drop-all all
diagnose npu np7 cgmac-stats all
diagnose npu np7 hif-stats all
diagnose npu np7 pba all
diagnose npu np7 pdq all
diagnose npu np7 pmon all
diagnose npu np7 sse-stats all
diagnose npu np7 session-offload-stats all
diagnose npu np7 dsw-ingress-stats all
diagnose npu np7 dsw-egress-stats all
diagnose npu np7 msg htab-rate all
diagnose npu np7 msg sse-rate all
fnsysctl cat /proc/net/np7/np7_0/tbl/cdb_spv_htab_csr_info
fnsysctl cat /proc/net/np7/np7_0/tbl/cdb_tpv_htab_csr_info
For more information about NP7 troubleshooting, refer to Troubleshooting Tip: NP7 troubleshooting.
diagnose sniffer packet any "ether proto 0x8809" 6 0 a
Root cause analysis: A detailed review of the collected logs showed that the NP queue became heavily congested while handling traffic bursts. During each bandwidth spike, LACP packets were not transmitted, which caused the LACP flaps. The congestion prevented the timely handling of LACP packets mainly because they were processed with lower priority.
This problem is not directly tied to the overall traffic load, but rather to sudden surges in packet volume. During these bursts, a large number of packets require processing at once, producing the spikes visible in the bandwidth graphs. This is why, even with 100 Gbps LACP member interfaces, traffic as low as 4 Gbps (for example) can trigger NP congestion.
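One way to see why a modest traffic figure can coexist with congestion is to compare the average of a measurement window with its peak. The sketch below is purely illustrative: the per-second samples are invented, and the 5-minute window is an assumption about how a bandwidth graph might aggregate data, not a statement about how FortiGate graphs are built.

```python
from statistics import mean


def summarize_window(samples_gbps):
    """Return (average, peak) for one graph interval of per-second samples."""
    return mean(samples_gbps), max(samples_gbps)


# Hypothetical 5-minute window: 298 quiet seconds and a 2-second burst.
window = [3.5] * 298 + [80.0] * 2

average, peak = summarize_window(window)
print(f"average {average:.2f} Gbps, peak {peak:.1f} Gbps")
```

In this invented window the average stays close to 4 Gbps even though the burst itself is many times larger, which is the kind of gap that an averaged figure can hide.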
Resolution steps: The following resolution steps can be used to address the issue where LACP packets are processed with lower priority, leading to LACP interface flaps.
The primary solution is to upgrade the device to FortiOS 7.2.12 GA or later, which adds the option to configure a dedicated high-priority LACP queue.
Additionally, the following configuration optimizations are recommended to improve conditions before and after upgrading to FortiOS 7.2.12:
Configure ha1/ha2 to use NP7 for session synchronization. This step should only be performed if the traffic volume is below 20 Gbps. To confirm this, review the bandwidth utilization graphs for the past six months and verify that the traffic load never exceeds 20 Gbps. This step must also be performed before upgrading to FortiOS 7.2.12.
config system npu
    config port-path-option
        set ports-using-npu "ha1" "ha2"
    end
end
config system ha
    set group-id x
    set group-name <name>
    set mode a-p
    set sync-packet-balance enable <----- The secondary unit uses multiple CPUs to receive session sync.
    set session-sync-dev "ha1" "ha2" <----- The primary unit uses multiple CPUs to send session sync.
    set hbdev "ha1" 100 "ha2" 50
    set session-pickup enable
    set session-pickup-connectionless enable
    set override disable
    set priority 200
end
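The 20 Gbps precondition for this step can be checked mechanically once peak values have been read from the bandwidth graphs. The following is a minimal sketch: the function name and the sample values are hypothetical, and only the 20 Gbps limit comes from this article.

```python
NPU_SYNC_LIMIT_GBPS = 20.0  # threshold stated in this article


def npu_sync_step_allowed(peaks_gbps, limit_gbps=NPU_SYNC_LIMIT_GBPS):
    """Return (allowed, highest_peak); allowed only if every peak is below the limit."""
    if not peaks_gbps:
        raise ValueError("no bandwidth samples supplied")
    highest = max(peaks_gbps)
    return highest < limit_gbps, highest


# Hypothetical monthly peak values (Gbps) read from six months of graphs.
monthly_peaks = [4.0, 6.5, 11.0, 25.0, 9.0, 5.5]

allowed, highest = npu_sync_step_allowed(monthly_peaks)
print(f"highest peak {highest} Gbps -> apply step: {allowed}")
```

The check deliberately uses the highest observed value rather than a typical one, since a single month above the limit is enough to rule the step out.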
Unset default-voip-alg-mode if it is currently set to kernel-helper-based, so that SIP bursts are handled by a daemon instead of the kernel:
config vdom
    edit <vdom name>
        config system settings
            set default-voip-alg-mode kernel-helper-based <----- Unset default-voip-alg-mode.
        end
end
Change the LACP speed from fast (one LACPDU per second) to slow, which is the default. (Note: The same change must be made on the connected switch.)
config system interface
    edit <aggregate interface name>
        unset lacp-speed
    next
end
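The effect of moving from the fast rate to the slower default can be estimated from the standard LACP timers. The sketch below uses the periodic intervals defined in IEEE 802.1AX (1 second and 30 seconds) and the usual rule that a partner is timed out after three missed LACPDUs; these values come from the standard rather than from this article, and it is an assumption that both the FortiGate and the connected switch follow them.

```python
# Periodic transmission intervals defined by IEEE 802.1AX (seconds).
LACP_PERIOD_S = {"fast": 1, "slow": 30}
# A partner is timed out after three consecutive missed LACPDUs.
MISSED_PDUS_BEFORE_TIMEOUT = 3


def lacp_timeout_s(rate):
    """Return how long LACPDUs can be missing before the partner times out."""
    return LACP_PERIOD_S[rate] * MISSED_PDUS_BEFORE_TIMEOUT


for rate in ("fast", "slow"):
    print(f"{rate}: LACPDU every {LACP_PERIOD_S[rate]} s, "
          f"timeout after {lacp_timeout_s(rate)} s")
```

Under these assumptions, a short burst that delays LACP packets for a few seconds can exceed the fast timeout but stays well inside the slow one, at the cost of slower detection of a genuinely failed link.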
Configure a dedicated high-priority LACP queue (after upgrading to FortiOS 7.2.12):
config system npu
    set dedicated-lacp-queue enable
    config np-queues
        config ethernet-type
            edit "LACP"
                set type 8809
                set queue 11
            next
        end
    end
end
Note: After upgrading to FortiOS 7.2.12, it may be observed that traffic is being blocked by UTM application control, with the action in the logs showing as 'server-rst'. This is due to a change of behavior introduced in FortiOS 7.2.12 GA. Details are provided in the following documents:
To resolve this issue, change the 'cert-probe-failure' setting to 'allow', which reverts the behavior to that of earlier FortiOS versions:
config firewall ssl-ssh-profile
    edit <profile_name>
        config https
            set cert-probe-failure allow
        end
end
Scenario 2: High CPU / packet loss impacts.
After the recommended configuration optimizations are applied and the FortiGate is upgraded to FortiOS 7.2.12, which introduces the ability to create a priority queue for LACP packets, the LACP interface flap issue is resolved. However, a new issue with HA heartbeat packet loss may be encountered.
In the following sections, the symptoms will be detailed, the troubleshooting steps taken will be explained, the root cause will be analyzed, and the resolution steps that resolved the issue will be provided.
Symptoms: The first signs of the issue were HA problems on the FortiGate cluster a few hours after the upgrade. Heartbeat packet loss was consistently seen on the ha1 and ha2 interfaces of the primary FortiGate, and after a failover the same issue appeared on the secondary FortiGate as well.
type=event subtype=ha level=critical msg=Heartbeat packet lost logdesc=Heartbeat packet lost ha_role=primary devintfname=ha2 eventtime=1768221218 tz=-0500 devname="FG441F"
type=event subtype=ha level=critical msg=Heartbeat packet lost logdesc=Heartbeat packet lost ha_role=secondary devintfname=ha2 eventtime=1768221158 tz=-0500 devname="FG441F"
type=event subtype=ha level=critical msg=Heartbeat packet lost logdesc=Heartbeat packet lost ha_role=primary devintfname=ha1 eventtime=1768221065 tz=-0500 devname="FG441F"
type=event subtype=ha level=critical msg=Heartbeat packet lost logdesc=Heartbeat packet lost ha_role=secondary devintfname=ha1 eventtime=1768221063 tz=-0500 devname="FG441F"
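To judge how widespread the loss is, the heartbeat events can be grouped by interface and the time span between the first and last event calculated. The following is a minimal sketch, assuming the HA event log is available as plain text in the field order shown above and that eventtime is in seconds; the function name and the template used to rebuild the sample entries are illustrative.

```python
import re
from collections import Counter

# Matches heartbeat-loss entries and captures the interface and event time.
HB_LOST = re.compile(
    r'logdesc=Heartbeat packet lost ha_role=\w+ devintfname=(\w+) eventtime=(\d+)'
)


def heartbeat_loss_summary(log_text):
    """Return (losses per interface, seconds between first and last loss)."""
    events = [(intf, int(ts)) for intf, ts in HB_LOST.findall(log_text)]
    per_interface = Counter(intf for intf, _ in events)
    times = [ts for _, ts in events]
    span_s = max(times) - min(times) if times else 0
    return per_interface, span_s


# Rebuilds the four log entries quoted in this article.
TEMPLATE = (
    'type=event subtype=ha level=critical msg=Heartbeat packet lost '
    'logdesc=Heartbeat packet lost ha_role={role} devintfname={intf} '
    'eventtime={ts} tz=-0500 devname="FG441F"'
)
sample_log = "\n".join(
    TEMPLATE.format(role=role, intf=intf, ts=ts)
    for role, intf, ts in [
        ("primary", "ha2", 1768221218),
        ("secondary", "ha2", 1768221158),
        ("primary", "ha1", 1768221065),
        ("secondary", "ha1", 1768221063),
    ]
)

losses, span_s = heartbeat_loss_summary(sample_log)
print(dict(losses), f"within {span_s} s")
```

Losses spread across both heartbeat interfaces within a short span suggest a condition affecting the unit as a whole rather than a single faulty link.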
Another notable indicator in this case was that the observed bandwidth reached approximately 25 Gbps.
Troubleshooting: For a more in-depth understanding of the issue, the output from the monitoring script (described in the previous section) was collected along with the complete event logs and the FortiGate configuration file, and all were analyzed in detail.
Root cause analysis: After the upgrade to FortiOS 7.2.12 GA, once the traffic began to be processed by the FortiGate, HA heartbeat packet loss started to occur a few hours later.
The upgrade to 7.2.12 GA resolved the LACP interface flap issue; however, the HA configuration change that had been recommended as an optimization (configuring ha1/ha2 to use NP7 for session sync) worsened the situation and caused multiple split-brain conditions. Each time, HA recovered on its own within one second.
The overall observed bandwidth of 25 Gbps is the reason this optimization did not work as expected. The traffic level originally reported in this case was around 4 Gbps, but the bandwidth graphs for the past six months showed that it was much higher. Because the workaround is NOT recommended for traffic above 20 Gbps, applying it in this case led to this unintended situation.
Resolution steps: To resolve the HA heartbeat packet loss issue, the following solutions can be implemented:
Remove ha1/ha2 from the NP7 port path while keeping the dedicated LACP queue enabled:
config system npu
    set dedicated-lacp-queue enable
    config port-path-option
        unset ports-using-npu
    end
end
Set ha1/ha2/aux1/aux2 to 25G (in FortiGate 4401F), and connect them back-to-back:
config system interface
    edit "ha1"
        set vdom "root"
        set forward-error-correction cl91-rs-fec
        set speed 25000full
    next
    edit "ha2"
        set vdom "root"
        set forward-error-correction cl91-rs-fec
        set speed 25000full
    next
    edit "aux1"
        set vdom "root"
        set forward-error-correction cl91-rs-fec
        set speed 25000full
        set mtu-override enable
        set mtu 9000
    next
    edit "aux2"
        set vdom "root"
        set forward-error-correction cl91-rs-fec
        set speed 25000full
        set mtu-override enable
        set mtu 9000
    next
end
Note: The MTU of aux1/aux2 is set to 9000 for better session-sync performance.
Separate HA heartbeat interfaces (ha1/ha2) from session-sync interfaces (aux1/aux2) and increase HA heartbeat loss threshold:
config system ha
    set group-id 45
    set group-name <name>
    set mode a-p
    set sync-packet-balance enable
    set hbdev "ha1" 20 "ha2" 10
    set session-sync-dev "aux1" "aux2"
    set hb-interval 4
    set hb-lost-threshold 12
    set session-pickup enable
    set session-pickup-connectionless enable
    set override disable
end
To implement this adjustment, it is advised to engage Fortinet support for assistance. See the Fortinet Support Portal: Welcome to Fortinet Support.
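The combined effect of the hb-interval and hb-lost-threshold values used above can be estimated with simple arithmetic. The sketch below assumes that hb-interval is expressed in units of 100 ms; that unit is not stated in this article and should be confirmed against the CLI reference for the running FortiOS version. The function name is illustrative.

```python
HB_INTERVAL_UNIT_MS = 100  # assumed unit for hb-interval; verify for your FortiOS version


def heartbeat_failure_window_ms(hb_interval, hb_lost_threshold,
                                unit_ms=HB_INTERVAL_UNIT_MS):
    """Return how long heartbeats can be missing before the peer is declared lost."""
    return hb_interval * unit_ms * hb_lost_threshold


# Values taken from the HA configuration in this article.
window_ms = heartbeat_failure_window_ms(hb_interval=4, hb_lost_threshold=12)
print(f"peer declared lost after about {window_ms / 1000} s without heartbeats")
```

A longer window makes the cluster more tolerant of brief heartbeat delays during traffic bursts, while also lengthening the time needed to detect a real failure of the peer.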
Conclusion: In summary, the initial LACP interface flaps were caused by traffic bursts that congested the NP and prevented the timely handling of LACP packets, mainly because those packets were processed with lower priority. After upgrading to FortiOS 7.2.12 GA and implementing the recommended configuration optimizations, a split-brain scenario emerged due to unexpectedly high bandwidth (over 20 Gbps) and CPU overload. By carefully applying the recommended configuration changes and ensuring UTM features are balanced, stable HA behavior and efficient traffic handling can be achieved. |
