Technical Tip: L2 Loops and high softirq on FortiGate

Ehanssen · ‎08-21-2024

Description

This article describes how to detect L2 loops (also known as broadcast storms or switch/Layer 2 loops) on a FortiGate using performance-related CLI commands.

Scope

All FortiGate models and versions.

Solution

There are many potential causes of high softirq, such as excessive traffic or offloading issues. This article focuses on L2 loops as one such cause.

For initial context, softirq CPU usage on the FortiGate is frequently associated with receiving and processing packets that arrive on a FortiGate network interface. When a packet arrives, a software interrupt (softirq) is raised to signal to the CPU that the packet has arrived and must be processed.

For hardware FortiGates, softirq usage tends to remain very low since the bulk of the traffic flow can be offloaded to the onboard Network Processor (NP) hardware. Traffic is only handled by the CPU if inspection is actively taking place, or if the configuration does not allow for hardware offloading of any kind (for example, software switches or a lack of hardware-offloading capability). Notably, broadcast packets must be processed by the CPU, even if the traffic is not relevant to the receiving host.

With that in mind, broadcast storms/L2 loops are scenarios where broadcast packets are allowed to continuously circulate and accumulate through the network, rather than the packet being sent through the network once.

This situation occurs when Layer 2 network switches are physically connected in such a way that a loop path can form, and it results in a rapidly growing flood of traffic circulating through the network.

The flood of traffic caused by an L2 loop can quickly overwhelm the interfaces of any connected device, resulting in high softirq CPU usage and major impacts to the network that can render connected devices unusable. Legitimate user traffic will frequently be dropped or heavily delayed/degraded during this period.

One frequently observed symptom is a deceptively low number of sessions in the performance statistics relative to the amount of CPU usage. In the example below, there are only 200 active sessions, yet there is 90%+ softirq on multiple CPU cores:

FortiGate # get sys perf stat
CPU states: 8% user 6% system 0% nice 4% idle 0% iowait 0% irq 82% softirq
CPU0 states: 2% user 1% system 0% nice 5% idle 0% iowait 0% irq 92% softirq
CPU1 states: 4% user 1% system 0% nice 3% idle 0% iowait 0% irq 92% softirq
CPU2 states: 25% user 23% system 0% nice 4% idle 0% iowait 0% irq 48% softirq
CPU3 states: 0% user 0% system 0% nice 5% idle 0% iowait 0% irq 95% softirq
Memory: 1911044k total, 1092432k used (57.2%), 626820k free (32.8%), 191792k freeable (10.0%)
Average network usage: 699239 / 135 kbps in 1 minute, 725343 / 353 kbps in 10 minutes, 763050 / 415 kbps in 30 minutes
Average sessions: 143 sessions in 1 minute, 215 sessions in 10 minutes, 198 sessions in 30 minutes
Uptime: 0 days, 18 hours, 0 minutes
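The check described above (high softirq despite few sessions) can be sketched as a small script. This is a minimal illustration only: the function names and both thresholds are assumptions made for this example and are not Fortinet-recommended values.

```python
import re

# Output copied from the example above (trimmed to the relevant lines).
SAMPLE = """\
CPU states: 8% user 6% system 0% nice 4% idle 0% iowait 0% irq 82% softirq
CPU0 states: 2% user 1% system 0% nice 5% idle 0% iowait 0% irq 92% softirq
CPU1 states: 4% user 1% system 0% nice 3% idle 0% iowait 0% irq 92% softirq
CPU2 states: 25% user 23% system 0% nice 4% idle 0% iowait 0% irq 48% softirq
CPU3 states: 0% user 0% system 0% nice 5% idle 0% iowait 0% irq 95% softirq
Average sessions: 143 sessions in 1 minute, 215 sessions in 10 minutes, 198 sessions in 30 minutes
"""

def parse_perf_status(text):
    """Return ({core: softirq %}, 1-minute session average or None)."""
    softirq = {}
    # Only per-core lines (CPU0, CPU1, ...) match; the summary line does not.
    pattern = r"^(CPU\d+) states:.*?(\d+)% softirq"
    for core, pct in re.findall(pattern, text, re.MULTILINE):
        softirq[core] = int(pct)
    m = re.search(r"Average sessions: (\d+) sessions in 1 minute", text)
    sessions = int(m.group(1)) if m else None
    return softirq, sessions

def looks_like_storm(softirq, sessions,
                     softirq_threshold=80, session_threshold=1000):
    """Heuristic: at least one core is busy with softirq while sessions are few.

    Both thresholds are illustrative assumptions, not vendor guidance.
    """
    if sessions is None or not softirq:
        return False
    busy = [core for core, pct in softirq.items() if pct >= softirq_threshold]
    return bool(busy) and sessions < session_threshold

softirq, sessions = parse_perf_status(SAMPLE)
print(softirq, sessions, looks_like_storm(softirq, sessions))
```

A positive result is only a hint that a storm may be in progress; it should be confirmed with the interface counters and a packet capture.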

One way to determine whether a broadcast storm/L2 loop is occurring is to review the interface statistics. Check them on the FortiGate with the command fnsysctl ifconfig and look for a large amount of Received (RX/incoming) bytes relative to Transmitted (TX/outgoing) bytes. The RX counter includes bogus (i.e., non-useful storm) L2 traffic. In this example, the output shows roughly 5800 GB of received data compared to 2 GB of transmitted data.

FGT01# fnsysctl ifconfig

internal Link encap:Ethernet HWaddr 70:4C:A5:BE:F4:1D
inet addr:10.104.0.1 Bcast:10.104.0.255 Mask:255.255.255.0
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:56320794832 errors:0 dropped:0 overruns:0 frame:0
TX packets:8629957 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:6264579466586 (5834.3 GB) TX bytes:2197191565 (2.0 GB)
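As a rough illustration of the RX-versus-TX comparison, the following sketch pulls the byte counters out of the ifconfig-style output and computes their ratio. The function name is an assumption for this example, and the code only reads the first interface found in the text it is given.

```python
import re

def rx_tx_bytes(ifconfig_text):
    """Return (rx_bytes, tx_bytes) from the first 'RX bytes:... TX bytes:...' line."""
    m = re.search(r"RX bytes:(\d+).*?TX bytes:(\d+)", ifconfig_text)
    if m is None:
        raise ValueError("no RX/TX byte counters found")
    return int(m.group(1)), int(m.group(2))

# Counter line copied from the example output above.
sample = "RX bytes:6264579466586 (5834.3 GB) TX bytes:2197191565 (2.0 GB)"

rx, tx = rx_tx_bytes(sample)
ratio = rx / tx
print(f"RX is about {ratio:.0f} times TX")
```

There is no fixed ratio that proves a loop; the value is only meaningful when compared with what is normal for the interface in question.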

Use the CLI command get system performance status to check the uptime of the FortiGate/cluster, then divide the volume of data received by the uptime in seconds to estimate the average rate of traffic hitting the interface. In this example, this would be roughly 85 MB/s (about 680 Mbps sustained since boot), which is abnormal for only 200 active sessions (and when compared with historical usage for the network environment).

5,800,000 MB / (19 * 60 * 60 s) ≈ 84.8 MB/s
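The same arithmetic can be written as a short helper. This is a sketch using the rounded figures from the calculation above (5,800,000 MB received, 19 hours of uptime); the function name is an assumption for this example.

```python
def average_rx_rate(rx_megabytes, uptime_seconds):
    """Return the average receive rate as (MB/s, Mbit/s) since boot."""
    if uptime_seconds <= 0:
        raise ValueError("uptime must be positive")
    mb_per_s = rx_megabytes / uptime_seconds
    return mb_per_s, mb_per_s * 8  # 8 bits per byte

uptime = 19 * 60 * 60  # 19 hours in seconds, as used in the calculation above
mb_per_s, mbit_per_s = average_rx_rate(5_800_000, uptime)
print(f"{mb_per_s:.1f} MB/s, {mbit_per_s:.0f} Mbit/s")
```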

Note that the interface statistics produced by fnsysctl ifconfig accumulate from the time the FortiGate booted. If the device has been running for a long time, a recent loop may be averaged out over the whole uptime, so the calculated number on its own might suggest that no loop is happening.

For a more accurate calculation, run the command once, wait for a short period (for example, 15 to 60 seconds), then run it again. Calculate the difference in Received bytes between the two outputs (i.e., the delta of Received bytes) and divide that by the number of seconds waited to determine the current receive rate during the testing period.
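The two-sample calculation can be sketched as follows. The function name and the counter values in the usage lines are hypothetical and chosen only to illustrate the arithmetic; they are not taken from a real device.

```python
def rx_rate_between_samples(rx_first, rx_second, seconds):
    """Return the receive rate in Mbit/s between two RX byte counter readings."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    delta = rx_second - rx_first
    if delta < 0:
        # The counter went backwards, e.g. after a reboot or counter reset.
        raise ValueError("second sample is smaller than the first")
    bytes_per_second = delta / seconds
    return bytes_per_second * 8 / 1_000_000

# Hypothetical readings taken 30 seconds apart.
rate = rx_rate_between_samples(1_000_000_000, 3_550_000_000, 30)
print(f"{rate:.0f} Mbit/s")
```

Unlike the since-boot average, this figure reflects only the measurement window, so it is not diluted by a long uptime.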

Another method available when checking for network loops is to run a brief packet capture without any filters, using the FortiGate's built-in packet sniffer (diagnose sniffer packet).

The sniffer only needs to be run for a handful of packets (10 to 100 at most) to gather a useful sample. The sample will show if there is a high volume of broadcast traffic being received compared to useful unicast traffic, and this can be used to determine if a broadcast storm is occurring.
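To illustrate what "a high volume of broadcast traffic compared to unicast" means in practice, the sketch below classifies destination MAC addresses from a capture sample. It assumes the destination MACs have already been extracted from the capture by some other means; it does not parse FortiGate sniffer output, and the function names and sample addresses are hypothetical.

```python
def classify_mac(mac):
    """Return 'broadcast', 'multicast' or 'unicast' for a destination MAC."""
    mac = mac.lower()
    if mac == "ff:ff:ff:ff:ff:ff":
        return "broadcast"
    # The least significant bit of the first octet marks a group (multicast) address.
    first_octet = int(mac.split(":")[0], 16)
    return "multicast" if first_octet & 1 else "unicast"

def summarize(dst_macs):
    """Count each traffic class and return the counts plus the broadcast share."""
    if not dst_macs:
        raise ValueError("empty sample")
    counts = {"broadcast": 0, "multicast": 0, "unicast": 0}
    for mac in dst_macs:
        counts[classify_mac(mac)] += 1
    counts["broadcast_share"] = counts["broadcast"] / len(dst_macs)
    return counts

# Hypothetical 10-packet sample: 8 broadcast, 1 unicast, 1 multicast.
sample = ["ff:ff:ff:ff:ff:ff"] * 8 + ["70:4c:a5:be:f4:1d", "01:00:5e:00:00:fb"]
result = summarize(sample)
print(result)
```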

Interface bandwidth widgets can be created for up to 25 interfaces in v7.2 and earlier. In v7.4 and above, there is no limit. These widgets can be useful when troubleshooting loops or broadcast storms when it is not known which interface is affected. An interface experiencing a broadcast storm may show a large amount of inbound traffic with little or no outbound traffic. If high softirq is seen intermittently, it may be possible to match it to unexpected traffic seen on a widget. See this article for more information:
Technical Tip: How to check interface bandwidth utilization from GUI

To resolve a broadcast storm, the network loop must be eliminated and then prevented from recurring. Disconnecting the network interfaces that create the looped path is the recommended immediate course of action, and using a protocol such as Spanning Tree Protocol (STP) can help prevent loops from forming in the first place.

If the switching is done via FortiSwitch, refer to the following article:
Technical Tip: Identify a LOOP layer 2 under FortiSwitch

To help identify the origin of a network loop, FortiOS v7.6.0 introduces a new feature: Logging MAC Address Flapping Events.

This feature is very useful because if the same MAC address is learned on different FortiGate interfaces, it will be logged, making the loop mitigation easier and faster.

Additionally, if the logs are sent to a monitoring tool or Syslog server, detecting and addressing them promptly will reduce the duration of severe outages caused by broadcast storms.

Details about this feature are provided below:

Logging MAC address flapping events
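The idea behind MAC flapping detection (the same MAC address being learned on more than one interface) can be sketched generically as follows. This does not parse FortiOS log messages; the (MAC, interface) tuples, the interface names, and the function name are hypothetical inputs chosen for illustration.

```python
from collections import defaultdict

def find_flapping_macs(events):
    """Return {mac: [interfaces]} for every MAC learned on more than one interface.

    `events` is an iterable of (mac, interface) pairs, in any order.
    """
    seen = defaultdict(set)
    for mac, interface in events:
        seen[mac.lower()].add(interface)
    return {mac: sorted(ifaces) for mac, ifaces in seen.items() if len(ifaces) > 1}

# Hypothetical learning events.
events = [
    ("70:4C:A5:BE:F4:1D", "port1"),
    ("70:4c:a5:be:f4:1d", "port2"),
    ("00:11:22:33:44:55", "port1"),
]
flapping = find_flapping_macs(events)
print(flapping)
```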

Related articles:

Troubleshooting Tip: Check SoftIrq increments (recommended when experiencing high CPU usage)

Troubleshooting Tip: How high CPU usage should be investigated

