Description |
This article explains SoftIrqs, what causes them to increase in frequency or show high variations, and some ways to check for them in FortiGate. |
Scope | FortiGate. |
Solution |
A SoftIrq is a software interrupt. It occurs when traffic is processed by the CPU instead of being offloaded (accelerated) to the NPU.
A SoftIrq can also be invoked by an instruction that reads data from or writes data to a hardware device (such as a hard disk). Software interrupts are also crucial when real-time capability is required (such as in industrial applications).
It is possible to check for SoftIrqs in FortiGate and monitor increases by using the following command in the FortiGate CLI (example output is shown below):
dia sys mpstat
By default, this command fetches data every 5 seconds and keeps running until Ctrl+C is pressed.
dia sys mpstat 3 5
This command fetches the same data as the command above, but at a 3-second interval and only 5 times. Customize these parameters as desired. SoftIrq usage per CPU core is also shown in the output of the following command:
get sys performance status
CPU states: 0% user 0% system 0% nice 67% idle 0% iowait 0% irq 33% softirq
CPU0 states: 0% user 0% system 0% nice 55% idle 0% iowait 0% irq 45% softirq
CPU1 states: 0% user 0% system 0% nice 19% idle 0% iowait 0% irq 81% softirq
CPU2 states: 1% user 0% system 0% nice 32% idle 0% iowait 0% irq 67% softirq
CPU3 states: 0% user 0% system 0% nice 66% idle 0% iowait 0% irq 34% softirq
Memory: 1911192k total, 1002652k used (52.5%), 645292k free (33.8%), 263248k freeable (13.8%)
Average network usage: 4266268 / 4275456 kbps in 1 minute, 4145133 / 4155622 kbps in 10 minutes, 4091696 / 4101178 kbps in 30 minutes
Maximal network usage: 4539464 / 4547537 kbps in 1 minute, 4895169 / 4908443 kbps in 10 minutes, 4895169 / 4908443 kbps in 30 minutes
Average sessions: 291687 sessions in 1 minute, 293226 sessions in 10 minutes, 293696 sessions in 30 minutes
Maximal sessions: 292629 sessions in 1 minute, 298552 sessions in 10 minutes, 307791 sessions in 30 minutes
Average session setup rate: 2776 sessions per second in last 1 minute, 2749 sessions per second in last 10 minutes, 2742 sessions per second in last 30 minutes
Maximal session setup rate: 2893 sessions per second in last 1 minute, 3100 sessions per second in last 10 minutes, 3309 sessions per second in last 30 minutes
Average NPU sessions: 35 sessions in last 1 minute, 36 sessions in last 10 minutes, 36 sessions in last 30 minutes
Maximal NPU sessions: 37 sessions in last 1 minute, 43 sessions in last 10 minutes, 49 sessions in last 30 minutes
Average nTurbo sessions: 0 sessions in last 1 minute, 0 sessions in last 10 minutes, 0 sessions in last 30 minutes
Maximal nTurbo sessions: 0 sessions in last 1 minute, 0 sessions in last 10 minutes, 0 sessions in last 30 minutes
Virus caught: 0 total in 1 minute
IPS attacks blocked: 0 total in 1 minute
Uptime: 16 days, 17 hours, 47 minutes
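When this output is collected repeatedly (for example, saved from an SSH session), the per-core softirq figures can be pulled out with a short script. The following is a minimal Python sketch, not a Fortinet tool: the function name `softirq_by_cpu` is hypothetical, and the sample text is copied from the CPU lines of the output above.

```python
import re

# Matches "CPU states: ... 33% softirq" as well as "CPU0 states: ... 45% softirq".
CPU_LINE = re.compile(r"(CPU\d*) states:.*?(\d+)% softirq")


def softirq_by_cpu(output: str) -> dict:
    """Return {CPU label: softirq percent} parsed from the command output.

    Hypothetical helper: it relies only on the 'CPUn states: ... N% softirq'
    layout shown in this article.
    """
    return {label: int(pct) for label, pct in CPU_LINE.findall(output)}


# CPU lines copied from the example output in this article.
sample = """\
CPU states: 0% user 0% system 0% nice 67% idle 0% iowait 0% irq 33% softirq
CPU0 states: 0% user 0% system 0% nice 55% idle 0% iowait 0% irq 45% softirq
CPU1 states: 0% user 0% system 0% nice 19% idle 0% iowait 0% irq 81% softirq
CPU2 states: 1% user 0% system 0% nice 32% idle 0% iowait 0% irq 67% softirq
CPU3 states: 0% user 0% system 0% nice 66% idle 0% iowait 0% irq 34% softirq
"""

for cpu, pct in softirq_by_cpu(sample).items():
    print(f"{cpu}: {pct}% softirq")
```

The label `CPU` (without a number) is the aggregate line; `CPU0` to `CPU3` are the individual cores.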
Possible reasons for SoftIrq increments:
Check network traffic. This behavior might be caused by network loops such as Layer 2 loops, broadcast storms, unwanted packets, large quantities of ARP requests, or loops on the hardware if there are multiple switches connected to the relevant ports. STP breaking after an upgrade could be one of the main factors behind Layer 2 loops.
It can also happen when user traffic is not offloaded to hardware, either because offloading is disabled at the Firewall Policy level or because the traffic is traversing a non-NPU interface. In the example shown above, most of the sessions go through the CPU ('Average sessions') and not through the NPU ('Average NPU sessions'). This can also be confirmed by looking at the dashboard’s 'Sessions' widget.
Device identification (Device Detection) on interfaces is another contributor to SoftIrqs.
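To put a number on how little traffic is offloaded, the two session counters can be compared. This is a rough Python sketch under one assumption: that 'Average sessions' counts all sessions, including those handled by the NPU. The helper names `one_minute_average` and `npu_share` are hypothetical, and the sample lines are copied from the example output in this article.

```python
import re


def one_minute_average(output: str, label: str) -> int:
    """Return the first (1-minute) figure that follows '<label>:' in the output."""
    match = re.search(rf"{re.escape(label)}: (\d+) sessions", output)
    if match is None:
        raise ValueError(f"{label!r} not found in output")
    return int(match.group(1))


def npu_share(output: str) -> float:
    """Fraction of sessions reported as NPU sessions over the last minute.

    Assumption: 'Average sessions' is the total session count.
    """
    total = one_minute_average(output, "Average sessions")
    npu = one_minute_average(output, "Average NPU sessions")
    return npu / total if total else 0.0


# Session lines copied from the example output in this article.
sample = (
    "Average sessions: 291687 sessions in 1 minute, 293226 sessions in 10 minutes, "
    "293696 sessions in 30 minutes\n"
    "Average NPU sessions: 35 sessions in last 1 minute, 36 sessions in last 10 minutes, "
    "36 sessions in last 30 minutes\n"
)

print(f"NPU share of sessions: {npu_share(sample):.3%}")
```

With the figures above (35 NPU sessions out of 291687), the share is far below one percent, which matches the observation that almost all sessions are handled by the CPU.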
While observing high CPU usage with 'get system performance status', it is possible to see if SoftIrq levels are stable or increasing by executing the command repeatedly.
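The comparison between repeated runs can be sketched as follows. This is an illustrative Python helper, not part of FortiOS: the function name `softirq_trend` and the 5-point tolerance are assumptions, and all readings other than the 33% figure from the example output are made up.

```python
def softirq_trend(samples, tolerance=5):
    """Classify a series of softirq percentages taken from repeated runs.

    The tolerance (in percentage points) is an arbitrary assumption used to
    ignore small fluctuations between the first and last reading.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    delta = samples[-1] - samples[0]
    if delta > tolerance:
        return "increasing"
    if delta < -tolerance:
        return "decreasing"
    return "stable"


# Hypothetical overall softirq readings from three consecutive runs.
print(softirq_trend([33, 35, 34]))
print(softirq_trend([33, 48, 67]))
```

Comparing only the first and last reading is a deliberate simplification; a longer capture might justify averaging or a per-core comparison instead.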
Troubleshooting steps:
Example output:
============ Counters ===========
id=20085 trace_id=1107 func=ip_route_input_slow line=1704 msg="reverse path check fail, drop"
If sessions are not being offloaded, check the FortiGate session list ('diagnose sys session list') for the reason the traffic is not offloading. See the 'no_ofld_reason' field in the FortiGate documentation.
In certain scenarios, Layer 2 loops or switch issues can cause traffic to be looped and forwarded to the firewall on the physical interface with untagged packet information. This can potentially lead to CPU core spikes on the firewall.
Run 'get system performance status' to verify whether CPU usage is high. In the case of a broadcast storm or Layer 2 loop coming from the switches, softirq will be high. The output should look like below.
get sys perf status
If softirq climbs as high as 100%, even on only half of the CPU cores, it indicates that packets are being looped between the firewall and the switches.
To investigate this issue further, run 'dia netlink interface packet-rate' in the CLI to see whether the firewall interface is receiving a high number of packets. Run this command 4-5 times at intervals of 2-3 seconds and verify the number of packets being received (TX-rate) at the firewall interface.
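The "half of the cores" rule of thumb above can be expressed as a small check. This is a sketch only: the function name `loop_suspected` and the 90% threshold are assumptions, and a positive result is a hint to investigate further, not proof of a loop.

```python
def loop_suspected(per_core_softirq, threshold=90):
    """Return True when at least half of the cores are at or above the threshold.

    per_core_softirq maps a core label to its softirq percentage.
    The 90% threshold is an assumed stand-in for 'close to 100%'.
    """
    if not per_core_softirq:
        return False
    busy = [cpu for cpu, pct in per_core_softirq.items() if pct >= threshold]
    return 2 * len(busy) >= len(per_core_softirq)


# Per-core values from the example output earlier in this article.
print(loop_suspected({"CPU0": 45, "CPU1": 81, "CPU2": 67, "CPU3": 34}))

# Hypothetical values for a device where two of four cores are saturated.
print(loop_suspected({"CPU0": 100, "CPU1": 98, "CPU2": 12, "CPU3": 9}))
```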
diagnose netlink interface packet-rate
Collect the sniffer output below to identify what types of packets are arriving at the firewall interface. For example, they could be ICMP, ESP, or any other TCP/UDP packets.
SSH1:
In the case of ESP packets:
SSH2:
The sniffer will capture 2000 packets. It is possible to adjust the packet count, but be careful when running the sniffer on a device whose CPU is already under load.
Check on the switch side to find out why the switches are forwarding a high number of packets to the firewall. Ask the switch administrators to rate-limit the packets at the switch end, or check whether the switches are sending untagged or legitimate traffic.
This is how untagged packets look in the sniffer output. Tagged packets will include VLAN information.
2024-08-27 19:14:20.969882 port1-- x.x.x.x -> y.y.y.y: ESP(spi=0xdba59c0a,seq=0x61a)
In the case of ESP traffic, the output will show the same ESP seq numbers repeating multiple times.
The sniffer output gives an idea of whether the firewall or the switch is creating the loop. If the FortiGate is found to be creating the loop, reach out to TAC and share all of the log output collected above.
On the firewall, it is possible to configure an Access Control List (ACL) on the physical interface to block traffic that is untagged or not legitimate.
Below is an example of how untagged ESP and IKE packets received on the port1 physical interface are blocked.
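Repeated sequence numbers are easy to miss in a long capture, so a saved sniffer log can be scanned for them. The following Python sketch assumes the line format shown above. The function name `repeated_esp` is hypothetical, and the sample lines are made up from the single line in this article, with altered timestamps and one extra sequence number.

```python
import re
from collections import Counter

# Matches the "ESP(spi=0x...,seq=0x...)" part of a sniffer line.
ESP = re.compile(r"ESP\(spi=(0x[0-9a-fA-F]+),seq=(0x[0-9a-fA-F]+)\)")


def repeated_esp(lines):
    """Return {(spi, seq): count} for every ESP packet seen more than once."""
    counts = Counter()
    for line in lines:
        match = ESP.search(line)
        if match:
            counts[match.groups()] += 1
    return {key: n for key, n in counts.items() if n > 1}


# Hypothetical capture: three copies of the same packet and one distinct packet.
sample_lines = [
    "2024-08-27 19:14:20.969882 port1-- x.x.x.x -> y.y.y.y: ESP(spi=0xdba59c0a,seq=0x61a)",
    "2024-08-27 19:14:20.969901 port1-- x.x.x.x -> y.y.y.y: ESP(spi=0xdba59c0a,seq=0x61a)",
    "2024-08-27 19:14:20.969917 port1-- x.x.x.x -> y.y.y.y: ESP(spi=0xdba59c0a,seq=0x61a)",
    "2024-08-27 19:14:20.970110 port1-- x.x.x.x -> y.y.y.y: ESP(spi=0xdba59c0a,seq=0x61b)",
]

for (spi, seq), n in repeated_esp(sample_lines).items():
    print(f"spi={spi} seq={seq} seen {n} times")
```

Packets are keyed on the (spi, seq) pair rather than on seq alone, since different tunnels can legitimately reuse the same sequence number.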
config firewall acl