FortiGate
FortiGate Next Generation Firewall utilizes purpose-built security processors and threat intelligence security services from FortiGuard labs to deliver top-rated protection and high performance, including encrypted traffic.
msingh_FTNT
Staff
Staff
Article Id 230387
Description

This article describes how to troubleshoot high CPU issues.

Scope FortiGate.
Solution

It is recommended to follow this guide to debug CPU issues in a structured way. This article will cover the most common types of CPU load issues: CPU load in user space, system space, or due to softirqs.

 

Run the command 'get sys perf status' to show in which area the load is present. The output will be something as below (depending on the number of cores the unit has):

 

The first line 'CPU states:' shows the average load across all CPU cores. The following lines 'CPU0 states:' show the load per single CPU core.

 

get sys perf stat
CPU states: 18% user 41% system 0% nice 20% idle 0% iowait 0% irq 21% softirq
CPU0 states: 21% user 42% system 0% nice 6% idle 0% iowait 0% irq 31% softirq
CPU1 states: 15% user 56% system 0% nice 11% idle 0% iowait 0% irq 18% softirq
CPU2 states: 15% user 38% system 0% nice 5% idle 0% iowait 0% irq 42% softirq
...

 

Another command that shows the single areas is the 'diagnose sys mpstat' command. It lists the single areas but can do so over a longer time which the previous 'get sys perf stat' can not.

 

The first parameter will define the interval in seconds. <value> Delay in seconds (default 5).

The second parameter defines the number of interactions/repeats. A high value like 99999 will let the commands run for a very long time which is used for long-time monitoring.

 

diagnose sys mpstat 5 99999
Gathering data, wait 5 sec, press any key to quit.
..0..1..2..3..4
TIME CPU %usr %nice %sys %iowait %irq %soft %steal %idle
10:28:44 AM all 19.21 0.00 43.50 0.00 0.05 19.49 0.00 17.75
0 20.16 0.00 39.12 0.00 0.00 31.94 0.00 8.78
1 12.97 0.00 60.68 0.00 0.00 11.78 0.00 14.57
2 16.77 0.00 41.72 0.00 0.00 36.73 0.00 4.79
3 15.57 0.00 59.48 0.00 0.00 12.97 0.00 11.98
4 23.75 0.00 39.72 0.00 0.00 28.34 0.00 8.18
5 25.15 0.00 36.93 0.00 0.00 0.00 0.00 37.92
6 19.16 0.00 37.13 0.00 0.00 38.12 0.00 5.59
7 17.76 0.00 44.11 0.00 0.60 0.00 0.00 37.52
8 15.17 0.00 35.73 0.00 0.00 41.72 0.00 7.39
9 20.56 0.00 45.91 0.00 0.00 0.00 0.00 33.53
10 22.36 0.00 37.33 0.00 0.00 32.34 0.00 7.98
11 21.16 0.00 44.11 0.00 0.00 0.00 0.00 34.73


Depending on the area in which this command shows the CPU usage the next step will vary. Following will be the most common scenarios run into:

 

  1. The first issue could be high user space:

 

get sys perf stat
CPU states: 99% user 0% system 0% nice 1% idle 0% iowait 0% irq 0% softirq
CPU0 states: 99% user 0% system 0% nice 1% idle 0% iowait 0% irq 0% softirq
CPU1 states: 98% user 0% system 0% nice 2% idle 0% iowait 0% irq 0% softirq

 

If this section is high, the command 'diag sys top' will show which userspace process is allocating the CPU resources.

While the command runs enter 'P' to sort by CPU usage:

 

In the example below several of the IPS engines show a higher CPU load of up to 57% on a single core.

 

diag sys top
ipsengine 492 S < 57.2 1.9 6
ipsengine 485 R < 48.7 1.9 8
ipsengine 484 R < 42.7 1.9 4
ipsengine 486 R < 42.7 1.9 0
ipsengine 489 R < 40.7 1.9 10
ipsengine 14217 R < 39.8 1.8 2

 

Another example will show the node process using 92% CPU resources on a single core.

 

diag sys top
Run Time: 79 days, 15 hours and 2 minutes
24U, 0N, 39S, 16I, 0WA, 0HI, 21SI, 0ST; 16046T, 4136F
node 5188 R 92.0 1.5 3


Depending on which process is causing the issue in the user space, it is possible to run further debugging. For example, if it is HTTPS, try 'diag debug app httpsd -1'.

 

 

  1. The second issue could be a high CPU load in system space:

 

get sys perf stat
CPU states: 0% user 99% system 0% nice 1% idle 0% iowait 0% irq 0% softirq
CPU0 states: 0% user 99% system 0% nice 1% idle 0% iowait 0% irq 0% softirq
CPU1 states: 0% user 98% system 0% nice 2% idle 0% iowait 0% irq 0% softirq

 

This is the kernel's own CPU usage, eg. tasks related to running the operating system (do not use 'diag sys top' at this point to further troubleshoot the issue). Typical reasons for a high CPU load in system space could be drivers in kernel space, kernel-based encryption/decryption, disk in/out, etc.

 

At this point, run the CPU profiler, commands below:

 

diag sys profile cpumask <ID> <----- If all CPUs are busy, then do not need to run this. Otherwise specifying busying CPU ID.
diag sys profile start
<wait 5-10 seconds>
diag sys profile stop
diag sys profile show order
diagnose sys profile show detail
diagnose sys profile sysmap 

 

CPU mask range starts with 0. For example, CPU has 16 cores. Therefore CPU mask range will be from 0 to 15. For instance in case CPU core "CPU0" is busy it is necessary to run 'diag sys profile cpumask 0', in case "CPU1" is busy it is necessary to run 'diag sys profile cpumask 1').

 

If the busy CPU core is not known in advance for example when the profilining should be scripted/executed in an automated way then the following command 'diag sys profile report' will run the CPU profiling over cores that are detected as busy.

 

diag sys profile report
CPU Kernel Percentages:
0: 76% (76 of 100). Will profile.
1: 43% (43 of 100). Not profiling.
2: 77% (77 of 100). Will profile.
3: 82% (82 of 100). Will profile.
4: 84% (84 of 100). Will profile.
5: 53% (53 of 100). Will profile.
6: 78% (78 of 100). Will profile.
7: 46% (46 of 100). Not profiling.
8: 77% (77 of 100). Will profile.
9: 56% (56 of 100). Will profile.
10: 81% (81 of 100). Will profile.
11: 43% (43 of 100). Not profiling.
Kernel=0xffffffff80200190-0xffffffff8061699d, module-gap=0x1f9e9663
Profile buffer: profile step=4, sz=12859428-3214857, last-addr=0xffffffffa082d017
warning: functions cannot be profiled properly, early_idt_handlers-_einittext
diagnose sys profile show order
0xffffffc000487178: 2324 rijndaelDecrypt+0x638/0xc70 
0xffffffc000084260: 444 default_idle+0x10/0x20
0xffffffc0003c98c0: 105 nf_hook_slow+0xa0/0x130

 

Check which kernel function shows up on top in this output when the last command is run. That function is the problematic one and causes a high CPU.

 

To verify for known functions implemented in the Linux kernel refer to the Linux documentation. Otherwise contact technical support to research internally to clarify what the busy function is responsible for.

 

 

  1. The last common scenario is usage in softirq (99% softirq).

 

get sys perf stat
CPU states: 0% user 0% system 0% nice 1% idle 0% iowait 0% irq 99% softirq
CPU0 states: 0% user 0% system 0% nice 1% idle 0% iowait 0% irq 99% softirq
CPU1 states: 0% user 0% system 0% nice 2% idle 0% iowait 0% irq 98% softirq

 

This high CPU usage comes from traffic, for example ,if the firewall is processing many packets.

If the firewall is receiving a high number of packets on its interface, the firewall is under DoS attack or there is a layer2 loop, high softirq usage will be visible. Another common reason could be the lack of offloading for example due to the device-identification feature being used on busy interfaces or proxy-based features being used.


There is no particular debug method to find which packets or interfaces are receiving this traffic.

At this point, rely on the interface widget/stats to see if any particular interface is receiving too much traffic.

On the cli the following commands can be used to list interface stats

 

fnsysctl ifconfig

diag debug report

 

Try running an open sniffer to to find the interface and traffic type that is causing the load (for example if a lot of ARP packets coming, it could be a layer2 loop inside the user network):

 

diag sniffer packet any '' 6 100 l

 

If nothing works, then try disabling one interface at a time to see which one brought down the CPU usage, most likely that interface was receiving high traffic.

 

It can be helpful to run the CPU profiler as described in step 2. High softirq can point towards offloading issues and the profiler can give some insights. Below is an example with high softirq:

 

0xffffffff802662d4: 61 __do_softirq+0x64/0x110 <----- This is the most busy function.
0xffffffff80208fb8: 57 poll_idle+0x18/0x30

 

In cases where there may be CPU spikes and user/system/softirq % increased, observe if there is significant amount of traffic running through the firewall using the get system performance status command. The average network usage and maximal network usage can be indications for traffic bursts and it depends on the device specifications if it can handle this amount of traffic.

 

get sys perf stat
Average network usage: 88262/88410 kbps in 1 minute, 201578/199088 kbps in 10 minutes, 144360/139965 kbps in 30 minutes

Maximal network usage: 264123/264285 kbps in 1 minute, 952671/931682 kbps in 10 minutes, 952671/931682 kbps in 30 minutes

 

In the output above, the device is only a FortiGate desktop model and there were instances when the network usage was spiking which correlates with high CPU usage timings. To further verify where is this significant traffic coming from and where it is heading, check in FortiView Sources and FortiView Destinations if there are anything helpful that can be identified (ex. updates/file transfers/script/etc.). 

 

Note:

If CPU usage is high in system space or soft IRQ and there is high CPU usage in  'diag sys top', the latter command is giving false information.


Consider one scenario where there is only one core on the firewall which has 90 percent system usage and 10 percent user space usage. After running 'diag sys top', the IPS engine is taking 99 percent CPU.

 

This is not accurate because there is only 10 percent usage in user space and IPS is taking 99 percent of that 10 percent left from total usage. It is not using 99 percent of the whole CPU core.

 

This is why it is important to not rely solely on 'diag sys top': it is necessary to look beyond that command.

 

If it is KVM with DPDK enabled, then DPDK is designed to be 100% busy polling. DPDK is running in IPS processes, so IPS will be always busy when DPDK is enabled.