FortiGate
msingh_FTNT
Staff
Article Id 230387
Description

This article describes how to troubleshoot high CPU issues.

Scope

FortiGate.
Solution

The first step should always be to run 'get sys perf status'. The output will look similar to the following (the number of CPU lines depends on how many cores the unit has):


CPU states: 0% user 0% system 0% nice 1% idle 0% iowait 0% irq 99% softirq
CPU0 states: 0% user 0% system 0% nice 1% idle 0% iowait 0% irq 99% softirq
CPU1 states: 0% user 0% system 0% nice 2% idle 0% iowait 0% irq 98% softirq
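
As a rough illustration of how these lines can be read, the sketch below parses one 'CPU states' line and reports the busiest non-idle field. The function names (parse_cpu_line, dominant_field) are hypothetical helpers written for this example, not part of FortiOS, and the parsing assumes the line format shown above.

```python
import re

def parse_cpu_line(line):
    """Return (label, {field: percent}) for one 'CPU states' line."""
    label, _, rest = line.partition(" states:")
    # Each value looks like '99% softirq': a percentage followed by a field name.
    values = {name: int(pct) for pct, name in re.findall(r"(\d+)% (\w+)", rest)}
    return label.strip(), values

def dominant_field(values):
    """Return the busiest field, ignoring 'idle'."""
    busy = {name: pct for name, pct in values.items() if name != "idle"}
    return max(busy, key=busy.get)

sample = "CPU0 states: 0% user 0% system 0% nice 1% idle 0% iowait 0% irq 99% softirq"
label, values = parse_cpu_line(sample)
print(label, dominant_field(values))  # CPU0 softirq
```

The field that comes out on top (user, system, or softirq) is what decides which troubleshooting path to follow.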


The next step depends on which field shows the high CPU usage. The most common scenarios are the following:

 

  1. The first issue could be high user space:


CPU states: 99% user 0% system 0% nice 1% idle 0% iowait 0% irq 0% softirq
CPU0 states: 99% user 0% system 0% nice 1% idle 0% iowait 0% irq 0% softirq
CPU1 states: 98% user 0% system 0% nice 2% idle 0% iowait 0% irq 0% softirq

 

If user usage is high, look only at 'diag sys top': user space CPU usage comes from the processes listed by that command.
Depending on which process is responsible, further debugging can be run against it. For example, if it is the HTTPS daemon (httpsd), try 'di de app httpsd -1'.

 

  2. The second issue could be high system space:


CPU states: 0% user 99% system 0% nice 1% idle 0% iowait 0% irq 0% softirq
CPU0 states: 0% user 99% system 0% nice 1% idle 0% iowait 0% irq 0% softirq
CPU1 states: 0% user 98% system 0% nice 2% idle 0% iowait 0% irq 0% softirq

 

This is the kernel's own CPU usage, e.g. work related to running the operating system itself. Do not use 'diag sys top' to troubleshoot further at this point.

 

At this point, run the CPU profiler using the commands below:


diag sys profile cpumask <ID> <----- Only needed when some CPUs are idle: specify the ID of the busy CPU. Skip this if all CPUs are busy.
diag sys profile start
<wait 5-10 seconds>
diag sys profile stop
diag sys profile show order
diag sys profile show detail
diag sys profile sysmap

 

Now check which process appears at the top of the output when the last command is run. That process is the problematic one and is causing the high CPU usage. If its name is unfamiliar, try searching for tickets or bugs related to it.

 

  3. The last common scenario is high usage in softirq (99% softirq):


CPU states: 0% user 0% system 0% nice 1% idle 0% iowait 0% irq 99% softirq
CPU0 states: 0% user 0% system 0% nice 1% idle 0% iowait 0% irq 99% softirq
CPU1 states: 0% user 0% system 0% nice 2% idle 0% iowait 0% irq 98% softirq

 

This usage comes from the firewall processing packets. If the firewall is receiving a high number of packets on its interfaces, or is under a DoS attack, softirq usage will be high. Again, do not use 'diag sys top' to troubleshoot further at this point.


There is no specific debug command that shows which packets or interfaces are responsible for this traffic.

Instead, rely on the interface widget/stats to see whether any particular interface is receiving too much traffic. Try running an open sniffer to identify the type of traffic causing it. For example, a flood of ARP packets could indicate a layer 2 loop inside the user network.

 

If none of this helps, try disabling one interface at a time to see which one brings the CPU usage down. That interface was most likely receiving the heavy traffic.

 

It can also be helpful to run the CPU profiler as described for the high system space scenario. High softirq can point to offloading issues, and the profiler can give some insight. Below is an example of high softirq on one CPU core caused by decryption/encryption not being offloaded.


diagnose sys profile show order
0xffffffc000487178: 2324 rijndaelDecrypt+0x638/0xc70 
0xffffffc000084260: 444 default_idle+0x10/0x20
0xffffffc0003c98c0: 105 nf_hook_slow+0xa0/0x130
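
To make the weight of each entry easier to see, the sketch below turns the three example lines into percentage shares. It assumes the second column is a sample count for the symbol that follows, which is an interpretation of the output rather than something stated here, and sample_shares is a hypothetical helper written for this example.

```python
import re

# The three lines from the example above: address, count, symbol+offset/size.
PROFILE = """\
0xffffffc000487178: 2324 rijndaelDecrypt+0x638/0xc70
0xffffffc000084260: 444 default_idle+0x10/0x20
0xffffffc0003c98c0: 105 nf_hook_slow+0xa0/0x130
"""

# Capture the count and the symbol name (everything before the '+').
LINE = re.compile(r"^0x[0-9a-f]+:\s+(\d+)\s+([^+\s]+)")

def sample_shares(text):
    """Return {symbol: percent of all counted samples} for profiler lines."""
    counts = {}
    for line in text.splitlines():
        match = LINE.match(line.strip())
        if match:
            name = match.group(2)
            counts[name] = counts.get(name, 0) + int(match.group(1))
    total = sum(counts.values())
    if total == 0:
        return {}
    return {name: 100.0 * n / total for name, n in counts.items()}

for name, share in sorted(sample_shares(PROFILE).items(), key=lambda kv: -kv[1]):
    print(f"{name:20s} {share:5.1f}%")
```

Under that assumption, rijndaelDecrypt accounts for roughly 81% of the listed samples, which is what points to encryption/decryption being handled by the CPU.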

 

Note:

If CPU usage is high in system space or softirq and 'diag sys top' also shows high CPU usage, the figures from 'diag sys top' are misleading.


Consider a scenario where the firewall has only one core, with 90 percent system usage and 10 percent user space usage, and 'diag sys top' shows the IPS engine taking 99 percent CPU.

 

This is not accurate for the core as a whole: only 10 percent of the usage is in user space, and IPS is taking 99 percent of that 10 percent. It is not actually using 99 percent of the whole CPU core.

 

This is why it is important not to rely solely on 'diag sys top' and to look beyond that command.
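
The arithmetic behind the single-core example above can be written out as a short sketch. It follows the reasoning given in this note (the 'diag sys top' figure is a share of user space, not of the whole core), and share_of_core is a hypothetical helper name.

```python
def share_of_core(user_space_pct, process_pct_in_top):
    """Scale a 'diag sys top' figure by the user space share of the core."""
    return user_space_pct * process_pct_in_top / 100.0

# 10% user space, IPS shown at 99% in 'diag sys top'.
print(share_of_core(10, 99))  # 9.9
```

So in that scenario IPS accounts for about 9.9 percent of the core, while the remaining 90 percent is system usage that 'diag sys top' does not explain.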

On a KVM deployment with DPDK enabled, DPDK is designed to busy-poll at 100%. DPDK runs inside the IPS processes, so IPS will always appear busy when DPDK is enabled.