Created on 11-20-2022 10:27 PM Edited on 11-29-2024 09:13 AM By Anthony_E
Description |
This article describes how to troubleshoot high CPU issues. |
Scope | FortiGate. |
Solution |
It is recommended to follow this guide to debug CPU issues in a structured way. This article will cover the most common types of CPU load issues: CPU load in user space, system space, or due to softirqs.
Run the command 'get sys perf status' to see in which area the load occurs. The output will look similar to the following (the number of per-core lines depends on how many cores the unit has):
The first line 'CPU states:' shows the average load across all CPU cores. The following lines 'CPU0 states:' show the load per single CPU core.
get sys perf stat
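When the output is collected over SSH or from a log, a small script can pull out the per-area percentages and show which area dominates on each core. The following is a minimal sketch, not a Fortinet tool. The sample text is illustrative only and its layout (one 'CPU states:' line followed by 'CPUn states:' lines, each a list of 'N% area' pairs) is an assumption; the function names are hypothetical.

```python
import re

# Illustrative sample only: values and layout are assumed, not captured from a real unit.
SAMPLE = """\
CPU states: 12% user 3% system 0% nice 80% idle 0% iowait 0% irq 5% softirq
CPU0 states: 45% user 2% system 0% nice 50% idle 0% iowait 0% irq 3% softirq
CPU1 states: 1% user 4% system 0% nice 30% idle 0% iowait 0% irq 65% softirq
"""

LINE_RE = re.compile(r"^(CPU\d*) states:\s*(.*)$")
FIELD_RE = re.compile(r"(\d+)%\s+(\w+)")


def parse_cpu_states(text):
    """Map 'CPU' (average) and 'CPUn' (per core) to {area: percent}."""
    result = {}
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # skip lines that are not CPU state lines
        fields = FIELD_RE.findall(match.group(2))
        result[match.group(1)] = {name: int(pct) for pct, name in fields}
    return result


def busiest_area(states, areas=("user", "system", "softirq")):
    """Return whichever of the three areas covered in this article is highest."""
    return max(areas, key=lambda area: states.get(area, 0))


if __name__ == "__main__":
    for cpu, states in parse_cpu_states(SAMPLE).items():
        print(cpu, "->", busiest_area(states), states)
```

The result only tells which of the three troubleshooting paths in this article to follow; it does not replace reading the full output.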
The 'diagnose sys mpstat' command also breaks the load down by area. Unlike 'get sys perf stat', it can keep sampling over a longer period.
The first parameter defines the sampling interval in seconds (default 5). The second parameter defines the number of iterations (repeats). A high value such as 99999 keeps the command running for a very long time, which is useful for long-term monitoring.
diagnose sys mpstat 5 99999
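To get a feel for how long such a run lasts, the two parameters can simply be multiplied. The sketch below assumes the total run time is approximately interval times iterations and ignores any sampling overhead; the helper name is hypothetical.

```python
from datetime import timedelta


def mpstat_duration(interval_s=5, count=1):
    """Approximate run time of 'diagnose sys mpstat <interval> <count>'."""
    return timedelta(seconds=interval_s * count)


if __name__ == "__main__":
    # 5 s x 99999 iterations = 499995 s
    print(mpstat_duration(5, 99999))  # 5 days, 18:53:15
```

In other words, the example command would keep sampling for almost six days unless it is interrupted earlier.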
get sys perf stat
If the user-space value is high, the command 'diag sys top' shows which user-space process is consuming the CPU. While the command is running, press 'P' to sort by CPU usage:
In the example below several of the IPS engines show a higher CPU load of up to 57% on a single core.
diag sys top
Another example shows the node process using 92% of a single core.
diag sys top
get sys perf stat
This is the kernel's own CPU usage, e.g. tasks related to running the operating system (do not use 'diag sys top' to troubleshoot this any further). Typical reasons for a high CPU load in system space are drivers in kernel space, kernel-based encryption/decryption, disk I/O, etc.
At this point, run the CPU profiler with the commands below:
diag sys profile cpumask <ID> <----- Not needed if all CPUs are busy. Otherwise, specify the ID of the busy CPU.
The CPU mask range starts at 0. For example, on a unit with 16 cores the mask range is 0 to 15. If core 'CPU0' is busy, run 'diag sys profile cpumask 0'; if 'CPU1' is busy, run 'diag sys profile cpumask 1'.
If the busy CPU core is not known in advance, for example when the profiling is scripted or executed in an automated way, the command 'diag sys profile report' will run the CPU profiling on the cores that are detected as busy.
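The mapping from busy cores to cpumask commands can be sketched as follows. This is only an illustration: the idle threshold of 20% used to call a core "busy" is an arbitrary assumption, and the function name and input format are hypothetical. The only CLI command it produces is the 'diag sys profile cpumask <ID>' command shown above.

```python
def cpumask_commands(idle_by_core, idle_threshold=20):
    """Return one 'diag sys profile cpumask <ID>' command per busy core.

    idle_by_core maps the zero-based core ID to its idle percentage.
    An empty list is returned when every core is busy (no mask needed)
    or when no core is busy.
    """
    busy = [core for core, idle in sorted(idle_by_core.items())
            if idle <= idle_threshold]
    if len(busy) == len(idle_by_core):
        return []  # all cores busy: the mask step can be skipped
    return [f"diag sys profile cpumask {core}" for core in busy]


if __name__ == "__main__":
    # Hypothetical readings: core 1 is nearly saturated, the rest are idle.
    for command in cpumask_commands({0: 95, 1: 4, 2: 90, 3: 88}):
        print(command)
```

Core IDs are zero-based here, matching the mask range described above.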
diag sys profile report
Check which kernel function appears at the top of the output of the last command. That function is the problematic one and is causing the high CPU usage.
To look up known functions implemented in the Linux kernel, refer to the Linux documentation. Otherwise, contact technical support, who can research internally what the busy function is responsible for.
get sys perf stat
This CPU usage is driven by traffic, for example when the firewall is processing many packets. High softirq usage will be visible if the firewall is receiving a high number of packets on its interfaces, is under a DoS attack, or there is a layer 2 loop. Another common reason is a lack of offloading, for example because the device-identification feature is enabled on busy interfaces or proxy-based features are in use.
At this point, rely on the interface widget/stats to see whether any particular interface is receiving too much traffic. On the CLI, the following commands can be used to list interface stats:
fnsysctl ifconfig
diag debug report
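Interface counters are cumulative, so a single reading says little about the current load. Taking two readings a known number of seconds apart and comparing them gives a packet rate per interface. The sketch below assumes the RX packet counters have already been extracted into dictionaries by hand or by another script; it does not parse CLI output, ignores counter wrap-around, and all names and numbers are hypothetical.

```python
def rx_rates(before, after, interval_s):
    """Return (interface, packets per second) pairs, busiest first.

    before and after map interface name to a cumulative RX packet counter.
    Interfaces missing from either snapshot are skipped.
    """
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    rates = {
        name: (after[name] - before[name]) / interval_s
        for name in before
        if name in after
    }
    return sorted(rates.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical counters read 10 seconds apart.
    first = {"port1": 1000, "port2": 5000}
    second = {"port1": 1500, "port2": 905000}
    for name, pps in rx_rates(first, second, 10):
        print(f"{name}: {pps:.0f} packets/s")
```

The interface at the top of the list is the first candidate to inspect with the sniffer.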
Try running an open sniffer to find the interface and traffic type that is causing the load (for example, a large number of ARP packets could indicate a layer 2 loop inside the user network):
diag sniffer packet any '' 6 100 l
If none of the above identifies the source, disable one interface at a time and observe which one brings the CPU usage down; that interface was most likely receiving the heavy traffic.
It can be helpful to run the CPU profiler as described in step 2. High softirq can point towards offloading issues and the profiler can give some insights. Below is an example with high softirq:
0xffffffff802662d4: 61 __do_softirq+0x64/0x110 <----- This is the most busy function.
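When the profiler output is long, sorting it by sample count makes the busiest function obvious. The sketch below assumes each line has the shape of the example above (address, sample count, then symbol+offset/size). Apart from the '__do_softirq' line taken from this article, the sample lines and function names are hypothetical placeholders.

```python
import re

# Address, sample count, then symbol+offset/size, as in the line shown above.
PROFILE_RE = re.compile(
    r"^(0x[0-9a-fA-F]+):\s+(\d+)\s+(\S+?)\+0x[0-9a-fA-F]+/0x[0-9a-fA-F]+"
)

# Only the __do_softirq line comes from the article; the rest is made up.
SAMPLE = """\
0xffffffff80208ba8: 12 hypothetical_func_a+0x10/0x40
0xffffffff802662d4: 61 __do_softirq+0x64/0x110 <----- This is the most busy function.
0xffffffff8020a1c0: 7 hypothetical_func_b+0x8/0x30
"""


def top_functions(text):
    """Return (symbol, samples) pairs sorted by sample count, highest first."""
    rows = []
    for line in text.splitlines():
        match = PROFILE_RE.match(line.strip())
        if match:
            rows.append((match.group(3), int(match.group(2))))
    return sorted(rows, key=lambda row: row[1], reverse=True)


if __name__ == "__main__":
    for symbol, samples in top_functions(SAMPLE):
        print(f"{samples:5d}  {symbol}")
```

Trailing annotations after the symbol, such as the arrow comment in the example, are ignored by the pattern.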
If CPU spikes occur and the user, system, or softirq percentage increases, check whether a significant amount of traffic is passing through the firewall by using the 'get system performance status' command. The average and maximum network usage values can indicate traffic bursts; whether the device can handle that amount of traffic depends on its specifications.
get sys perf stat
In the output above, the device is a FortiGate desktop model, and the spikes in network usage correlate with the times of high CPU usage. To verify where this traffic is coming from and where it is going, check FortiView Sources and FortiView Destinations for anything that can be identified (for example updates, file transfers, or scripts).
Note: If CPU usage is high in system space or softirq and 'diag sys top' also shows a process with high CPU usage, the 'diag sys top' figure is misleading.
For example, if 'diag sys top' shows IPS at 99 percent while only 10 percent of the core is spent in user space, the reading is not accurate at face value: IPS is taking 99 percent of the 10 percent that user space has left from the total usage, not 99 percent of the whole CPU core.
This is why it is important not to rely solely on 'diag sys top' and to look beyond that command.
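The arithmetic behind this note can be written out explicitly. The sketch below follows the reading given in the note, namely that the process figure is a share of the user-space portion only; the function name is hypothetical.

```python
def share_of_core(user_space_pct, top_pct):
    """Scale a 'diag sys top' percentage by the user-space share of the core."""
    return user_space_pct * top_pct / 100.0


if __name__ == "__main__":
    # 99 percent of the 10 percent spent in user space
    print(share_of_core(10, 99))  # 9.9
```

Under that reading, the IPS process in the example accounts for roughly 10 percent of the core, and the remaining load has to be explained by system space or softirq.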
On a KVM instance with DPDK enabled, keep in mind that DPDK is designed to busy-poll at 100%. DPDK runs inside the IPS processes, so IPS will always appear busy when DPDK is enabled. |
Copyright 2024 Fortinet, Inc. All Rights Reserved.