Solution |
A policy config change in a FortiGate with about 24K policies causes high CPU usage in CPU0 for a long time. This was tested in a FortiGate chassis 6301F model running 7.2.11 version, having about 24K policies in 1 VDOM. This also applies to the other FortiGate models, as it is a FortiOS behavior.
Any policy change in a VDOM with 24K policies took 170 seconds (about 3 minutes) to load the policies into the kernel. During this time, CPU0 will be at 100% CPU for approximately 3 minutes in all FPCs because, from FortiOS 7.0 version, cmdbsvr-affinity is set to CPU0 in system global settings (set cmdbsvr-affinity '1'). This will map the cmdbsvr process and its child process to run only on CPU0. Long time taken to update policies in kernel and CPU0 being busy is an expected behavior because installing so many policies into the kernel will take time and CPU.
FortiGate(global) # fnsysctl cat /proc/311/status Name: cmdbsvr State: S (sleeping) Tgid: 311 Pid: 311 PPid: 1 TracerPid: 0 Uid: 0 0 0 0 Gid: 0 0 0 0 FDSize: 64 Groups: VmPeak: 1479768 kB VmSize: 1475272 kB VmLck: 0 kB VmPin: 0 kB VmHWM: 676900 kB VmRSS: 672384 kB VmData: 648120 kB VmStk: 284 kB VmExe: 42820 kB VmLib: 23336 kB VmPTE: 1640 kB VmSwap: 0 kB Threads: 1 SigQ: 0/128436 SigPnd: 0000000000000000 ShdPnd: 0000000000000000 SigBlk: 0000000000000000 SigIgn: 0000000000001200 SigCgt: 000000018003e4e8 CapInh: 0000000000000000 CapPrm: ffffffffffffffff CapEff: ffffffffffffffff CapBnd: ffffffffffffffff Cpus_allowed: 000001 Cpus_allowed_list: 0 voluntary_ctxt_switches: 2563806 nonvoluntary_ctxt_switches: 14198
This means it is allowed to run on only 1 CPU, which is CPU0.
The following are the child processes in cmdbsvr, and they are all mapped to CPU0.
- cmdbsvr_cfgsave.
- cmdbsvr_iprope.
- cmdbsvr_writeco.
- cmdbsvr_checksu.
When a policy config is changed, cmdbsvr_cfgsave and cmdbsvr_iprope start to consume CPU in CPU0. From the diagnose sys top output, cmdbsvr_cfgsave and cmdbsvr_iprope share the idle CPU, and in this case, each uses 50% of the CPU for about 1 minute.
Run Time: 3 days, 20 hours and 16 minutes 3U, 0N, 0S, 97I, 0WA, 0HI, 0SI, 0ST; 32180T, 22840F cmdbsvr_cfgsave 16016 R 50.0 0.2 0 cmdbsvr_iprope 16015 R 50.0 0.1 0
Run Time: 3 days, 20 hours and 16 minutes 3U, 0N, 0S, 97I, 0WA, 0HI, 0SI, 0ST; 32180T, 22776F cmdbsvr_cfgsave 16016 R 49.5 1.1 0 cmdbsvr_iprope 16015 R 49.5 0.9 0
Once cmdbsvr_cfgsave has completed its task, cmdbsvr_writeco and cmdbsvr_checksu start to perform some tasks and consume CPU. There will be three cmdbsvr child processes, cmdbsvr_writeco, cmdbsvr_checksu, and cmdbsvr_iprope, running in parallel and performing their tasks, sharing an equal amount of CPU usage.
Run Time: 3 days, 20 hours and 17 minutes 3U, 0N, 1S, 96I, 0WA, 0HI, 0SI, 0ST; 32180T, 22561F cmdbsvr_checksu 16037 R 33.6 0.0 0 cmdbsvr_iprope 16015 R 32.6 2.5 0 cmdbsvr_writeco 16036 R 32.6 0.1 0
Run Time: 3 days, 20 hours and 17 minutes 3U, 0N, 0S, 97I, 0WA, 0HI, 0SI, 0ST; 32180T, 22545F cmdbsvr_iprope 16015 R 34.0 2.5 0 cmdbsvr_writeco 16036 R 34.0 0.3 0 cmdbsvr_checksu 16037 R 32.0 0.2 0
Following is an example of snippets showing system time, CPU0 usage, and 'diagnose sys top' output periodically when a config change is performed in a FortiGate 6301F model having about 24K policies in one VDOM.
A change was made at 09:18:13, and CPU0 started going high and normalized at 09:21:04 when it finished loading the policies.
System time: Fri Jul 11 09:18:14 2025
CPU0 states: 83% user 16% system 0% nice 1% idle 0% iowait 0% irq 0% softirq
FortiGate(global) # diagnose sys top 1 10 3 Run Time: 3 days, 20 hours and 16 minutes 5U, 0N, 1S, 94I, 0WA, 0HI, 0SI, 0ST; 32180T, 22782F cmdbsvr_cfgsave 16016 R 50.0 1.1 0 cmdbsvr_iprope 16015 R 50.0 0.8 0
System time: Fri Jul 11 09:18:27 2025 CPU0 states: 85% user 14% system 0% nice 1% idle 0% iowait 0% irq 0% softirq
FortiGate(global) # diagnose sys top 1 10 3 Run Time: 3 days, 20 hours and 16 minutes 3U, 0N, 0S, 97I, 0WA, 0HI, 0SI, 0ST; 32180T, 22709F cmdbsvr_cfgsave 16016 R 50.2 1.7 0 cmdbsvr_iprope 16015 R 49.7 1.7 0
System time: Fri Jul 11 09:18:40 2025 CPU0 states: 81% user 19% system 0% nice 0% idle 0% iowait 0% irq 0% softirq
FortiGate(global) # diagnose sys top 1 10 3 Run Time: 3 days, 20 hours and 16 minutes 3U, 0N, 1S, 96I, 0WA, 0HI, 0SI, 0ST; 32180T, 22561F cmdbsvr_iprope 16015 R 39.5 2.5 0 cmdbsvr_cfgsave 16016 S 20.0 1.7 0 cmdbsvr_writeco 16036 R 19.5 0.0 0 cmdbsvr_checksu 16037 R 19.0 0.0 0
System time: Fri Jul 11 09:18:53 2025 CPU0 states: 76% user 23% system 0% nice 1% idle 0% iowait 0% irq 0% softirq FortiGate(global) # diagnose sys top 1 10 3 Run Time: 3 days, 20 hours and 17 minutes 5U, 0N, 1S, 94I, 0WA, 0HI, 0SI, 0ST; 32180T, 22547F miglogd 342 S 76.1 2.0 22 cmdbsvr_iprope 16015 R 33.3 2.5 0 cmdbsvr_checksu 16037 R 33.3 0.2 0 cmdbsvr_writeco 16036 R 32.3 0.3 0
System time: Fri Jul 11 09:19:06 2025 CPU0 states: 81% user 18% system 0% nice 1% idle 0% iowait 0% irq 0% softirq FortiGate(global) # diagnose sys top 1 10 3 Run Time: 3 days, 20 hours and 17 minutes 3U, 0N, 0S, 97I, 0WA, 0HI, 0SI, 0ST; 32180T, 22536F cmdbsvr_iprope 16015 R 33.5 2.5 0 cmdbsvr_writeco 16036 R 33.0 0.6 0 cmdbsvr_checksu 16037 R 33.0 0.3 0
System time: Fri Jul 11 09:19:19 2025 CPU0 states: 78% user 21% system 0% nice 1% idle 0% iowait 0% irq 0% softirq FortiGate(global) # diagnose sys top 1 10 3 Run Time: 3 days, 20 hours and 17 minutes 3U, 0N, 0S, 97I, 0WA, 0HI, 0SI, 0ST; 32180T, 22525F cmdbsvr_iprope 16015 R 33.5 2.5 0 cmdbsvr_writeco 16036 R 33.0 0.8 0 cmdbsvr_checksu 16037 R 33.0 0.5 0
System time: Fri Jul 11 09:19:32 2025 CPU0 states: 83% user 16% system 0% nice 1% idle 0% iowait 0% irq 0% softirq FortiGate(global) # diagnose sys top 1 10 3 Run Time: 3 days, 20 hours and 17 minutes 3U, 0N, 0S, 97I, 0WA, 0HI, 0SI, 0ST; 32180T, 22514F cmdbsvr_iprope 16015 R 33.5 2.5 0 cmdbsvr_checksu 16037 R 33.5 0.7 0 cmdbsvr_writeco 16036 R 33.0 1.1 0
System time: Fri Jul 11 09:19:45 2025 CPU0 states: 81% user 18% system 0% nice 1% idle 0% iowait 0% irq 0% softirq FortiGate(global) # diagnose sys top 1 10 3 Run Time: 3 days, 20 hours and 18 minutes 3U, 0N, 0S, 97I, 0WA, 0HI, 0SI, 0ST; 32180T, 22504F cmdbsvr_iprope 16015 R 33.5 2.5 0 cmdbsvr_writeco 16036 R 33.5 1.4 0 cmdbsvr_checksu 16037 R 33.0 0.9 0
System time: Fri Jul 11 09:19:59 2025 CPU0 states: 80% user 19% system 0% nice 1% idle 0% iowait 0% irq 0% softirq FortiGate(global) # diagnose sys top 1 10 3 Run Time: 3 days, 20 hours and 18 minutes 3U, 0N, 0S, 97I, 0WA, 0HI, 0SI, 0ST; 32180T, 22509F cmdbsvr_iprope 16015 R 33.8 2.5 0 cmdbsvr_checksu 16037 R 33.8 1.0 0
System time: Fri Jul 11 09:20:12 2025 CPU0 states: 76% user 23% system 0% nice 1% idle 0% iowait 0% irq 0% softirq FortiGate(global) # diagnose sys top 1 10 3 Run Time: 3 days, 20 hours and 18 minutes 3U, 0N, 0S, 97I, 0WA, 0HI, 0SI, 0ST; 32180T, 22495F cmdbsvr_iprope 16015 R 50.0 2.5 0 cmdbsvr_checksu 16037 R 50.0 1.3 0
System time: Fri Jul 11 09:20:25 2025 CPU0 states: 77% user 22% system 0% nice 1% idle 0% iowait 0% irq 0% softirq FortiGate(global) # diagnose sys top 1 10 3 Run Time: 3 days, 20 hours and 18 minutes 3U, 0N, 1S, 96I, 0WA, 0HI, 0SI, 0ST; 32180T, 22483F cmdbsvr_iprope 16015 R 49.7 2.5 0 cmdbsvr_checksu 16037 R 49.2 1.6 0
System time: Fri Jul 11 09:20:38 2025 CPU0 states: 83% user 16% system 0% nice 1% idle 0% iowait 0% irq 0% softirq FortiGate (global) # diagnose sys top 1 10 3 Run Time: 3 days, 20 hours and 18 minutes 3U, 0N, 0S, 97I, 0WA, 0HI, 0SI, 0ST; 32180T, 22475F cmdbsvr_iprope 16015 R 99.9 2.5 0
System time: Fri Jul 11 09:20:51 2025 CPU0 states: 78% user 22% system 0% nice 0% idle 0% iowait 0% irq 0% softirq FortiGate(global) # diagnose sys top 1 10 3 Run Time: 3 days, 20 hours and 19 minutes 3U, 0N, 0S, 97I, 0WA, 0HI, 0SI, 0ST; 32180T, 22457F cmdbsvr_iprope 16015 R 99.5 2.5 0
System time: Fri Jul 11 09:21:04 2025 CPU0 states: 0% user 0% system 0% nice 100% idle 0% iowait 0% irq 0% softirq
FortiGate(global) # diagnose sys top 1 10 3 Run Time: 3 days, 20 hours and 19 minutes 0U, 0N, 0S, 100I, 0WA, 0HI, 0SI, 0ST; 32180T, 22828F forticron 331 S 0.4 0.2 12 newcli 16121 S 0.4 0.0 5
Suppose there is some pass-through traffic of around 200 to 400 Mbps and that is being processed by CPU0, then the pass-through traffic is not affected when cmdbsvr is utilizing high CPU for a long time. Traffic is given priority. For example, if softirq is 50% in CPU0, then the cmdbsvr process can utilize the remaining 50%.
In this case, a similar policy change will require about 4.5 minutes to load in the kernel because earlier it had 100% idle CPU, so cmdbsvr can utilize the whole 100% which took 3 minutes, and now it has only 50% idle CPU to utilize, so the policy compilation in the kernel will take extra time.
Also, this will not affect the routing daemons bfdd, bgpd, and ospfd because these processes are excluded from CPU0 in normal FortiGates; however, in the FortiGate-6301F chassis model, bfdd, bgpd, and ospfd are excluded from CPU0 only in MBD and not excluded in FPC cards. Even in this case, the routing daemons can run in less-utilized CPU's so a config change on CPU0 will not affect the routing daemons in FortiGate-6301F as well.
Here is an example of 200Mbps being handled by CPU0, and at the same time, a config change was performed. 12% is utilized by softirq for traffic processing, and the remaining is utilized for policy change.
System time: Fri Jul 11 09:33:17 2025 CPU0 states: 71% user 16% system 0% nice 1% idle 0% iowait 0% irq 12% softirq Average network usage: 237887 / 239198 kbps in 1 minute, 69484 / 70225 kbps in 10 minutes, 23248 / 23470 kbps in 30 minutes
FortiGate(global) # diagnose sys top 1 10 3 Run Time: 3 days, 20 hours and 31 minutes 2U, 0N, 1S, 97I, 0WA, 0HI, 0SI, 0ST; 32180T, 22500F cmdbsvr_iprope 16402 R 49.7 2.5 0 cmdbsvr_checksu 16422 R 49.2 1.4 0
System time: Fri Jul 11 09:33:30 2025 CPU0 states: 63% user 26% system 0% nice 0% idle 0% iowait 0% irq 11% softirq Average network usage: 260490 / 261894 kbps in 1 minute, 83353 / 84149 kbps in 10 minutes, 37117 / 37400 kbps in 30 minutes FortiGate(global) # diagnose sys top 1 10 3 Run Time: 3 days, 20 hours and 31 minutes 3U, 0N, 0S, 97I, 0WA, 0HI, 0SI, 0ST; 32180T, 22497F cmdbsvr_iprope 16402 R 99.5 2.5 0
So cmdbsvr process running high CPU in CPU0 for a long time will not affect the pass-through traffic processed by the same CPU and will not affect the routing daemons.
If the user wants to reduce the policy load time, then it is possible to change the cmdbsvr-affinity setting from CPU0 to all cores. For example, if a FortiGate unit has 24 cores and then below is the setting to use all 24 cores for the cmdbsvr process. The CPU affinity value can be calculated using this link (https://bitsum.com/tools/cpu-affinity-calculator/) based on the number of CPUs present in a FortiGate and which CPU you want to allow for the cmdbsvr process to run.
FortiGate(global) # config sys global FortiGate(global) # set cmdbsvr-affinity 0xffffff FortiGate(global) # end
FortiGate(global) # diagnose sys process pidof cmdbsvr 311
FortiGate(global) # fnsysctl cat /proc/311/status Name: cmdbsvr State: S (sleeping) Tgid: 311 Pid: 311 PPid: 1 TracerPid: 0 Uid: 0 0 0 0 Gid: 0 0 0 0 FDSize: 64 Groups: VmPeak: 1479768 kB VmSize: 1479512 kB VmLck: 0 kB VmPin: 0 kB VmHWM: 676900 kB VmRSS: 672588 kB VmData: 648120 kB VmStk: 284 kB VmExe: 42820 kB VmLib: 23336 kB VmPTE: 1644 kB VmSwap: 0 kB Threads: 1 SigQ: 0/128436 SigPnd: 0000000000000000 ShdPnd: 0000000000000000 SigBlk: 0000000000000000 SigIgn: 0000000000001200 SigCgt: 000000018003e4e8 CapInh: 0000000000000000 CapPrm: ffffffffffffffff CapEff: ffffffffffffffff CapBnd: ffffffffffffffff Cpus_allowed: ffffff Cpus_allowed_list: 0-23 voluntary_ctxt_switches: 2564966 nonvoluntary_ctxt_switches: 14198
In this case, the cmdbsvr child processes will run in parallel on different cores, and it can utilize the idle CPU from different cores at the same and this will bring down the load time for policy installation in the kernel. Technically, with a simple policy change in a VDOM having 24K policies, it will take approximately 1.5 minutes compared to the previous test, which took 3 minutes to complete.
With cmdbsvr-affinity set to 0xffffff and a policy change at 09:35:40 took approximately 1 minute and 20 minutes and finished around 09:36:59. Notice the child process running on cores 14, 8, 2, etc.
System time: Fri Jul 11 09:35:40 2025 FortiGate(global) # diagnose sys top 1 10 3 Run Time: 3 days, 20 hours and 33 minutes 8U, 0N, 1S, 91I, 0WA, 0HI, 0SI, 0ST; 32180T, 22808F cmdbsvr_cfgsave 16550 R 99.9 0.5 14 cmdbsvr_iprope 16549 R 99.9 0.4 8
System time: Fri Jul 11 09:35:54 2025 FortiGate(global) # diagnose sys top 1 10 3 Run Time: 3 days, 20 hours and 34 minutes 7U, 0N, 1S, 92I, 0WA, 0HI, 0SI, 0ST; 32180T, 22661F cmdbsvr_iprope 16549 R 99.9 2.1 14 cmdbsvr_cfgsave 16550 R 99.9 1.7 2
System time: Fri Jul 11 09:36:07 2025
FortiGate(global) # diagnose sys top 1 10 3 Run Time: 3 days, 20 hours and 34 minutes 16U, 0N, 4S, 80I, 0WA, 0HI, 0SI, 0ST; 32180T, 22541F cmdbsvr_iprope 16549 R 99.9 2.5 18 miglogd 342 R 99.9 2.0 22 dnsproxy 808 R 99.9 1.8 10 cmdbsvr_writeco 16563 R 99.9 0.6 8 cmdbsvr_checksu 16564 R 99.9 0.4 0
System time: Fri Jul 11 09:36:20 2025 FortiGate(global) # diagnose sys top 1 10 3 Run Time: 3 days, 20 hours and 34 minutes 9U, 0N, 2S, 89I, 0WA, 0HI, 0SI, 0ST; 32180T, 22512F cmdbsvr_iprope 16549 R 99.9 2.5 9 cmdbsvr_writeco 16563 R 99.9 1.4 15 cmdbsvr_checksu 16564 R 99.9 0.8 0
System time: Fri Jul 11 09:36:33 2025 FortiGate(global) # diagnose sys top 1 10 3 Run Time: 3 days, 20 hours and 34 minutes 6U, 0N, 1S, 93I, 0WA, 0HI, 0SI, 0ST; 32180T, 22500F cmdbsvr_iprope 16549 R 99.5 2.5 13 cmdbsvr_checksu 16564 R 99.5 1.3 1
System time: Fri Jul 11 09:36:46 2025 FortiGate(global) # diagnose sys top 1 10 3 Run Time: 3 days, 20 hours and 35 minutes 3U, 0N, 0S, 97I, 0WA, 0HI, 0SI, 0ST; 32180T, 22488F cmdbsvr_iprope 16549 R 99.9 2.5 22
System time: Fri Jul 11 09:36:59 2025 FortiGate(global) # diagnose sys top 1 10 3 Run Time: 3 days, 20 hours and 35 minutes 3U, 0N, 0S, 97I, 0WA, 0HI, 0SI, 0ST; 32180T, 22471F cmdbsvr_iprope 16549 R 99.9 2.5 22
The cmdbsvr-affinity setting is present from FortiOS 7.0 version. Behavior in FortiOS 6.4 version is different as it does not have cmdb child processes, and all these tasks will be done only by the cmdbsvr process, which is allowed to run on any cores, and the policy load time will still be the same in all these versions. There is no change in the policy load time between these versions.
|