Description
In an environment with a significant volume of traffic and UTM enabled, the network interfaces handling the traffic may experience Rx_Over_Errors:
diag hardware device nic port3
...
Rx_Packets 301963122
Rx_Packets_Dropped 0
Tx_Packets 194087928
Rx_Bytes 1647625218
Tx_Bytes 1402278946
Rx_Errors 0
Tx_errors 0
Rx_Over_Errors 119680
Scope
Generally encountered on FortiGate units with significant traffic and UTM scanning. Affects FortiGate A-series more than subsequent models.
The root cause of this issue is high traffic throughput combined with excessive UTM scanning.
Solution
These errors will likely be incrementing. You can confirm by running the 'diag hardware device nic <port>' command multiple times over a 10-minute period.
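One way to confirm that the counter is incrementing is to save the command output twice and compare the values. The Python sketch below is illustrative only: the helper names (parse_nic_counters, counter_delta) are hypothetical, the second sample's value is made up for the example, and it assumes the output has been saved as plain text with one "name value" pair per line.

```python
import re


def parse_nic_counters(text):
    """Collect 'Name 12345' style lines into a dict of integer counters."""
    counters = {}
    for line in text.splitlines():
        match = re.match(r"^\s*(\w+)\s+(\d+)\s*$", line)
        if match:
            counters[match.group(1)] = int(match.group(2))
    return counters


def counter_delta(before, after, name="Rx_Over_Errors"):
    """Return how much a counter grew between two saved outputs."""
    return parse_nic_counters(after)[name] - parse_nic_counters(before)[name]


# First sample: values taken from the output shown above.
first = """Rx_Packets 301963122
Rx_Errors 0
Rx_Over_Errors 119680"""

# Second sample: HYPOTHETICAL values, standing in for a later capture.
second = """Rx_Packets 302500000
Rx_Errors 0
Rx_Over_Errors 121004"""

delta = counter_delta(first, second)
print("Rx_Over_Errors grew by", delta)
```

A delta greater than zero between captures taken a few minutes apart suggests the interface is still dropping frames on receive.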
Run the following commands to help verify the problem (preferably run them a few times, during peak traffic periods):
get sys perf stat
diag hard sysinfo interrupts
Inspect the results of the first command. CPU3 will likely show consistently high usage, in particular system-level usage (packet processing):
Fortigate3810A_a # get sys perf stat
CPU states: 25% user 48% system 0% nice 27% idle
CPU0 states: 30% user 33% system 0% nice 37% idle
CPU1 states: 25% user 38% system 0% nice 37% idle
CPU2 states: 36% user 31% system 0% nice 33% idle
CPU3 states: 10% user 89% system 0% nice 1% idle
Memory states: 32% used
Average network usage: 361915 kbps in 1 minute, 439674 kbps in 10 minutes, 448930 kbps in 30 minutes
Average sessions: 41023 sessions in 1 minute, 40926 sessions in 10 minutes, 41217 sessions in 30 minutes
Average session setup rate: 830 sessions per second in last 1 minute, 805 sessions per second in last 10 minutes, 801 sessions per second in last 30 minutes
Virus caught: 0 total in 1 minute
IPS attacks blocked: 0 total in 1 minute
Uptime: 0 days, 4 hours, 31 minutes
On A-series units in particular, the architecture means that most of the traffic is handled by a single processor, with UTM functions offloaded to the ASICs and user-level processes handled by the remaining CPUs.
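The per-CPU check described above can be sketched in Python. This is a minimal example, assuming the 'get sys perf stat' output has been saved as text in the format shown; the function names (per_cpu_states, busiest_by_system) are hypothetical and not part of any Fortinet tool.

```python
import re

# Matches per-CPU lines such as "CPU3 states: 10% user 89% system 0% nice 1% idle".
# The aggregate "CPU states:" line has no CPU number, so it is skipped.
CPU_LINE = re.compile(
    r"^CPU(\d+) states: (\d+)% user (\d+)% system (\d+)% nice (\d+)% idle"
)


def per_cpu_states(text):
    """Return {cpu_number: {"user": .., "system": .., "nice": .., "idle": ..}}."""
    states = {}
    for line in text.splitlines():
        match = CPU_LINE.match(line.strip())
        if match:
            cpu, user, system, nice, idle = map(int, match.groups())
            states[cpu] = {"user": user, "system": system, "nice": nice, "idle": idle}
    return states


def busiest_by_system(states):
    """Return the CPU number with the highest system-level usage."""
    return max(states, key=lambda cpu: states[cpu]["system"])


perf_sample = """CPU states: 25% user 48% system 0% nice 27% idle
CPU0 states: 30% user 33% system 0% nice 37% idle
CPU1 states: 25% user 38% system 0% nice 37% idle
CPU2 states: 36% user 31% system 0% nice 33% idle
CPU3 states: 10% user 89% system 0% nice 1% idle"""

states = per_cpu_states(perf_sample)
hot = busiest_by_system(states)
print("CPU%d has the highest system usage: %d%%" % (hot, states[hot]["system"]))
```

Running the check against several captures taken during peak traffic shows whether the same CPU is consistently the busiest.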
To further confirm whether CPU3 is being overloaded, check the interrupt counters:
Fortigate3810A_a # diag hard sysinfo interrupts
CPU0 CPU1 CPU2 CPU3
0: 154 41846 1649 1667365 IO-APIC-edge timer
2: 0 0 0 0 XT-PIC cascade
3: 0 2226 173 106253 IO-APIC-edge serial
4: 0 14 1 364 IO-APIC-edge serial
5: 0 0 0 0 IO-APIC-level libata
7: 0 0 0 0 IO-APIC-edge LCD_KEYPAD
8: 0 0 0 0 IO-APIC-edge rtc
10: 0 671 35 42454 IO-APIC-level usb-ohci, usb-ohci
14: 1 25408 57 12388 IO-APIC-edge ide0
18: 5 29898 2431 709296 IO-APIC-level ipsec0
22: 6 14733 1522 705930 IO-APIC-level iscp1a0
24: 257 1586202 120512 82936562 IO-APIC-level port1
26: 226 1866211 124607 73492419 IO-APIC-level port3
27: 0 0 0 45 IO-APIC-level port4
28: 108 571368 51371 17234374 IO-APIC-level port5, port6
29: 70 397791 32401 12079801 IO-APIC-level port7, port8
NMI: 1710965 1710965 1710965 1710965
LOC: 1710962 1710957 1710957 1710603
ERR: 0
MIS: 0
In the above output, the number of interrupts handled by CPU3 is far higher than for the other CPUs, especially for port1 and port3 (the LAN and WAN ports). Combined with the previous evidence, this indicates CPU3 is overloaded.
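The same comparison can be done programmatically. The Python sketch below assumes the 'diag hard sysinfo interrupts' output has been saved as text in the layout shown above; the function names and the 90% threshold are hypothetical choices for illustration, not Fortinet guidance.

```python
def parse_interrupts(text):
    """Return {irq: (device, [count_per_cpu, ...])} for numbered IRQ rows."""
    ncpu = 0
    rows = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0].startswith("CPU"):
            ncpu = len(fields)  # header row: one column per CPU
            continue
        irq = fields[0].rstrip(":")
        # Skip NMI/LOC/ERR/MIS rows and anything seen before the header.
        if not irq.isdigit() or ncpu == 0 or len(fields) < 1 + ncpu:
            continue
        counts = [int(value) for value in fields[1:1 + ncpu]]
        device = " ".join(fields[2 + ncpu:])  # text after the interrupt type
        rows[int(irq)] = (device, counts)
    return rows


def hot_irqs(rows, threshold=0.9):
    """List (irq, device, cpu, share) where one CPU handles >= threshold of an IRQ."""
    result = []
    for irq in sorted(rows):
        device, counts = rows[irq]
        total = sum(counts)
        if total == 0:
            continue
        cpu = counts.index(max(counts))
        share = counts[cpu] / total
        if share >= threshold:
            result.append((irq, device, cpu, share))
    return result


irq_sample = """CPU0 CPU1 CPU2 CPU3
2: 0 0 0 0 XT-PIC cascade
24: 257 1586202 120512 82936562 IO-APIC-level port1
26: 226 1866211 124607 73492419 IO-APIC-level port3
28: 108 571368 51371 17234374 IO-APIC-level port5, port6
NMI: 1710965 1710965 1710965 1710965
ERR: 0"""

for irq, device, cpu, share in hot_irqs(parse_interrupts(irq_sample)):
    print("IRQ %d (%s): CPU%d handles %.1f%% of interrupts" % (irq, device, cpu, share * 100))
```

Because these counters are cumulative since boot, comparing two captures gives a better picture of the current distribution than a single snapshot.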
Several actions can reduce the load on this CPU. They are listed below, with varying levels of impact:
| Potential Solutions | Benefit | Commands |
|---|---|---|
| Disable IPS engines | Stops a variety of UTM scanning from taking place, including application control, IPS, and any flow-based UTM features (default for all is proxy-based). This will free up CPU cycles. Significant benefit. | To stop the IPS engines: diag test app ipsmonitor 98. To enable: diag test app ipsmonitor 99. Should be performed during a maintenance window, ideally under supervision of Fortinet Tech Support. Note that this setting is not retained after a reboot. If using an HA cluster, it should be run on both units to ensure it is disabled on both. |
| Reduce or remove UTM scanning | Will free up more CPU cycles for core packet processing. Significant benefit, depending on how much UTM was in place before and how much has been removed or tuned back. | Performed manually by adjusting or removing UTM from policies. |
| Check and increase proxyworker count | Minimal benefit. Should balance UTM demand across cores. Increase value to match the number of CPU cores. | conf sys global, then set proxy-worker-count <value>, then end. The value should be the number of CPU cores. |
| Reduce throughput | Minimal to significant benefit. | |
| Configure high availability with active-active | Minimal to significant benefit, depending on how much UTM is configured. The more UTM, the larger the positive impact. If traffic can be segmented, it is more efficient to segment the traffic to separate units. | |
| Upgrade hardware | Significant benefit. Will resolve the issue if the unit is sized appropriately for current and future traffic. | |
Manual Balancing of IRQs:
In some extreme cases, the IRQs may need to be balanced manually if a large share of the requests lands on one particular CPU. In the above "diag hard sysinfo interrupts" output, port1 and port3 are both taxing CPU3 extensively, and "get sys perf stat" confirms the high CPU use on CPU3. These interrupts can be balanced manually by placing port1's load on CPU0 and port3's load on CPU2.
NOTE: A maintenance window is recommended, as this is a significant change to how traffic is managed on the FortiGate. These changes are lost on reboot and will need to be entered again afterwards. These commands are intended to relieve IRQ load as an interim solution until properly sized equipment is procured.
Fortigate3810A_a # diag hard sysinfo interrupts
CPU0 CPU1 CPU2 CPU3
...
24: 257 1586202 120512 82936562 IO-APIC-level port1
26: 226 1866211 124607 73492419 IO-APIC-level port3
The far left column is the IRQ ID. To balance IRQ 24 (port1) to CPU0:
diag sys cpuset interrupt 24 1
(1 is used because the CPU count starts at one: CPU0 is 1, CPU3 is 4.)
And finally, to balance IRQ 26 (port3) to CPU2:
diag sys cpuset interrupt 26 3
If significant traffic is flowing through the unit, running "diag hard sysinfo interrupts" again after a few minutes should show a gradual shift in the IRQ counts. CPU use on the overtaxed CPU cores should also drop; confirm this using "get sys perf stat".
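The off-by-one between the CPU labels in the interrupt table and the value passed to the command is easy to get wrong. The small Python sketch below only builds the command text following the numbering described in this article (CPU0 is 1, CPU3 is 4); the function name cpuset_command is hypothetical, and the numbering should be verified on the firmware in use before applying anything.

```python
def cpuset_command(irq, cpu_label):
    """Build the command text for moving an IRQ to a CPU.

    cpu_label is the zero-based number shown in the interrupt table
    (CPU0, CPU1, ...). Per this article, the command counts CPUs
    from one, so the label is shifted by one. This is an assumption
    taken from the article, not verified against every FortiOS release.
    """
    if irq < 0 or cpu_label < 0:
        raise ValueError("irq and cpu_label must be non-negative")
    return "diag sys cpuset interrupt %d %d" % (irq, cpu_label + 1)


# The two moves described above: port1 (IRQ 24) to CPU0, port3 (IRQ 26) to CPU2.
for irq, cpu in [(24, 0), (26, 2)]:
    print(cpuset_command(irq, cpu))
```

The sketch deliberately does not connect to the unit; the generated lines are meant to be reviewed and entered by hand during a maintenance window.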
In some cases, the only option available is a hardware upgrade, which should fully resolve the issue, if sized appropriately.
If further guidance is needed, please contact your Fortinet Sales Representative or Fortinet Technical Support.