Technical Tip: Rx_Over_Errors on a network interface

Adam_Shortt_FTNT · ‎10-02-2013

Description

This article describes Rx_Over_Errors, which network interfaces may report in an environment with a significant volume of traffic and UTM enabled:

diag hardware device nic port3
...
Rx_Packets            301963122
Rx_Packets_Dropped        0
Tx_Packets            194087928
Rx_Bytes            1647625218
Tx_Bytes            1402278946
Rx_Errors            0
Tx_errors            0
Rx_Over_Errors            119680

Scope

Generally encountered on FortiGates with significant traffic and UTM scanning. Affects FortiGate A-series more than subsequent models.

The root cause of this issue is high traffic throughput combined with excessive UTM scanning.

Solution

These errors will likely be incrementing. You can confirm by running the 'diag hardware device nic <port>' command multiple times over a 10-minute period.
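The comparison between two runs can be scripted. The following is a minimal Python sketch, assuming the command output has been captured as text (for example from a saved console session). The helper names `read_counter` and `counter_delta` are hypothetical, and the second reading is invented for illustration only.

```python
import re


def read_counter(nic_output: str, name: str = "Rx_Over_Errors") -> int:
    """Return one counter value from 'diag hardware device nic <port>' output."""
    match = re.search(rf"^\s*{re.escape(name)}\s+(\d+)\s*$", nic_output, re.MULTILINE)
    if match is None:
        raise ValueError(f"counter {name!r} not found in output")
    return int(match.group(1))


def counter_delta(earlier: str, later: str, name: str = "Rx_Over_Errors") -> int:
    """Return how much the counter grew between two captured outputs."""
    return read_counter(later, name) - read_counter(earlier, name)


# First sample: values taken from the output shown in this article.
first = """Rx_Errors            0
Tx_errors            0
Rx_Over_Errors            119680
"""
# Second sample: hypothetical reading taken some minutes later.
second = first.replace("119680", "121004")

print(counter_delta(first, second))
```

A delta greater than zero between runs indicates the counter is still incrementing.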

Run the following commands to help verify the problem (preferably run them a few times, during peak traffic periods):

get sys perf stat
diag hard sysinfo interrupts

Inspect the results of the first command. CPU3 will likely show consistently high usage, in particular system-level usage (packet processing):

Fortigate3810A_a # get sys perf stat
CPU states: 25% user 48% system 0% nice 27% idle
CPU0 states: 30% user 33% system 0% nice 37% idle
CPU1 states: 25% user 38% system 0% nice 37% idle
CPU2 states: 36% user 31% system 0% nice 33% idle
CPU3 states: 10% user 89% system 0% nice 1% idle
Memory states: 32% used
Average network usage: 361915 kbps in 1 minute, 439674 kbps in 10 minutes, 448930 kbps in 30 minutes
Average sessions: 41023 sessions in 1 minute, 40926 sessions in 10 minutes, 41217 sessions in 30 minutes
Average session setup rate: 830 sessions per second in last 1 minute, 805 sessions per second in last 10 minutes, 801 sessions per second in last 30 minutes
Virus caught: 0 total in 1 minute
IPS attacks blocked: 0 total in 1 minute
Uptime: 0 days, 4 hours, 31 minutes
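When several samples are collected during peak periods, picking out the overloaded core can be automated. The following is a rough Python sketch that parses the per-CPU lines of the output above. The function name `busy_cores` and the 80% threshold are assumptions chosen for illustration, not values recommended by Fortinet.

```python
import re

# Matches per-core lines such as "CPU3 states: 10% user 89% system 0% nice 1% idle".
# The aggregate "CPU states:" line has no core number and is therefore skipped.
CPU_LINE = re.compile(
    r"^CPU(\d+) states: (\d+)% user (\d+)% system (\d+)% nice (\d+)% idle$",
    re.MULTILINE,
)


def busy_cores(perf_output: str, system_threshold: int = 80) -> list[int]:
    """Return the indexes of cores whose system-level usage meets the threshold."""
    return [
        int(cpu)
        for cpu, _user, system, _nice, _idle in CPU_LINE.findall(perf_output)
        if int(system) >= system_threshold
    ]


sample = """CPU states: 25% user 48% system 0% nice 27% idle
CPU0 states: 30% user 33% system 0% nice 37% idle
CPU1 states: 25% user 38% system 0% nice 37% idle
CPU2 states: 36% user 31% system 0% nice 33% idle
CPU3 states: 10% user 89% system 0% nice 1% idle
"""

print(busy_cores(sample))  # [3]
```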

On A-series units in particular, the architecture means that most of the traffic is generally handled by a single processor, with UTM functions offloaded to the ASICs and user-level processes handled by the remaining CPUs.

To further confirm whether CPU3 is overloaded, check how the interrupts are distributed across the CPUs:

Fortigate3810A_a # diag hard sysinfo interrupts
           CPU0       CPU1       CPU2       CPU3
0:        154      41846       1649    1667365    IO-APIC-edge timer
2:          0          0          0          0          XT-PIC cascade
3:          0       2226        173     106253    IO-APIC-edge serial
4:          0         14          1        364    IO-APIC-edge serial
5:          0          0          0          0   IO-APIC-level libata
7:          0          0          0          0    IO-APIC-edge LCD_KEYPAD
8:          0          0          0          0    IO-APIC-edge rtc
10:          0        671         35      42454   IO-APIC-level usb-ohci, usb-ohci
14:          1      25408         57      12388    IO-APIC-edge ide0
18:          5      29898       2431     709296   IO-APIC-level ipsec0
22:          6      14733       1522     705930   IO-APIC-level iscp1a0
24:        257    1586202     120512   82936562   IO-APIC-level port1
26:        226    1866211     124607   73492419   IO-APIC-level port3
27:          0          0          0         45   IO-APIC-level port4
28:        108     571368      51371   17234374   IO-APIC-level port5, port6
29:         70     397791      32401   12079801   IO-APIC-level port7, port8
NMI:    1710965    1710965    1710965    1710965
LOC:    1710962    1710957    1710957    1710603
ERR:          0
MIS:          0

In the above output, the number of interrupts handled by CPU3 is far higher than for the other CPUs, especially for port1 and port3 (the LAN and WAN ports). Combined with the previous evidence, this indicates that CPU3 is overloaded.
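To quantify the imbalance, the interrupt table can be parsed and the share handled by the busiest CPU computed per IRQ. This is a minimal Python sketch assuming the column layout shown above. The helper names `parse_interrupts` and `dominant_cpu` are hypothetical.

```python
def parse_interrupts(output: str) -> dict[int, tuple[str, list[int]]]:
    """Map each numbered IRQ to (device label, per-CPU interrupt counts)."""
    lines = output.strip().splitlines()
    cpu_count = len(lines[0].split())  # header row: CPU0 CPU1 ...
    table = {}
    for line in lines[1:]:
        irq, _, rest = line.partition(":")
        fields = rest.split()
        # Skip summary rows (NMI, LOC, ERR, MIS) and anything without a label.
        if not irq.strip().isdigit() or len(fields) <= cpu_count:
            continue
        counts = [int(value) for value in fields[:cpu_count]]
        # fields[cpu_count] is the controller type (e.g. IO-APIC-level).
        label = " ".join(fields[cpu_count + 1:])
        table[int(irq)] = (label, counts)
    return table


def dominant_cpu(counts: list[int]) -> tuple[int, float]:
    """Return (index of the busiest CPU, its fraction of all interrupts)."""
    total = sum(counts)
    top = max(range(len(counts)), key=counts.__getitem__)
    return top, (counts[top] / total if total else 0.0)


sample = """
           CPU0       CPU1       CPU2       CPU3
24:        257    1586202     120512   82936562   IO-APIC-level port1
26:        226    1866211     124607   73492419   IO-APIC-level port3
NMI:    1710965    1710965    1710965    1710965
"""

for irq, (label, counts) in parse_interrupts(sample).items():
    cpu, share = dominant_cpu(counts)
    print(f"IRQ {irq} ({label}): CPU{cpu} handles {share:.1%}")
```

Because these counters are cumulative since boot, the shares describe the long-run distribution rather than the current moment.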

A few actions can reduce the load on this CPU. The actions listed below have varying levels of impact:

Potential Solutions	Benefit	Commands
Disable IPS engines	Significant benefit. Stops a variety of UTM scanning from taking place, including application control, IPS and any flow-based UTM features (default for all is proxy-based). This frees up CPU cycles.	To stop the IPS engines: 'diag test app ipsmonitor 98'. To enable: 'diag test app ipsmonitor 99'. Should be performed during a maintenance window, ideally under the supervision of Fortinet Tech Support. This setting is not retained after a reboot. In an HA cluster, run it on both units to ensure it is disabled on both.
Reduce or remove UTM scanning	Significant benefit, depending on how much UTM was in place before and how much has been removed or tuned back. Frees up more CPU cycles for core packet processing.	Performed manually by adjusting or removing UTM from policies.
Check and increase proxyworker count	Minimal benefit. Should balance UTM demand across cores.	Set the value to the number of CPU cores: 'conf sys global', then 'set proxy-worker-count <value>', then 'end'.
Reduce throughput	Minimal to significant benefit.
Configure high availability with active-active	Minimal to significant benefit, depending on how much UTM is configured. The more UTM, the larger the positive impact. If traffic can be segmented, it is more efficient to segment the traffic to separate units.
Upgrade hardware	Significant benefit. Will resolve the issue if the unit is sized appropriately for current and future traffic.

Manual Balancing of IRQs:
In some extreme cases, the IRQs may need to be balanced manually if many of the requests rely on a particular CPU. In the above 'diag hard sysinfo interrupts' output, port1 and port3 are both taxing CPU3 extensively, and 'get sys perf stat' confirms high CPU use on CPU3. These interrupts can be balanced manually by placing port1's load on CPU0 and port3's load on CPU2.

NOTE: A maintenance window is recommended, as this is a significant change to how traffic is managed on the FortiGate. These changes are lost on reboot and must be re-entered afterwards. These commands are intended to relieve IRQ load as an interim measure until properly sized equipment is procured.

Fortigate3810A_a # diag hard sysinfo interrupts

           CPU0       CPU1       CPU2       CPU3
...
24:        257    1586202     120512   82936562   IO-APIC-level port1
26:        226    1866211     124607   73492419   IO-APIC-level port3

The far left column is the IRQ ID. The last argument is the CPU number counted from one, so CPU0 is 1 and CPU3 is 4. To balance IRQ 24 (port1) to CPU0:

diag sys cpuset interrupt 24 1

And finally to balance port3 to CPU2:

diag sys cpuset interrupt 26 3
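Since the CPU argument is offset by one from the CPU label, it is easy to mistype. The following small Python sketch generates the commands from a plan keyed by IRQ, using the numbering described in this article (CPU0 is 1, CPU3 is 4). The function name `cpuset_command` and the `plan` dictionary are hypothetical, and the sketch only builds command strings. It does not connect to or change any device.

```python
def cpuset_command(irq: int, cpu_index: int) -> str:
    """Build the command that steers one IRQ to one CPU.

    cpu_index is the zero-based label from the interrupt table (CPU0 -> 0).
    Per this article, the command expects a one-based value (CPU0 -> 1).
    """
    if irq < 0 or cpu_index < 0:
        raise ValueError("irq and cpu_index must be non-negative")
    return f"diag sys cpuset interrupt {irq} {cpu_index + 1}"


# Plan from this article: port1 (IRQ 24) to CPU0, port3 (IRQ 26) to CPU2.
plan = {24: 0, 26: 2}

for irq, cpu in plan.items():
    print(cpuset_command(irq, cpu))
```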

If significant traffic is flowing through the unit, a gradual shift of IRQ values should be noticeable after a few minutes when 'diag hard sysinfo interrupts' is run again. CPU use on the overtaxed CPU cores should also drop; confirm this using 'get sys perf stat'.
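Because the interrupt counters are cumulative, the shift is easier to see in the difference between two snapshots than in the raw totals. A minimal Python sketch of that comparison follows. The function name `share_by_cpu` is hypothetical, and the "after" values are invented for illustration, not measured on a device.

```python
def share_by_cpu(before: list[int], after: list[int]) -> list[float]:
    """Return the fraction of new interrupts (after - before) handled by each CPU."""
    deltas = [later - earlier for earlier, later in zip(before, after)]
    total = sum(deltas)
    return [delta / total if total else 0.0 for delta in deltas]


# port1 (IRQ 24) counts per CPU. "before" is taken from this article.
port1_before = [257, 1586202, 120512, 82936562]
# Hypothetical snapshot a few minutes after moving IRQ 24 to CPU0.
port1_after = [900257, 1596202, 120512, 83026562]

for cpu, share in enumerate(share_by_cpu(port1_before, port1_after)):
    print(f"CPU{cpu}: {share:.0%} of new port1 interrupts")
```

In this invented example most of the new interrupts land on CPU0, even though the cumulative total for CPU3 remains far larger.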

In some cases, the only option available is a hardware upgrade, which should fully resolve the issue, if sized appropriately.

Note:

On newer platforms, Rx_Over_Errors counts the number of packets whose size exceeds 1518 bytes. It is not related to packet drops.
If further guidance is needed, contact a Fortinet sales representative or Fortinet Technical Support.
