Hello. I have some test switches (248E-FPOE). I put them into standalone mode (factory reset and then reconfigured). Some of the switches are on 7.2.1 and others on 7.2.4. The switches are connected to each other, and only each other. There are no redundant links.
The problem is that about every three or so hours the switches' CPU load spikes to ~90% for 20 or 30 minutes and then goes back down. I tried disconnecting one of the switches from the others, and it still does this. Looking at the logs, the CPU just seems to spike, and then high-CPU-utilization log entries get generated.
I'm not sure what might be causing this. I'll look into one more thing tomorrow, but I have my doubts. I was curious if anyone has run into anything similar.
**Note:** I am only having this issue after putting the switches into standalone mode.
Check for network loops or excessive broadcast storms within your network. These issues can lead to increased CPU utilization as the switches try to handle the excessive traffic. Ensure that there are no redundant links or misconfigured spanning tree settings that could cause loops.
Use network monitoring tools or packet capture utilities to analyze the network traffic passing through the switches during the CPU spikes. Look for any unusual or excessive traffic patterns that could be causing the high CPU utilization. Identify the source of the traffic and investigate if it's normal or requires further troubleshooting. Regards, Shilpa C P
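If your build supports the on-box sniffer, you can capture directly on the switch during a spike instead of setting up a SPAN port. A minimal sketch, assuming the FortiOS-style `diagnose sniffer packet` command is also available on your FortiSwitchOS version (verify on your build, and substitute the interface you want to watch):

```
# capture all traffic on the internal interface, verbosity 4 (headers + interface)
diagnose sniffer packet internal 'none' 4
```

Press Ctrl+C to stop the capture; if the spikes were traffic-driven, you would expect to see a flood of broadcasts or repeated frames here during the high-CPU window.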
Currently these switches are only connected to each other. I am certain there are no redundant links or loops. I would like to note that these spikes happen every four hours and last roughly 30 minutes. I have attached an image of the CPU performance from the GUI. Also note that the traffic (bandwidth) is stable throughout each spike; the bandwidth being low at the beginning was because I had temporarily disabled the ports. Note the CPU relative to the bandwidth in the attachment.
I conducted a packet capture and there isn't much. Mostly just LLDP broadcasts roughly every three seconds. There were a few BOOTP packets coming from the internal interfaces, since they are set to use DHCP but have no IP address (there is no DHCP server).
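Since there is no DHCP server, those BOOTP requests can likely be silenced by switching the internal interface to a static address mode. A minimal sketch, assuming the usual FortiSwitchOS interface syntax (check the exact interface name and options on your build):

```
config system interface
    edit internal
        set mode static
    next
end
```

This only removes noise from the capture; it should not be related to the CPU spikes themselves.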
After doing this I thought to use some commands to see the performance. The data here was interesting.
When CPU usage was 'normal':
```
SW2 # get system performance stat
CPU states: 7% user 33% system 0% nice 60% idle
Memory states: 48% used
Uptime: 14 days, 2 hours, 3 minutes
```
```
SW2 # get system performance top
Run Time: 14 days, 2 hours and 2 minutes
8U, 28S, 64I; 487T, 231F
lldpmedd         1091   S     3.7   1.9
alertd           1038   S     3.1   1.5
ctrld            1081   S     2.7   1.6
stpd             1082   S     2.3   1.8
fortilinkd       1099   S     1.9   1.7
l2d              1086   S     0.5   1.6
lpgd             1083   S     0.5   1.6
dmid             1092   S     0.5   1.5
newcli            590   R     0.5   1.5
poed             1057   S     0.3   1.4
sshd              562   S     0.1   1.8
acld             1044   S     0.1   1.6
l2dbg            1087   S     0.1   1.6
pyfcgid          1025   S N   0.0   8.4
cmdbsvr           939   S     0.0   2.7
cu_swtpd         1097   S     0.0   2.3
httpsd           1028   S N   0.0   2.3
initXXXXXXXXXXX     1   S     0.0   2.2
newcli            563   S     0.0   2.1
httpsd           1248   S N   0.0   2.1
```
When CPU usage was spiking:
```
SW2 # get system performance stat
CPU states: 13% user 55% system 0% nice 32% idle
Memory states: 48% used
Uptime: 14 days, 2 hours, 33 minutes
```
```
SW2 # get system performance top
Run Time: 14 days, 2 hours and 33 minutes
10U, 59S, 31I; 487T, 232F
snmpd            1589   S    31.5   2.0
lldpmedd         1091   S     3.7   1.9
ctrld            1081   S     2.9   1.6
fortilinkd       1099   S     2.7   1.7
alertd           1038   S     2.7   1.5
stpd             1082   S     1.1   1.8
l2dbg            1087   S     0.9   1.6
newcli           1440   R     0.9   1.5
cu_swtpd         1097   S     0.3   2.3
l2d              1086   S     0.3   1.6
initXXXXXXXXXXX     1   S     0.1   2.2
ipconflictd      1088   S     0.1   2.0
igmpsnoopingd    1042   S N   0.1   1.9
sshd             1390   S     0.1   1.8
lpgd             1083   S     0.1   1.6
lfgd             1036   S     0.1   1.5
dmid             1092   S     0.1   1.5
poed             1057   S     0.1   1.4
pyfcgid          1025   S N   0.0   8.4
cmdbsvr           939   S     0.0   2.7
```
I would like to test disabling SNMP to see if that fixes it, or at least disabling the agent. I will most likely disable the agent later today.
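To test that theory, the SNMP agent can be toggled off from the CLI. A minimal sketch, assuming the usual FortiSwitchOS SNMP configuration tree (verify the exact path on your firmware version):

```
config system snmp sysinfo
    set status disable
end
```

If the spikes stop while the agent is disabled, that points squarely at `snmpd`; it can be re-enabled afterwards with `set status enable`.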
Also, these switches are standalone, so I am unsure why the fortilink process shows up in the top output.
I would also like to ask: when I am in the GUI, the 'nice' CPU state sometimes increases to over 20%, making overall CPU usage quite high. I believe this usage falls under the 'pyfcgid' process, but I am unsure. Is this normal?
Here's an example of this:
```
CPU states: 13% user 51% system 35% nice 1% idle
Memory states: 48% used
Uptime: 14 days, 2 hours, 15 minutes

Run Time: 14 days, 2 hours and 13 minutes
17U, 70S, 13I; 487T, 229F
snmpd            1125   R    27.1   2.1
pyfcgid          1082   S N  24.4   9.1
lldpmedd         1149   S     3.5   2.0
```
May I ask what further troubleshooting steps I could take? Upgrading to 7.4.0 seemed to have corrected the issue on three of the five switches I am testing with.
I pasted the configs into these switches (a couple hundred lines at a time) from a switch that was managed by a FortiGate. The switches I am testing on are standalone. When pasting in the configs, I did remove the line to enable FortiLink and also disabled auto-network. Could pasting this config be part of the issue?