30 seconds of packet loss when switch is re-connected
I have a large network, which is described in the image below. The image is simplified. In my real network there are about 40 switches like "Switch C" -- switches that are connected to both "Switch A" and "Switch B". I also have a FortiGate HA pair (Active/Passive) connected to "Switch A" and "Switch B". The FortiGates are running firmware 6.2.2.
In my network, I can ping from "Host 1" to "Host 2". If I pull the power from "Switch D", my ping from "Host 1" to "Host 2" continues to work. This is expected. However, when I re-apply the power to "Switch D", and after "Switch D" boots, pings from "Host 1" to "Host 2" become sporadic. This "packet loss" lasts for about 30 seconds. Then the pings return to normal.
Can anyone tell me (or guess) what the problem is or how I could go about debugging? I duplicated the network with a spare pair of Fortigates and 448/224 switches and I was unable to reproduce.
With 40+ switches, I am pretty sure you have central log storage for all of them. I'd check what log level is needed to catch STP port/VLAN state changes (Forwarding/Blocking) and look for them in the logs. A problem that occurs every 30 seconds each time seems more like a timer issue, especially given Spanning Tree convergence times.
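To make the timer hunch concrete, here is a rough sketch of the arithmetic, assuming the ports involved fall back to classic IEEE 802.1D Spanning Tree with its default timers (forward delay 15 s, max age 20 s). Whether these switches actually use those values is an assumption; the configured timers should be checked on the switches themselves. The function name is made up for illustration.

```python
# Classic IEEE 802.1D default timers, in seconds.
# Assumption: the switches use these defaults; verify the configured values.
FORWARD_DELAY = 15  # time spent in each of the Listening and Learning states
MAX_AGE = 20        # how long stored BPDU information is kept before it expires


def classic_stp_outage(forward_delay=FORWARD_DELAY, max_age=MAX_AGE):
    """Return (shortest, longest) seconds a port may stay non-forwarding."""
    # A port passes through Listening and then Learning before Forwarding.
    listening_plus_learning = 2 * forward_delay
    # In the worst case, stale BPDU information must first age out.
    return listening_plus_learning, max_age + listening_plus_learning


low, high = classic_stp_outage()
print(f"expected outage window: {low}-{high} s")
```

An outage of roughly 30 seconds would sit at the low end of that window, which is why the STP state changes in the logs are worth a look.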
Thank you all for your replies. A couple of comments which I should have included in my original post:
The dropped traffic between "Host 1" and "Host 2" is on the same VLAN. No routing. The FortiGate does not see the traffic (in the logs), which is what I would expect. So I doubt the firewalls are playing any direct role in the problem.
@vponmuniraj what do you mean by "route changes"? Are you talking about Layer 3? Regardless, how do I monitor "route changes" on the switches or firewalls?
@Toshi_Esumi I have two FortiGates (running 6.2.2) in HA Active-Passive. Each FortiGate is connected to both "Switch A" and "Switch B", similar to how the hosts are connected to these switches.
@Yurisk I also suspect STP or some Fortinet ISL equivalent, particularly in light of item 5 below. However, the problem does not happen regularly at 30 second intervals. When "Switch D" is connected to "Switch C", the traffic between "Host 1" and "Host 2" becomes "spotty"... and this misbehavior lasts for only about 30-45 seconds. After this time, the traffic returns to normal and there is no more trouble. The switch logs show nothing interesting. They do show STP messages when individual ports come up and down.
The problem also appears in another way. Please see the image I pasted below. If I leave all switches connected and powered up... but disconnect and reconnect each of the ISLs listed below, 1-4 (waiting one minute between each disconnect or connection), the problem (packet loss between "Host 1" and "Host 2") will also happen temporarily.
My sales engineer suggested that a "loop" is forming between switches C and D, which doesn't make sense to me.
Complicating this: I'm working in an air-gapped environment, so getting support is difficult. I tried to replicate the problem with spare FortiGates and switches, but so far I have been unsuccessful.
Does anyone know how I can diagnose/trace the "logic" that the switches go through when "Switch D" joins the topology or when ISL connections come up and down?
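One low-tech starting point, independent of the switches, is to pin down exactly which pings are lost so the loss can be lined up against the STP messages in the switch logs. Below is a minimal sketch that scans ping output for missing sequence numbers. It assumes Linux iputils-style reply lines containing "bytes from" and "icmp_seq=N"; the function name and the sample lines are made up for illustration and are not from my network.

```python
import re

# Matches the sequence number in an iputils-style ping reply line.
SEQ_RE = re.compile(r"icmp_seq=(\d+)")


def find_loss_gaps(ping_lines):
    """Return a list of (first_missing_seq, last_missing_seq) tuples."""
    gaps = []
    previous = None
    for line in ping_lines:
        # Only count real echo replies, not "Destination Host Unreachable".
        if "bytes from" not in line:
            continue
        match = SEQ_RE.search(line)
        if not match:
            continue
        seq = int(match.group(1))
        # A jump of more than one means the replies in between never arrived.
        if previous is not None and seq > previous + 1:
            gaps.append((previous + 1, seq - 1))
        previous = seq
    return gaps


# Hypothetical sample output; the address is a placeholder.
sample = [
    "64 bytes from 192.0.2.2: icmp_seq=1 ttl=64 time=0.41 ms",
    "64 bytes from 192.0.2.2: icmp_seq=2 ttl=64 time=0.39 ms",
    "From 192.0.2.1 icmp_seq=3 Destination Host Unreachable",
    "64 bytes from 192.0.2.2: icmp_seq=6 ttl=64 time=0.40 ms",
    "64 bytes from 192.0.2.2: icmp_seq=7 ttl=64 time=0.42 ms",
]
print(find_loss_gaps(sample))
```

Running ping with the `-D` option on Linux prefixes each line with a Unix timestamp, which would make it easier to match each gap to a log entry, provided the clocks on the host and the switches agree.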