Rudresh_Veerappaji
Article Id 325210
Description

This article describes how to troubleshoot issues with the Spanning Tree Protocol (STP).

Spanning Tree Protocol (STP) is a link-management protocol that maintains a loop-free Layer 2 topology. STP allows a network to have redundant paths for fault tolerance while ensuring there are no loops. When the network topology changes, for example when ports come online or go down, STP recalculates the optimal path and reconverges.

Certain scenarios could trigger multiple STP port status changes and frequent reconvergences affecting network performance. This article describes how to troubleshoot such STP issues with examples.

Scope

FortiSwitch.

Solution

Spanning Tree Protocol issues like frequent STP status flaps of ports, STP loops, frequent reconvergences, suboptimal paths, etc. typically have an underlying cause that can be traced and remediated using the recommended steps below. Some of the symptoms observed during STP issues are:

  • High usage of system resources (CPU/Memory).
  • High traffic rate.
  • Traffic loops that result in floods.
  • Slow network performance and the switch becoming inaccessible.
  • Packet drops and connectivity issues.
  • Frequent TCNs (Topology Change Notifications).

 

Step 1: Review STP configurations.

 

Begin by reviewing the current FortiSwitch configuration and the existing topology (VLANs, trunks, STP, Root/BPDU/loop guards, etc.) for any incorrect settings. Simple configuration mistakes are easily overlooked, so first check for configuration mistakes or recently made changes. FortiGate offers a real-time topology diagram of all the managed FortiSwitches in the network, which can be used to verify whether the topology is as intended. Confirm the switch ports are connected as intended at all three layers: Core, Aggregation, and Access. Verify that STP is enabled on all the FortiSwitches and on all the ports where it is expected to be configured, either with a single CLI command from the FortiGate ('FortiGate# diagnose switch-controller switch-info stp') or directly on each switch ('FortiSwitch# diagnose stp instance list').

 

Verify whether BPDU guard is enabled only on the access switch ports. A single CLI command on the FortiGate shows the BPDU guard setting on all the connected FortiSwitches ('FortiGate# diagnose switch-controller switch-info bpdu-guard-status'); alternatively, run 'FortiSwitch# diagnose bpdu-guard display status' directly on each switch. Verify whether Root Guard (if enabled) is enabled only on the ports that should not become root ports. Note that FortiSwitch supports STP, RSTP, and MSTP. For information about the supported STP features, compatibility among the protocols, configuration examples, and limitations, see FortiSwitch - Configuring STP settings.

 

Additionally, use the default built-in revision list feature on the FortiSwitch to quickly review the recent config changes on the FortiSwitch before starting with the troubleshooting.

Review recent configuration changes on the FortiSwitch.

 

Review the FortiSwitch Reference Architecture Guide to verify that the deployed design is a supported architecture and follows the suggested best practices.

 

Step 2: Review the status of system resource usage.

 

Review the CPU/memory usage to check whether it is higher than usual. High resource usage is usually a symptom of an underlying issue, so analyze what is causing it. Check whether the resource utilization (specifically the throughput on the switch) is approaching the switch performance specifications. Additionally, check the resource usage of the STP daemon (stpd) specifically.

 

FortiSwitch# fn top
Mem: 221244K used, 18228K free, 7160K shrd, 844K buff, 51704K cached
CPU: 5% usr 10% sys 0% nic 83% idle 0% io 0% irq 0% sirq
Load average: 2.00 2.03 2.06 1/100 760
PID PPID USER STAT VSZ %VSZ CPU %CPU COMMAND
832 2 0 DW< 0 0% 0 8% [L2ShadowTblThre]
1051 1 0 S 48700 20% 0 3% /bin/lldpmedd
1047 1 0 S 48808 20% 0 3% /bin/stpd <- Review the CPU/memory usage of the STP daemon. If it is among the top processes, this could indicate frequent STP reconvergences and instability.
673 2 0 RW< 0 0% 0 1% [WA Monitor Thre]
1048 1 0 R 48296 20% 0 1% /bin/lpgd
1056 1 0 S 48576 20% 0 0% /bin/fortilinkd
760 704 0 R < 2756 1% 0 0% {sysctl} top
1006 1 0 S N 191m 82% 0 0% /bin/pyfcgid
1021 1 0 S 154m 66% 0 0% /bin/statsd
1055 1 0 S 54024 22% 0 0% /bin/cu_swtpd

<snippet>
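As a rough illustration of the check above, the following Python sketch pulls the %CPU value of each process out of saved 'fn top' text so that stpd can be compared against the other processes. It assumes the column layout shown in the example output (the last percentage on a row is %CPU and the command follows it); the function name and the trimmed sample text are hypothetical and not part of any Fortinet tool.

```python
import re

# Matches the LAST "N%" token on a row (the %CPU column) and the command after it.
ROW = re.compile(r".*\s(\d+)%\s+(\S.*)$")

def process_cpu(top_output: str) -> dict:
    """Map each command to its %CPU value from `fn top`-style text."""
    usage = {}
    for line in top_output.splitlines():
        fields = line.split()
        # Process rows start with a numeric PID; summary and header rows do not.
        if not fields or not fields[0].isdigit():
            continue
        match = ROW.match(line)
        if match:
            usage[match.group(2).strip()] = int(match.group(1))
    return usage

# Trimmed sample based on the output shown above (annotations removed).
SAMPLE_TOP = """CPU: 5% usr 10% sys 0% nic 83% idle 0% io 0% irq 0% sirq
PID PPID USER STAT VSZ %VSZ CPU %CPU COMMAND
832 2 0 DW< 0 0% 0 8% [L2ShadowTblThre]
1047 1 0 S 48808 20% 0 3% /bin/stpd
760 704 0 R < 2756 1% 0 0% {sysctl} top
"""

usage = process_cpu(SAMPLE_TOP)
# List processes from busiest to least busy to see where stpd ranks.
ranked = sorted(usage, key=usage.get, reverse=True)
print(ranked.index("/bin/stpd") + 1, usage["/bin/stpd"])
```

Collecting the output a few times during an STP flap and comparing the stpd value between runs is more informative than a single sample.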

 

FortiSwitch# get sys performance status
CPU states: 5% user 10% system 0% nice 85% idle
Memory states: 65% used <-

Uptime: 28 days, 6 hours, 2 minutes

 

Check the last line in the output below, which shows the overall throughput on the switch in real time. Verify if this number is higher than usual. This is discussed in more detail in the next step.

 

FortiSwitch# diagnose switch physical-ports linerate

<snippet> 

internal | 8541385 | 0.0161 Mbps || 11353924 | 0.0128 Mbps |
-----------------------------------------------------------
| 1834.0790 Mbps || | 1759.0804 Mbps | <- This is the overall throughput number on this switch at this moment. Compare this with any previous switch throughput benchmark numbers in the network.

 

Step 3: Check the traffic pattern for any anomalies, broadcast storms/traffic floods or frequent MAC moves.

 

Verify the traffic rate using the commands below to see if there is any abnormal traffic rate on any of the ports or a possible broadcast storm/flood, which could trigger issues in the network including STP reconvergences. It is useful to compare the linerate with any previously recorded benchmark numbers to verify how much higher or lower the current rate is during STP flaps. Additionally, note that the higher traffic rate itself might not always be the root cause of the issue. Instead, it could just be a symptom of another issue triggering higher traffic rates which needs to be further investigated using the next steps.

 

Note: When it is possible, use packet captures/SPAN to take a sample of the traffic on the affected FortiSwitch ports, study the captures in detail using a tool like Wireshark, and look for any anomalous traffic.

 

FortiSwitch# diagnose switch physical-ports linerate
Rate Display Mode: LINE_RATE
Port | TX Packets | TX Rate || RX Packets | RX Rate |
-----------------------------------------------------------------------------------------------
port1 | 3357836 | 18.05100 Mbps || 37194487 | 16.0421 Mbps |
port2 | 3228041 | 24.36400 Mbps || 1710406 | 21.0011 Mbps |
port3 | 0 | 0.0000 Mbps || 0 | 0.0000 Mbps |
port4 | 0 | 0.0000 Mbps || 0 | 0.0000 Mbps |
port24 |5218041221 | 826.02600 Mbps || 17104126306 | 921.0011 Mbps | <- Indicates very high traffic rate on this port compared to other ports, which may be due to a traffic flood or a possible broadcast storm.
port25 | 0 | 0.0000 Mbps || 0 | 0.0000 Mbps |
port26 | 0 | 0.0000 Mbps || 0 | 0.0000 Mbps |
port28 | 40266956 | 64.0629 Mbps || 5046341 | 67.0245 Mbps |
port38 | 0 | 0.0000 Mbps || 0 | 0.0000 Mbps |
port39 | 0 | 0.0000 Mbps || 0 | 0.0000 Mbps |
port52 | 0 | 0.0000 Mbps || 0 | 0.0000 Mbps |
internal | 8541385 | 0.0161 Mbps || 11353924 | 0.0128 Mbps |
-----------------------------------------------------------------------------------------------

| 1834.0790 Mbps || | 1759.0804 Mbps |
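The comparison against a benchmark can also be scripted. The sketch below is a minimal example that parses saved 'diagnose switch physical-ports linerate' text and lists the ports whose TX or RX rate exceeds a chosen threshold. It assumes the pipe-separated layout shown above and that all rates are printed in Mbps; the function names, the threshold, and the trimmed sample are hypothetical.

```python
def parse_linerate(output: str) -> dict:
    """Return {port: (tx_mbps, rx_mbps)} from linerate text (rates assumed in Mbps)."""
    rates = {}
    for line in output.splitlines():
        cells = [cell.strip() for cell in line.split("|")]
        # Port rows look like: name | TX packets | TX rate || RX packets | RX rate |
        # The header, separator, and totals rows have no packet count in cell 1.
        if len(cells) < 6 or not cells[1].isdigit():
            continue
        tx = float(cells[2].split()[0])
        rx = float(cells[5].split()[0])
        rates[cells[0]] = (tx, rx)
    return rates

def busy_ports(rates: dict, threshold_mbps: float) -> list:
    """Ports whose TX or RX rate is above the threshold, in sorted order."""
    return sorted(p for p, (tx, rx) in rates.items() if max(tx, rx) > threshold_mbps)

# Trimmed sample based on the output shown above.
SAMPLE_LINERATE = """Rate Display Mode: LINE_RATE
Port | TX Packets | TX Rate || RX Packets | RX Rate |
port1 | 3357836 | 18.05100 Mbps || 37194487 | 16.0421 Mbps |
port24 |5218041221 | 826.02600 Mbps || 17104126306 | 921.0011 Mbps |
port28 | 40266956 | 64.0629 Mbps || 5046341 | 67.0245 Mbps |
| 1834.0790 Mbps || | 1759.0804 Mbps |
"""

rates = parse_linerate(SAMPLE_LINERATE)
print(busy_ports(rates, threshold_mbps=500.0))
```

The threshold should come from the benchmark numbers recorded for the network in question rather than from a fixed value.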

 

If any specific port(s) has a higher than usual traffic rate, such as port24 in the above example output, it is possible to drill down further into this port using the following command and check for broadcasts, multicasts, unknowns, drops, errors - specifically, to see if the values shown by these counters are large and increasing rapidly.

 

FortiSwitch# diagnose switch physical-ports list port24

Port(port24) is Admin up, line protocol is up
Interface Type is Serial Gigabit Media Independent Interface(SGMII/SerDes)
Address is 84:XX:XX:XX:XX:XX, None loopback
MTU 9216 bytes, Encapsulation IEEE 802.3/Ethernet-II
half-duplex, 0 Mb/s, link type is auto
input : 7052463271827484 bytes, 42623827181232 packets, 0 errors, 10157 drops, 0 oversizes
191491 unicasts, 19033905 multicasts, 23398827436 broadcasts, 0 unknowns
output : 8391321827182245 bytes, 32937412193818 packets, 0 errors, 0 drops, 0 oversizes
89378 unicasts, 107621401 multicasts, 7715282739 broadcasts
0 fragments, 0 undersizes, 0 collisions, 0 jabbers
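To put the counters above in proportion, the following sketch computes the share of broadcasts among the input unicast, multicast, and broadcast counters. It is only an illustration: it assumes the counter line format shown in the example output, and the function names are hypothetical.

```python
import re

CASTS = re.compile(r"(\d+) unicasts, (\d+) multicasts, (\d+) broadcasts")

def input_cast_counts(port_output: str) -> dict:
    """Return the input unicast/multicast/broadcast counters.

    The first matching counter line in the output belongs to the 'input'
    section, because 'input' is printed before 'output'.
    """
    match = CASTS.search(port_output)
    if match is None:
        raise ValueError("no unicast/multicast/broadcast counters found")
    unicasts, multicasts, broadcasts = (int(v) for v in match.groups())
    return {"unicasts": unicasts, "multicasts": multicasts, "broadcasts": broadcasts}

def broadcast_share(counts: dict) -> float:
    """Fraction of counted input frames that were broadcasts."""
    total = sum(counts.values())
    return counts["broadcasts"] / total if total else 0.0

# Counter lines copied from the port24 example above.
SAMPLE_PORT = """input : 7052463271827484 bytes, 42623827181232 packets, 0 errors, 10157 drops, 0 oversizes
191491 unicasts, 19033905 multicasts, 23398827436 broadcasts, 0 unknowns
output : 8391321827182245 bytes, 32937412193818 packets, 0 errors, 0 drops, 0 oversizes
89378 unicasts, 107621401 multicasts, 7715282739 broadcasts
"""

counts = input_cast_counts(SAMPLE_PORT)
print(f"{broadcast_share(counts):.1%} of counted input frames are broadcasts")
```

A share this dominated by broadcasts, combined with counters that keep climbing between runs, is consistent with the broadcast storm scenario described above.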

 

Note: If loop guard has been configured (it is disabled by default) as discussed in section 10.5 below, look for STP loop detection logs with the log ID 8100 in the FortiSwitch logs ('FortiSwitch# execute log display'). More details are available here: FortiSwitch STP log messages.

 

MAC address moves: Check for frequent MAC address moves, i.e., the switch relearning an already learned MAC address on a different interface. Frequent moves trigger continuous updates of the MAC address table, cause high system resource usage, and can indicate Layer 2 loops in the network. To verify this on the FortiSwitch, run 'FortiSwitch# diagnose switch mac-addr list' a few times and compare the MAC address to port mappings between runs (for example, with the compare plugin in a text editor like Notepad++) to see whether the mappings change frequently. In managed switch mode, the MAC address tables of all the connected FortiSwitches can be collected in one go from the FortiGate with 'FortiGate# diagnose switch-controller dump mac-addr'. Repeat the command a few times and compare the outputs. For recommendations on limiting the MAC address table per port if necessary, refer to: FortiSwitch Dynamic MAC address learning.
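The comparison between runs can be sketched in Python as follows. Because the exact layout of the MAC table output is not shown in this article, the sketch deliberately starts from snapshots that have already been reduced to MAC-to-port mappings; how to extract those mappings from the command output is left open. The function name and the sample MAC addresses are hypothetical.

```python
def mac_moves(snapshots: list) -> dict:
    """Return {mac: [ports in the order seen]} for MACs that changed port.

    `snapshots` is a list of {mac: port} dicts in capture order, one per run
    of the MAC table command.
    """
    history = {}
    for snapshot in snapshots:
        for mac, port in snapshot.items():
            ports = history.setdefault(mac, [])
            # Record the port only when it differs from the last one seen.
            if not ports or ports[-1] != port:
                ports.append(port)
    return {mac: ports for mac, ports in history.items() if len(ports) > 1}

# Three hypothetical captures taken a few seconds apart.
captures = [
    {"00:11:22:33:44:55": "port1", "66:77:88:99:aa:bb": "port2"},
    {"00:11:22:33:44:55": "port24", "66:77:88:99:aa:bb": "port2"},
    {"00:11:22:33:44:55": "port1", "66:77:88:99:aa:bb": "port2"},
]

print(mac_moves(captures))
```

A MAC address that alternates between the same two ports across captures is a stronger loop indicator than a single move, which may simply be a device that was relocated.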

 

Step 4: Review FortiSwitch event logs.

 

If a specific FortiSwitch in the topology is already identified as a possible source of the issue, use 'FortiSwitch# execute log display' on that FortiSwitch to review the events and check the pattern of STP flaps. Review the chronology of these flaps, i.e., whether the physical ports flap first and STP then changes the port status to discarding/disabled to reflect the port flap. If this is the order of events, check why the ports are flapping physically. The most common reason for STP flaps is physical port flaps, with STP simply adjusting the topology to reflect them.

 

In the example below, observe the order of events (the log is displayed newest first, so read it from the bottom up). Port1 goes down physically (1st event) and STP reflects this change by moving the port to the disabled role and the discarding state (2nd and 3rd events). In this example, the issue was therefore not caused by STP itself: the physical port flap happened first and then triggered the STP port status changes. When STP itself causes a flap, it does not bring the interface down physically as seen below; the logs only show the role or state changing (for example, from forwarding to discarding) without a preceding port-down event.

 

FortiSwitch# execute log filter view-lines 500

FortiSwitch# execute log display

19: 2022-09-15 05:02:18 log_id=0100001401 type=event subtype=link pri=information vd=root action="port-down" user="ctrld" unit="primary" switch.physical-port="port1" status="down" msg="primary switch port port1 has gone down" <- 7th event, a few seconds later, port1 again goes down physically. This cycle repeats, causing continuous STP flaps and reconvergences.

 

20: 2022-09-15 05:01:59 log_id=0105008255 type=event subtype=spanning_tree pri=notice vd=root user="stp_daemon" action="state-change" unit="primary" switch.physical-port="port1" instanceid="0" event="state migration" oldstate="discarding" newstate="forwarding" status="None" msg="primary port port1 instance 0 changed state from discarding to forwarding" <- 6th event, port1 is next moved to forwarding state.

 

21: 2022-09-15 05:01:57 log_id=0105008255 type=event subtype=spanning_tree pri=notice vd=root user="stp_daemon" action="role-change" unit="primary" switch.physical-port="port1" instanceid="0" event="role migration" oldrole="disabled" newrole="designated" status="None" msg="primary port port1 instance 0 changed role from disabled to designated" <- 5th event, STP now moves this port STP status to 'designated'.

 

22: 2022-09-15 05:01:57 log_id=0100001400 type=event subtype=link pri=information vd=root action="port-up" user="ctrld" unit="primary" switch.physical-port="port1" status="up" msg="primary switch port port1 has come up" <- 4th event, within a second the port1 physical link comes back up again (basically port is flapping).

 

23: 2022-09-15 05:01:57 log_id=0105008255 type=event subtype=spanning_tree pri=notice vd=root user="stp_daemon" action="state-change" unit="primary" switch.physical-port="port1" instanceid="0" event="state migration" oldstate="forwarding" newstate="discarding" status="None" msg="primary port port1 instance 0 changed state from forwarding to discarding" <- 3rd event, next STP moves this port1 to discarding state.

 

24: 2022-09-15 05:01:57 log_id=0105008255 type=event subtype=spanning_tree pri=notice vd=root user="stp_daemon" action="role-change" unit="primary" switch.physical-port="port1" instanceid="0" event="role migration" oldrole="designated" newrole="disabled" status="None" msg="primary port port1 instance 0 changed role from designated to disabled" <- 2nd event, STP changes the role to disabled since port1 went down in the previous event.

 

25: 2022-09-15 05:01:57 log_id=0100001401 type=event subtype=link pri=information vd=root action="port-down" user="ctrld" unit="primary" switch.physical-port="port1" status="down" msg="primary switch port port1 has gone down" <- 1st event, port1 physically goes down due to link flap/SFP issue.
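The ordering check described above can be approximated with a short script. The sketch below parses the key=value fields of saved log lines and labels each STP transition to discarding/disabled as either following a physical link-down or not. It assumes the field names seen in the example (subtype, action, newstate, newrole, switch.physical-port) and that the lines are listed newest first; the function names are hypothetical and the sample lines are trimmed copies of the entries above.

```python
import re

FIELD = re.compile(r'([\w.\-]+)=(?:"([^"]*)"|(\S+))')
ENTRY = re.compile(r"\s*\d+:\s+(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\s+(.*)")

def parse_entry(line: str):
    """Turn one log line into a dict of its fields, or None if it does not match."""
    match = ENTRY.match(line)
    if match is None:
        return None
    entry = {key: quoted or bare for key, quoted, bare in FIELD.findall(match.group(2))}
    entry["time"] = match.group(1)
    return entry

def classify_stp_transitions(lines: list) -> list:
    """Return (time, port, cause) for each STP move to discarding/disabled.

    Lines are assumed newest first, so they are walked in reverse to get
    chronological order.
    """
    link_is_down = {}
    results = []
    for line in reversed(lines):
        entry = parse_entry(line)
        if entry is None:
            continue
        port = entry.get("switch.physical-port")
        if entry.get("subtype") == "link":
            link_is_down[port] = entry.get("action") == "port-down"
        elif entry.get("subtype") == "spanning_tree":
            if entry.get("newstate") == "discarding" or entry.get("newrole") == "disabled":
                cause = "link-down" if link_is_down.get(port) else "stp-only"
                results.append((entry["time"], port, cause))
    return results

SAMPLE_LOG = [
    '23: 2022-09-15 05:01:57 type=event subtype=spanning_tree action="state-change" '
    'switch.physical-port="port1" oldstate="forwarding" newstate="discarding"',
    '24: 2022-09-15 05:01:57 type=event subtype=spanning_tree action="role-change" '
    'switch.physical-port="port1" oldrole="designated" newrole="disabled"',
    '25: 2022-09-15 05:01:57 type=event subtype=link action="port-down" '
    'switch.physical-port="port1" status="down"',
]

for item in classify_stp_transitions(SAMPLE_LOG):
    print(item)
```

Entries labelled "link-down" point towards cabling, SFP, or peer-device issues, while "stp-only" entries suggest the change was driven by STP itself and warrant a closer look at BPDUs and TCNs.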

 

Step 5: Review crashlogs for any STP related crashes.

 

Check the crashlogs on the switch to see if any STP daemon (stpd) crashes are logged, and the frequency of these crashes. Share the crashlogs with Fortinet Support to have it decoded and analyzed further.

 

FortiSwitch# diagnose debug crashlog read

<< Snippet >>
<00179> application stpd <- STP daemon crash observed.
<00179> *** signal 11 (Segmentation fault) received ***
<00179> Register dump:
<00179> R0: XXXXXXXX R1: XXXXXXXX R3: XXXXXXXX R 5: XXXXXXXX
<00179> Trap: XXXXXXXX Error: 80000007 OldMask: 000 00000
<00179> Backtrace:
<< Snippet >>
the killed daemon is /bin/stpd: status=0xb00
stpd crashed 3 times. The latest crash was at XYZXYZXYZ <- STP daemon crashes

 

Step 6: Trace the origin of TCNs (Topology Change Notifications) in the network.

 

STP reconvergences are usually triggered by TCNs created by one or more switches in the network. Typically when a port status changes, TCNs are created and STP reconverges to reflect the change. But there are situations when TCNs would be created excessively or incorrectly which can cause repeated STP reconvergences/flaps, causing network instability. A managed FortiSwitch deployment has MSTP enabled by default, and consists of two MSTP instances:

  1. Instance 0: For data plane traffic (all ports/trunks), for all VLANs in the network (except FortiLink management VLAN).
  2. Instance 15: For control plane traffic (FortiLink/Capwap), with a FortiLink management VLAN (by default, VLAN 4094).

 

Check if the TCN Events triggered/received counters were incremented recently (in the last few minutes or hours) in either of the two instances, which would indicate STP reconvergences. Use the following command to review the diagnostic status of STP instances 0 and 15.

 

FortiSwitch# diagnose stp instance list

MST Instance Information, primary-Channel:

Instance ID 0 (CST)
Config Priority 20480
Bridge MAC abcdabcdabcd, MD5 Digest ashah31hw7a8a8s7a7sa

Root MAC abcdabcdabcd, Priority 20480, Path Cost 0, Remaining Hops 20
(This bridge is the root)

Regional Root MAC abcdabcdabcd, Priority 20480, Path Cost 0
(This bridge is the regional root)

Active Times Forward Time 15, Max Age 20, Remaining Hops 20

TCN Events Triggered 1034 (0d 3h 35m 20s ago),Received 31024(0d 0h 0m 1s ago) <- The TCN Received counter is much higher and was updated far more recently (1 second ago) than the Triggered counter (last updated locally on this switch more than 3 hours ago). This indicates that the port flaps/topology changes are not happening on this switch, but on another switch in the topology.

<Snippet>
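When many switches have to be checked, the TCN line can be parsed rather than read by eye. The following sketch extracts the Triggered and Received counters and their ages from a line in the format shown above, and applies a simple heuristic for "TCNs are arriving from elsewhere". The function names and the 300-second window are hypothetical choices for illustration only.

```python
import re

AGE = r"\((\d+)d (\d+)h (\d+)m (\d+)s ago\)"
TCN = re.compile(
    r"TCN Events\s*Triggered\s+(\d+)\s*" + AGE + r"\s*,\s*Received\s+(\d+)\s*" + AGE
)

def to_seconds(days, hours, minutes, seconds) -> int:
    """Convert the 'Nd Nh Nm Ns' age fields to seconds."""
    return int(days) * 86400 + int(hours) * 3600 + int(minutes) * 60 + int(seconds)

def tcn_summary(line: str) -> dict:
    """Extract TCN counters and ages (in seconds) from a 'TCN Events' line."""
    match = TCN.search(line)
    if match is None:
        raise ValueError("not a TCN Events line")
    g = match.groups()
    return {
        "triggered": int(g[0]),
        "triggered_age": to_seconds(*g[1:5]),
        "received": int(g[5]),
        "received_age": to_seconds(*g[6:10]),
    }

def looks_remote(summary: dict, recent_seconds: int = 300) -> bool:
    """True when TCNs were received recently but none were triggered locally."""
    return summary["received_age"] <= recent_seconds < summary["triggered_age"]

line = "TCN Events Triggered 1034 (0d 3h 35m 20s ago),Received 31024(0d 0h 0m 1s ago)"
summary = tcn_summary(line)
print(summary, looks_remote(summary))
```

A switch for which the heuristic returns False because its Triggered age is recent is a candidate origin and is worth examining with the log review from Step 4.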

 

To trace the origin of TCNs, either a top-down or a bottom-up approach (with respect to the Core, Aggregation, and Access layers) can be used, depending on how much information about the issue is available at the time of troubleshooting.

 

Note: The STP status of all the FortiSwitches in a managed FortiSwitch topology can be obtained with a single command from the FortiGate CLI using 'diagnose switch-controller switch-info stp'.

 

6.1 Bottom-up approach: If specific information is available about any users/devices reporting connectivity issues during STP flaps, use its MAC/IP address information to identify which access layer switch the user device or AP is connected to - either by using the FortiGate GUI (if using FortiSwitch in managed mode), or by using the CLI as shown below:

 

FortiGate GUI: If the FortiSwitches are in managed mode, go to the FortiGate GUI -> Dashboard -> Users & Devices -> Device Inventory -> Search, filter for the IP address or MAC address of the affected user/device, and look at the 'fortiswitch ports' column (disabled by default; it can be added using the column settings on this table). This identifies which access layer switch the device is connected to. Use this switch as the starting point to trace the TCNs: run 'FortiSwitch# diagnose stp instance list' and check the TCN Events Triggered/Received counters to determine whether this switch is triggering TCNs (which indicates topology changes on this switch) or only receiving them (from another part of the network). Continue tracing through the network (using 'FortiSwitch# get switch lldp neighbors-summary' on each switch to find its neighbor switches) and look for the origin of the TCNs with the same 'FortiSwitch# diagnose stp instance list' command, as shown in the example output above.

 

FortiGate CLI: Alternatively, run 'FortiGate# diagnose user device list' on the FortiGate and search the output for the affected user/device's IP/MAC address to identify which switch it is connected to. Once identified, follow the same procedure as described above to trace the origin of TCNs.

 

6.2 Top-down approach: If there is not enough information about which users are having connectivity issues during STP flaps, start from the Core switches and trace the origin of TCNs using the output of 'FortiSwitch# diagnose stp instance list'. Follow the path downstream, using 'FortiSwitch# get switch lldp neighbors-summary' to find the next switch, to identify where in the network the TCNs are being generated.

 

Once the origin of TCNs is located, review the logs from the switch to check for frequent port flaps (Refer to Step 4). Use the cable diagnostics on the affected ports to identify possible cable issues (note that when cable diagnostics are run, it could reset the interface - so it is recommended to run this in a maintenance window). If the port is an SFP port, use the command 'get switch module summary' to check for any issues with the SFP module. 

 

FortiSwitch# get switch module summary

  Portname   State    Type       Transceiver    RX  Vendor           Part Number      Serial Number

  port25     INSERT  SFP/SFP+    10G-Base-LR    LOS FS               SFP-10GLR-31     F2031892158   <- RX showing LOS/Loss, possible issue with SFP

  port26     INSERT  SFP/SFP+    10G-Base-LR    LOS FS               SFP-10GLR-31     F2031892157   <- RX showing LOS/Loss, possible issue with SFP

  port27     EMPTY

  port28     EMPTY
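For switches with many SFP ports, the rows reporting LOS can be filtered out of saved output with a few lines of Python. This is a minimal sketch that assumes the column order shown above (Portname, State, Type, Transceiver, RX) and that the Type and Transceiver values contain no spaces; the function name is hypothetical.

```python
def ports_with_rx_los(output: str) -> list:
    """Return the names of inserted modules whose RX column shows LOS."""
    flagged = []
    for line in output.splitlines():
        fields = line.split()
        # Populated rows: Portname State Type Transceiver RX Vendor ...
        # EMPTY rows and the header row have fewer fields or a different State.
        if len(fields) >= 5 and fields[1] == "INSERT" and fields[4] == "LOS":
            flagged.append(fields[0])
    return flagged

# Sample based on the output shown above (annotations removed).
SAMPLE_MODULES = """  Portname   State    Type       Transceiver    RX  Vendor           Part Number      Serial Number
  port25     INSERT  SFP/SFP+    10G-Base-LR    LOS FS               SFP-10GLR-31     F2031892158
  port26     INSERT  SFP/SFP+    10G-Base-LR    LOS FS               SFP-10GLR-31     F2031892157
  port27     EMPTY
  port28     EMPTY
"""

print(ports_with_rx_los(SAMPLE_MODULES))
```

Ports returned by this filter are good candidates for the cable and transceiver checks described above.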

 

Note:

In managed mode, it is possible to use the switch-controller CLI in FortiGate to speed up collecting the 'diagnose stp instance list' output from all of the FortiSwitches in the topology in one go and then trace the origin of TCNs. Use the two commands below on the FortiGate for this task as shown:

 

FortiGate# execute switch-controller diagnose stp instance [Enter]

 

This lists the serial numbers of all the switches in the topology. Copy the output to a text editor that shows line numbers, such as Notepad++, and use it to map each serial number to its TCN counters using the output of the next command shown below.

 

FortiGate# execute switch-controller diagnose stp instance | grep TCN

 

Copy this output again to a new tab in Notepad++ which will give the line numbers. Now, check where the TCNs are being triggered in the output, and map those line numbers with the previous output collected which has the serial numbers. This gives the potential list of switches where the TCNs are originating from.

 

Step 7: STP Root Bridge and Root ports selections.

 

The root bridge should be one of the FortiSwitches in the topmost layer/tier (i.e., tier-1) of the switch topology when in managed mode. Use the 'FortiSwitch# diagnose stp instance list' command to verify which switch has the root bridge role, and confirm it is one of the FortiSwitches at the top of the topology (tier-1 if MCLAG is being used). Tune the STP priority (the lowest priority value wins the election) as needed to ensure the right switch is elected as the root bridge. The rest of the switches should have their root ports pointing towards the root bridge.

 

FortiSwitch# diagnose stp instance list

MST Instance Information, primary-Channel:

Instance ID 0 (CST)
Config Priority 20480
Bridge MAC abcdabcdabcd, MD5 Digest ashah31hw7a8a8s7a7sa

Root MAC abcdabcdabcd, Priority 20480, Path Cost 0, Remaining Hops 20
(This bridge is the root) <- This output confirms that this top layer switch is the root bridge.

Regional Root MAC abcdabcdabcd, Priority 20480, Path Cost 0
(This bridge is the regional root)

Active Times Forward Time 15, Max Age 20, Remaining Hops 20

TCN Events Triggered 1034 (0d 3h 35m 20s ago),Received 31024(0d 0h 0m 1s ago)

Port Speed Cost Priority Role State HelloTime Flags
________________ ______ _________ _________ ___________ __________ _________ _______________

port1 10G 2000 128 DESIGNATED FORWARDING 2 EN ED
port2 10G 2000 128 DESIGNATED FORWARDING 2 EN ED

<<snippet>>

 

Verify that the root port on each of the non-root switches in the topology points towards the root bridge and is in the 'Forwarding' state, using the same command 'diagnose stp instance list' on each of the switches. In an MCLAG setup, the root port is usually the MLAG uplink (with the name '_FlInK1_MLAG0_') on all of the non-root switches in the topology, as shown in the example below.

 

FortiSwitch-Tier-2# diagnose stp instance list

<<Snippet>>

Port Speed Cost Priority Role State HelloTime Flags
________________ ______ _________ _________ ___________ __________ _________ _______________

port1 10G 2000 128 DESIGNATED FORWARDING 2 EN ED
port2 10G 2000 128 DESIGNATED FORWARDING 2 EN ED
port14 - 200000000 128 DISABLED DISCARDING 2 ED
internal 1G 20000 128 DESIGNATED FORWARDING 2 ED
_FlInK1_MLAG0_ 20G 1 128 ROOT FORWARDING 2 EN <- _FlInK1_MLAG0_ is the Root port, and is in the Forwarding state.

_FlInK1_ICL0_ 20G 1 128 DESIGNATED FORWARDING 2 EN
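The same verification can be done on saved output from several switches with a small parser. The sketch below reads the port table from 'diagnose stp instance list' text and reports which port holds the ROOT role and in which state. It assumes the column order shown above (Port, Speed, Cost, Priority, Role, State, HelloTime, Flags); the function names are hypothetical.

```python
def parse_stp_ports(output: str) -> list:
    """Return one dict per row of the STP port table."""
    rows = []
    for line in output.splitlines():
        fields = line.split()
        # Data rows have numeric Cost and Priority columns; the header and
        # the underscore separator row do not.
        if len(fields) >= 7 and fields[2].isdigit() and fields[3].isdigit():
            rows.append({"port": fields[0], "role": fields[4], "state": fields[5]})
    return rows

def root_ports(rows: list) -> list:
    """Return (port, state) for every port holding the ROOT role."""
    return [(row["port"], row["state"]) for row in rows if row["role"] == "ROOT"]

# Sample based on the tier-2 output shown above (annotation removed).
SAMPLE_STP = """Port Speed Cost Priority Role State HelloTime Flags
________________ ______ _________ _________ ___________ __________ _________ _______________
port1 10G 2000 128 DESIGNATED FORWARDING 2 EN ED
port14 - 200000000 128 DISABLED DISCARDING 2 ED
internal 1G 20000 128 DESIGNATED FORWARDING 2 ED
_FlInK1_MLAG0_ 20G 1 128 ROOT FORWARDING 2 EN
_FlInK1_ICL0_ 20G 1 128 DESIGNATED FORWARDING 2 EN
"""

print(root_ports(parse_stp_ports(SAMPLE_STP)))
```

On a non-root switch, a root port in any state other than FORWARDING, or a root port on an unexpected interface, is worth investigating; on the root bridge itself the list is expected to be empty.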

 

Step 8: Review the PDU counters on the FortiSwitch.

 

Check the PDU counter list output to look for any abnormally high counters for STP and other protocols on any of the ports, or traffic for protocols that are not expected on those ports. If any specific counter has a large number and is increasing very frequently, a sample of packets on the corresponding port can be collected using port mirroring/SPAN to analyze further.

 

FortiSwitch# diagnose switch pdu-counters list

primary CPU counters:
packet receive error : 0 <-
Non-zero port counters:
port1:

LACP packet : 829988
STP packet : 627889 <-
LLDP packet : 45179
unknown packet type : 177795 <-
FortiLink Discovery Resp : 88855
FortiLink Join Resp : 88928
FortiLink Echo Resp : 12
IGMPv3 Membership Report : 31135
port2:

LACP packet : 829988
STP packet : 31042
LLDP packet : 2285
unknown packet type : 22705
FortiLink Discovery Resp : 1
FortiLink Join Resp : 1
FortiLink Echo Resp : 22703
IGMPv3 Membership Report : 23

 

<< Snippet >>
port23:
LACP packet : 81099
STP packet : 1211760 <-
LLDP packet : 889515
unknown packet type : 662498 <-
IGMPv3 Membership Report : 8
unknown/non-switch port:
FortiLink Discovery Resp : 5
FortiLink Join Resp : 5
FortiLink Echo Resp : 94522
Capwap Discovery Resp : 9
Capwap Join Resp : 3
Capwap Echo Resp : 5898
Capwap WTP Event Resp : 17704
Capwap Cfg Status Resp : 3
Capwap Chg State Event Resp : 3
Capwap Cfg Update Req : 3
Capwap POE Gen Stats Req : 11796
Capwap Telemetry Resp : 7567
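Because these counters are cumulative, the rate of increase between two captures is usually more telling than the absolute values. The sketch below parses two saved 'diagnose switch pdu-counters list' outputs and reports how much a chosen counter grew per port. It assumes the 'section:' and 'name : value' layout shown above; the function names and the second snapshot's values are hypothetical.

```python
def parse_pdu_counters(output: str) -> dict:
    """Return {section: {counter name: value}} from pdu-counters text."""
    counters = {}
    section = None
    for raw in output.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.endswith(":"):
            # Lines such as "port1:" start a new section.
            section = line[:-1]
            continue
        name, sep, value = line.rpartition(":")
        if sep and section is not None and value.strip().isdigit():
            counters.setdefault(section, {})[name.strip()] = int(value)
    return counters

def counter_delta(before: dict, after: dict, counter: str = "STP packet") -> dict:
    """Return {section: increase of `counter`} between two snapshots."""
    deltas = {}
    for section, values in after.items():
        if counter in values:
            deltas[section] = values[counter] - before.get(section, {}).get(counter, 0)
    return deltas

FIRST = """Non-zero port counters:
port1:
STP packet : 627889
LLDP packet : 45179
port23:
STP packet : 1211760
"""

# Hypothetical second capture taken some time later.
SECOND = """Non-zero port counters:
port1:
STP packet : 627919
LLDP packet : 45181
port23:
STP packet : 1311760
"""

print(counter_delta(parse_pdu_counters(FIRST), parse_pdu_counters(SECOND)))
```

The port with the largest increase over the interval is a reasonable place to start a packet capture.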

 

Step 9: STP debugs for additional troubleshooting.

 

If the previous steps have not already resulted in identifying the cause of the STP issues, debugs can be used on the switch where the issue is suspected to be originating.

 

Caution:

The STP debugs are very verbose: do not use the full debug level in a production environment (unless guided by Fortinet Support for specific scenarios and during non-production hours). Instead, use STP debugs at a brief level (such as level 2), during a maintenance window and under the guidance of Fortinet Support.

 

FortiSwitch# diagnose debug reset

FortiSwitch# diagnose debug application stpd 2 <- Level 2 is usually sufficient for important STP messages.

FortiSwitch# diagnose debug enable

 

FortiSwitch# diagnose debug disable <- Disable the debugs after the activity is complete.

 

Step 10: Common Triggers and recommendations.

 

Below is a list of common triggers that could cause STP issues and some recommendations. Note that the list is not exhaustive, but is a helpful checklist to review when encountering STP issues.

 

10.1 Flap-guard: 

 

If frequent port flaps are observed in Steps 4 and 6, first review the SFP and cable connections for incorrect cabling and faulty SFPs, which can cause continuous and frequent port flaps. Use flap-guard where port flaps happen frequently to avoid continuous STP reconvergences. More details are available in FortiSwitch Flap Guard configurations.

 

10.2 Storm control: 

 

FortiSwitch can be configured with storm control to drop the excess traffic when traffic rate increases beyond a threshold (which can be configured based on expected max throughput per port) on the switch ports, and thus reduce the impact on system resources in case of a broadcast storm or loop in the network. More details about this feature are available in FortiSwitch Storm Control configurations.

 

10.3 Root Guard:

 

A port that receives a superior BPDU can become the root port. To enforce the intended topology and a consistent perimeter, use Root Guard to prevent certain ports from becoming root ports (i.e., the path to the root bridge). More details about this feature are available in FortiSwitch STP Root Guard configuration.

 

10.4 BPDU Guard:

 

In a typical network, the user-facing ports (essentially the edge ports or access ports) should not participate in STP and should not send BPDUs, so that the STP topology stays stable and consistent. To enforce this, BPDU Guard can be configured on these user-facing ports, which causes a port to go down for a short time if BPDUs are received on it. More details are available in FortiSwitch STP BPDU Guard configuration.

 

10.5 Loop Guard:

 

Loop Guard can be configured to help stop broadcast storms caused by Layer 2 loops. When it is enabled on the switch ports, it monitors the network for any downstream loops and puts the port out of service to protect the network until the loop is resolved. More details on how loop guard functions, along with configuration examples, are available in FortiSwitch Loop Guard and STP loop detection log examples.

 

10.6 Root Bridge & Root Ports:

 

Ensure the root bridge and root ports selection are optimal and as expected in the topology. In a managed FortiSwitch topology, one of the switches in the topmost tier/layer (i.e. which are directly connected to the FortiGate) should ideally be the root bridge. It is recommended to ensure the root bridge selection is triggered based on the configured root priority (lowest root priority value wins), instead of letting MAC address decide the root bridge. This is so that the topology is optimal and consistent, independent of any new switches added to the network whose MAC address could be lower than the current root bridge and hence could take over the role.
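The takeover scenario described above follows from how bridge IDs are compared: the priority is compared first, and the MAC address only breaks ties. The sketch below illustrates this with hypothetical switch names, priorities, and MAC addresses; it is a simplified model of the election, not FortiSwitch code.

```python
def elect_root(bridges: dict) -> str:
    """Return the name of the bridge with the lowest (priority, MAC) pair.

    `bridges` maps a switch name to a (priority, mac) tuple.
    """
    def bridge_id(name):
        priority, mac = bridges[name]
        # Compare the MAC numerically so that its text format does not matter.
        return (priority, int(mac.replace(":", ""), 16))
    return min(bridges, key=bridge_id)

# Hypothetical topology where every switch uses the same priority.
bridges = {
    "core-1": (32768, "84:00:00:00:00:10"),
    "access-1": (32768, "90:00:00:00:00:20"),
}
print(elect_root(bridges))  # core-1 (lowest MAC among equal priorities)

# A newly added switch with a lower MAC takes over the root role.
bridges["access-new"] = (32768, "04:00:00:00:00:01")
print(elect_root(bridges))  # access-new

# Lowering the priority on the intended root makes the result independent of MACs.
bridges["core-1"] = (20480, "84:00:00:00:00:10")
print(elect_root(bridges))  # core-1
```

This is why setting an explicit, lower priority on the intended root bridge keeps the election stable when switches are added or replaced.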

 

In a managed switch topology (i.e., FortiLink on FortiGate), if the topology has a combination of FortiSwitches and other vendor switches, the topmost tier/layer FortiSwitch (which is directly connected to the FortiGate) should ideally be the root bridge in the topology. Adjust the priority such that the correct switch becomes the root bridge. Also, ensure the root ports on each of the downstream switches are as expected (typically the upstream port).

 

In an MCLAG setup, ideally, the root port is the MLAG uplink (with the name '_FlInK1_MLAG0_') on all the non-root switches in the topology.

 

10.7 Automation stitching:

 

Use automation stitches on the FortiGate (if using managed switch mode) to parse the logs for frequent or numerous TCNs in the topology. When they are identified, actions can be taken proactively to troubleshoot further. More details on how to use a log entry to create an automation stitch that triggers an alert are available here: FortiGate Automation Stitching.

 

10.8 Upgrade FortiSwitch to supported & latest interim versions:

 

Keep the FortiSwitch firmware up to date by upgrading to the latest interim release available, since any known STP defect in the older versions will be addressed in the latest interim releases.

 

10.9 Compatibility issues:

 

Review the document below for information regarding the supported STP features, verify compatibility between the protocols and limitations. Refer to FortiSwitch - Configuring STP settings.

 

10.10 Shut down edge ports not in use:

 

It is recommended that edge ports/access layer ports that are not in use be shut down and enabled only as necessary. Together with BPDU guard on the access layer ports, this helps reduce STP issues.

 

10.11 Disable PoE status on ports where it is not needed:

 

PoE is typically needed only for devices connecting to the access layer ports. Another best practice is to disable PoE on Core and Aggregation layer switches (enabling it later as needed), and keep it enabled only on access layer switches. This helps avoid PSE-to-PSE voltage injection issues. More details are available in 'Power Fault: Error Type 36 (Port is off: Voltage injection into the port)' on FortiSwitch.

 

10.12 MCLAG deployment:

 

If MCLAG is being used on the FortiSwitches, verify that the following STP settings are configured, as they are prerequisites for the MCLAG configuration:

  1. 'mclag-stp-aware' must be enabled on the global switch level.
  2. STP must be configured on the ICL trunks.

Both of the above settings are enabled by default. Verify that they are not disabled. More details on this requirement are provided in FortiSwitch - Deploying MCLAG Topologies.

 

Related documents:

Fortinet - Switching Reference Architecture Guide

FortiSwitch Administration guide - STP

FortiSwitch FortiLink guide - STP Settings

FortiSwitch - Transceivers compatibility guide
