Created on 07-29-2024 07:51 AM Edited on 11-20-2024 01:33 PM
Description |
This article describes how to troubleshoot issues with the Spanning Tree Protocol (STP). STP is a link-management protocol that provides a loop-free Layer 2 topology: it allows a network to have redundant paths for fault tolerance while ensuring there are no loops. When the network topology changes, for example when ports come online or go down, STP recalculates the optimal path and reconverges. Certain scenarios can trigger repeated STP port status changes and frequent reconvergences, which affect network performance. This article describes how to troubleshoot such STP issues with examples. |
Scope | FortiSwitch. |
Solution |
Spanning Tree Protocol issues like frequent STP status flaps of ports, STP loops, frequent reconvergences, suboptimal paths, etc. typically have an underlying cause that can be traced and remediated using the recommended steps below. Some of the symptoms observed during STP issues are:
Step 1: Review STP configurations.
Begin by reviewing the current FortiSwitch configuration and the existing topology (VLANs, trunks, STP, Root/BPDU/loop guards, etc.) for any incorrect settings. Simple configuration mistakes are easily overlooked, so first check for configuration mistakes or recently made changes. FortiGate offers a real-time topology diagram of all the managed FortiSwitches in the network, which can be used to verify whether the topology is as intended. Confirm the switch ports are connected as intended at all three layers: Core, Aggregation, and Access. Verify that STP is enabled on all the FortiSwitches and on all the ports where it is expected to be configured, either with a single CLI command from the FortiGate ('FortiGate# diagnose switch-controller switch-info stp') or directly on each switch ('FortiSwitch# diagnose stp instance list').
Verify whether BPDU guard is enabled only on the access switch ports (a single CLI command on the FortiGate checks the BPDU guard setting on all the connected FortiSwitches: 'FortiGate# diagnose switch-controller switch-info bpdu-guard-status'; alternatively, run 'FortiSwitch# diagnose bpdu-guard display status' directly on each switch). Verify whether Root Guard (if enabled) is enabled only on the ports that should never become root ports. Note that FortiSwitch supports STP, MSTP, and RSTP. For information on supported STP features, compatibility among the protocols, configuration examples, and limitations, review the document FortiSwitch - Configuring STP settings.
Additionally, use the built-in revision list feature on the FortiSwitch to quickly review recent configuration changes before starting to troubleshoot. More details: Review recent configuration changes on the FortiSwitch.
Review the FortiSwitch Reference Architecture Guide to verify that the deployed design is a supported architecture and follows the suggested best practices.
Step 2: Review the status of system resource usage.
Review CPU/memory usage to check whether it is higher than usual. High resource usage is usually a symptom of an underlying issue, so analyze what is causing it. Check whether resource utilization (specifically throughput on the switch) is approaching the switch performance specifications. Additionally, check the resource usage of the STP daemon (stpd) specifically.
FortiSwitch# fn top <snippet>
FortiSwitch# get sys performance status Uptime: 28 days, 6 hours, 2 minutes
Check the last line in the output below, which shows the overall throughput on the switch in real time. Verify if this number is higher than usual. This is discussed in more detail in the next step.
FortiSwitch# diagnose switch physical-ports linerate <snippet> internal | 8541385 | 0.0161 Mbps || 11353924 | 0.0128 Mbps |
Step 3: Check the traffic pattern for any anomalies, broadcast storms/traffic floods or frequent MAC moves.
Verify the traffic rate using the commands below to see if there is any abnormal traffic rate on any of the ports or a possible broadcast storm/flood, which could trigger issues in the network including STP reconvergences. It is useful to compare the linerate with any previously recorded benchmark numbers to verify how much higher or lower the current rate is during STP flaps. Additionally, note that the higher traffic rate itself might not always be the root cause of the issue. Instead, it could just be a symptom of another issue triggering higher traffic rates which needs to be further investigated using the next steps.
Note: When it is possible, use packet captures/SPAN to take a sample of the traffic on the affected FortiSwitch ports, study the captures in detail using a tool like Wireshark, and look for any anomalous traffic.
FortiSwitch# diagnose switch physical-ports linerate | 1834.0790 Mbps || | 1759.0804 Mbps |
If any specific port(s) has a higher than usual traffic rate, such as port24 in the above example output, it is possible to drill down further into this port using the following command and check for broadcasts, multicasts, unknowns, drops, errors - specifically, to see if the values shown by these counters are large and increasing rapidly.
FortiSwitch# diagnose switch physical-ports list port24 Port(port24) is Admin up, line protocol is up
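Whether a counter is "increasing rapidly" is easier to judge from two samples taken a known interval apart than from a single reading. The following is a minimal Python sketch of that comparison. It assumes the counter values have already been copied out of two runs of the command above into dictionaries; the counter names, sample values, and threshold are hypothetical and are not taken from real FortiSwitch output.

```python
def counter_rates(before, after, interval_seconds):
    """Return the per-second increase of each counter present in both samples."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    return {
        name: (after[name] - before[name]) / interval_seconds
        for name in before
        if name in after
    }


def fast_growing(before, after, interval_seconds, threshold_per_second):
    """Return the names of counters growing at or above the given rate."""
    rates = counter_rates(before, after, interval_seconds)
    return sorted(name for name, rate in rates.items() if rate >= threshold_per_second)


if __name__ == "__main__":
    # Hypothetical counter names and values, sampled 10 seconds apart.
    sample_1 = {"broadcasts": 1000, "multicasts": 500, "errors": 3}
    sample_2 = {"broadcasts": 61000, "multicasts": 560, "errors": 3}
    # The threshold is an arbitrary illustration; tune it to the known baseline.
    print(fast_growing(sample_1, sample_2, 10, threshold_per_second=1000))
```

A suitable threshold depends on the port speed and the normal traffic profile, so it is best derived from previously recorded benchmark numbers.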
Note: If loop guard has been configured (it is disabled by default) as discussed in section 10.5 below, look for STP loop detection logs with the log ID 8100 in the FortiSwitch logs ('FortiSwitch# execute log display'). More details here - FortiSwitch STP log messages
MAC address moves - Check for frequent MAC address moves (i.e., the switch relearning an already learned MAC address on a different interface, which triggers continuous MAC address table updates and high system resource usage), as these can indicate Layer 2 loops in the network. To verify this on the FortiSwitch, run 'FortiSwitch# diagnose switch mac-addr list' a few times and compare the MAC address to port mappings (for example, using the Compare plugin in a text editor like Notepad++) to see if the mappings change frequently. In managed switch mode, a single command on the FortiGate collects the MAC address tables of all the connected FortiSwitches in one go: 'FortiGate# diagnose switch-controller dump mac-addr'. Repeat the command a few times and compare the outputs. Refer to the following document for recommendations on limiting the MAC address table per port if necessary: FortiSwitch Dynamic MAC address learning.
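The snapshot comparison described above can also be scripted. The sketch below is illustrative only: it assumes each capture has already been reduced to a mapping of MAC address to port (parsing the raw CLI output is deliberately left out, since its exact layout is not shown in this article), and the sample addresses and ports are made up.

```python
from collections import defaultdict


def find_mac_moves(snapshots):
    """Return {mac: [ports in the order seen]} for MACs that changed port.

    Each snapshot is a dict mapping a MAC address to the port it was
    learned on at capture time (hypothetical, pre-parsed input).
    """
    history = defaultdict(list)
    for snapshot in snapshots:
        for mac, port in snapshot.items():
            ports = history[mac.lower()]
            # Record the port only when it differs from the last one seen.
            if not ports or ports[-1] != port:
                ports.append(port)
    return {mac: ports for mac, ports in history.items() if len(ports) > 1}


if __name__ == "__main__":
    # Three captures taken a few seconds apart (made-up sample data).
    captures = [
        {"00:11:22:33:44:55": "port1", "00:aa:bb:cc:dd:ee": "port7"},
        {"00:11:22:33:44:55": "port24", "00:aa:bb:cc:dd:ee": "port7"},
        {"00:11:22:33:44:55": "port1", "00:aa:bb:cc:dd:ee": "port7"},
    ]
    for mac, ports in find_mac_moves(captures).items():
        print(f"{mac} moved: {' -> '.join(ports)}")
```

A MAC address that bounces between the same two ports across captures is a stronger loop indicator than a single move, which may simply be a device that was re-plugged or a roaming wireless client.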
Step 4: Review FortiSwitch event logs.
If a specific FortiSwitch in the topology is already identified as a possible source of the issue, use 'FortiSwitch# execute log display' on that FortiSwitch to review the logs/events and check the pattern of STP flaps. Check the chronology of the flaps, i.e., whether the physical port flaps first and STP then changes the port status to discarding/disabled to reflect the port flap. If this is the order of events, investigate why the ports are flapping physically. The most common reason for STP flaps is physical port flaps, with STP simply adjusting the topology to reflect them.
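For long logs, this chronology check can be approximated with a script. The following Python sketch assumes the key=value log layout used in this article's log example, assumes the CLI lists the newest entry first, and uses an arbitrary two-second window; the sample lines are abbreviated copies of that layout, not complete log entries.

```python
import re
from datetime import datetime, timedelta

FIELD_RE = re.compile(r'([\w.-]+)=("[^"]*"|\S+)')
TIME_RE = re.compile(r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}")


def parse_line(line):
    """Return (timestamp, fields) for one log line, or None if it has no timestamp."""
    stamp = TIME_RE.search(line)
    if not stamp:
        return None
    fields = {key: value.strip('"') for key, value in FIELD_RE.findall(line)}
    return datetime.strptime(stamp.group(), "%Y-%m-%d %H:%M:%S"), fields


def stp_changes_following_link_down(lines, window_seconds=2):
    """List STP changes to discarding/disabled that closely follow a port-down."""
    events = [event for event in map(parse_line, lines) if event]
    events.reverse()  # assumption: the CLI prints the newest entry first
    last_down = {}
    hits = []
    for when, fields in events:
        port = fields.get("switch.physical-port")
        if fields.get("subtype") == "link" and fields.get("action") == "port-down":
            last_down[port] = when
        elif fields.get("subtype") == "spanning_tree" and port in last_down:
            inactive = (fields.get("newstate") == "discarding"
                        or fields.get("newrole") == "disabled")
            if inactive and when - last_down[port] <= timedelta(seconds=window_seconds):
                hits.append((port, fields.get("action"), when.isoformat(sep=" ")))
    return hits


if __name__ == "__main__":
    sample = [
        '23: 2022-09-15 05:01:57 subtype=spanning_tree action="state-change" '
        'switch.physical-port="port1" oldstate="forwarding" newstate="discarding"',
        '25: 2022-09-15 05:01:57 subtype=link action="port-down" '
        'switch.physical-port="port1" status="down"',
    ]
    for hit in stp_changes_following_link_down(sample):
        print(hit)
```

Ports reported by this check are candidates for a physical-layer investigation (cable, SFP, link partner) rather than an STP configuration review.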
In the example below, observe the order of events (the log lists the newest entry first, so read it from the bottom up). Port1 goes down physically (1st event), and STP simply reflects this change by moving the port to the disabled role and discarding state (2nd and 3rd events). In this example the issue was therefore not caused by STP itself: the physical port flap happened first and then triggered the STP port status changes. When STP itself is the cause, it does not bring the interface down physically as seen below; the logs only show the port's STP state or role changing (for example, from forwarding to discarding or disabled), while the port stays physically up.
FortiSwitch# execute log filter view-lines 500
FortiSwitch# execute log display
19: 2022-09-15 05:02:18 log_id=0100001401 type=event subtype=link pri=information vd=root action="port-down" user="ctrld" unit="primary" switch.physical-port="port1" status="down" msg="primary switch port port1 has gone down" <- 7th event, a few seconds later, port1 again goes down physically. This cycle repeats, causing continuous STP flaps and reconvergences.
20: 2022-09-15 05:01:59 log_id=0105008255 type=event subtype=spanning_tree pri=notice vd=root user="stp_daemon" action="state-change" unit="primary" switch.physical-port="port1" instanceid="0" event="state migration" oldstate="discarding" newstate="forwarding" status="None" msg="primary port port1 instance 0 changed state from discarding to forwarding" <- 6th event, port1 is next moved to forwarding state.
21: 2022-09-15 05:01:57 log_id=0105008255 type=event subtype=spanning_tree pri=notice vd=root user="stp_daemon" action="role-change" unit="primary" switch.physical-port="port1" instanceid="0" event="role migration" oldrole="disabled" newrole="designated" status="None" msg="primary port port1 instance 0 changed role from disabled to designated" <- 5th event, STP now moves this port STP status to 'designated'.
22: 2022-09-15 05:01:57 log_id=0100001400 type=event subtype=link pri=information vd=root action="port-up" user="ctrld" unit="primary" switch.physical-port="port1" status="up" msg="primary switch port port1 has come up" <- 4th event, within a second the port1 physical link comes back up again (basically port is flapping).
23: 2022-09-15 05:01:57 log_id=0105008255 type=event subtype=spanning_tree pri=notice vd=root user="stp_daemon" action="state-change" unit="primary" switch.physical-port="port1" instanceid="0" event="state migration" oldstate="forwarding" newstate="discarding" status="None" msg="primary port port1 instance 0 changed state from forwarding to discarding" <- 3rd event, next STP moves this port1 to discarding state.
24: 2022-09-15 05:01:57 log_id=0105008255 type=event subtype=spanning_tree pri=notice vd=root user="stp_daemon" action="role-change" unit="primary" switch.physical-port="port1" instanceid="0" event="role migration" oldrole="designated" newrole="disabled" status="None" msg="primary port port1 instance 0 changed role from designated to disabled" <- 2nd event, STP changes status to disabled since port1 went down in previous event.
25: 2022-09-15 05:01:57 log_id=0100001401 type=event subtype=link pri=information vd=root action="port-down" user="ctrld" unit="primary" switch.physical-port="port1" status="down" msg="primary switch port port1 has gone down" <- 1st event, port1 physically goes down due to link flap/SFP issue.
Step 5: Review crashlogs for any STP related crashes.
Check the crashlogs on the switch to see if any STP daemon (stpd) crashes are logged, and how frequently they occur. Share the crashlogs with Fortinet Support to have them decoded and analyzed further.
FortiSwitch# diagnose debug crashlog read << Snippet >>
Step 6: Trace the origin of TCNs (Topology Change Notifications) in the network.
STP reconvergences are usually triggered by TCNs created by one or more switches in the network. Typically when a port status changes, TCNs are created and STP reconverges to reflect the change. But there are situations when TCNs would be created excessively or incorrectly which can cause repeated STP reconvergences/flaps, causing network instability.
FortiSwitch# diagnose stp instance list
MST Instance Information, primary-Channel:
Instance ID 0 (CST)
Root MAC abcdabcdabcd, Priority 20480, Path Cost 0, Remaining Hops 20
Regional Root MAC abcdabcdabcd, Priority 20480, Path Cost 0
Active Times Forward Time 15, Max Age 20, Remaining Hops 20
TCN Events Triggered 1034 (0d 3h 35m 20s ago),Received 31024(0d 0h 0m 1s ago) <----- The TCN Received counter is incrementing faster and more recently (its timer shows 1 second ago) than the TCN Triggered counter (TCNs triggered locally on this switch), indicating that the port flaps/topology changes are not happening on this switch but on another switch in the topology.
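The reasoning applied to the TCN Events line above can be expressed as a small script. This Python sketch parses a line in the layout shown in the example output and applies a simple heuristic; the five-minute "recent" threshold and the local/remote/quiet labels are assumptions made for illustration, not FortiSwitch behavior.

```python
import re
from datetime import timedelta

TCN_RE = re.compile(
    r"Triggered\s+(\d+)\s*\((\d+)d\s+(\d+)h\s+(\d+)m\s+(\d+)s ago\)\s*,\s*"
    r"Received\s+(\d+)\s*\((\d+)d\s+(\d+)h\s+(\d+)m\s+(\d+)s ago\)"
)


def _age(days, hours, minutes, seconds):
    return timedelta(days=days, hours=hours, minutes=minutes, seconds=seconds)


def parse_tcn_line(text):
    """Extract the TCN counters and their ages, or return None if not found."""
    match = TCN_RE.search(text)
    if not match:
        return None
    numbers = [int(value) for value in match.groups()]
    return {
        "triggered": numbers[0],
        "triggered_age": _age(*numbers[1:5]),
        "received": numbers[5],
        "received_age": _age(*numbers[6:10]),
    }


def likely_origin(tcn, recent=timedelta(minutes=5)):
    """Heuristic: 'local' if this switch triggered a TCN recently,
    'remote' if it only received one recently, otherwise 'quiet'."""
    if tcn["triggered_age"] <= recent:
        return "local"
    if tcn["received_age"] <= recent:
        return "remote"
    return "quiet"


if __name__ == "__main__":
    line = "TCN Events Triggered 1034 (0d 3h 35m 20s ago),Received 31024(0d 0h 0m 1s ago)"
    print(likely_origin(parse_tcn_line(line)))
```

Because both counters are cumulative, the ages are usually more informative than the totals; sampling the same line twice and comparing the counts gives a clearer picture of the current rate.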
To trace the origin of TCNs, either top-down or bottom-up approach (w.r.t Core, Aggregation, Access layers) can be used depending on how much information about the issue is available at the time of troubleshooting.
6.1 Bottom-up approach: If specific information is available about any users/devices reporting connectivity issues during STP flaps, use its MAC/IP address information to identify which access layer switch the user device or AP is connected to - either by using the FortiGate GUI (if using FortiSwitch in managed mode), or by using the CLI as shown below:
FortiGate GUI: If the FortiSwitches are in managed mode, go to the FortiGate GUI -> Dashboard -> Users & Devices -> Device Inventory -> Search, filter for the IP address or MAC address of the affected user/device, and look at the 'fortiswitch ports' column (disabled by default; it can be added using the column settings of this table). This identifies which access layer switch the device is connected to. Use this switch as the starting point for tracing the TCNs: run 'FortiSwitch# diagnose stp instance list' and check the TCN Events Triggered/Received counters in the output to determine whether this switch is triggering TCNs itself (which indicates topology changes on this switch) or is only receiving them (from another part of the network). Continue tracing through the network (using 'FortiSwitch# get switch lldp neighbors-summary' on each switch to find its neighbor switches), checking each switch with the same 'FortiSwitch# diagnose stp instance list' command as shown in the example outputs in the previous sections, until the origin of the TCNs is found.
FortiGate CLI: Alternatively, run 'FortiGate# diagnose user device list' on the FortiGate and search the output for the affected user/device's IP/MAC address to identify which switch it is connected to. Once identified, follow the same procedure as described in the previous section to trace the origin of TCNs.
6.2 Top-down approach: If there is not enough information about exactly which users are having connectivity issues during STP flaps, start from the Core switches and trace the origin of TCNs using the output of 'FortiSwitch# diagnose stp instance list', following the path downstream with 'FortiSwitch# get switch lldp neighbors-summary' to identify where in the network the TCNs are being generated.
Once the origin of TCNs is located, review the logs from the switch to check for frequent port flaps (Refer to Step 4). Use the cable diagnostics on the affected ports to identify possible cable issues (note that when cable diagnostics are run, it could reset the interface - so it is recommended to run this in a maintenance window). If the port is an SFP port, use the command 'get switch module summary' to check for any issues with the SFP module.
FortiSwitch# get switch module summary
Portname State Type Transceiver RX Vendor Part Number Serial Number
port25 INSERT SFP/SFP+ 10G-Base-LR LOS FS SFP-10GLR-31 F2031892158 <<< RX showing LOS/Loss, possible issue with SFP
port26 INSERT SFP/SFP+ 10G-Base-LR LOS FS SFP-10GLR-31 F2031892157 <<< RX showing LOS/Loss, possible issue with SFP
port27 EMPTY
port28 EMPTY
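On switches with many SFP ports, the LOS check can be scripted. The Python sketch below relies only on the row layout visible in the example above (port name first, module state second, and the literal token LOS somewhere after it); how other RX states are printed is not shown in this article, so the sketch does not attempt to interpret them.

```python
def ports_with_rx_los(summary_text):
    """Return the ports whose row shows an inserted module and an RX value of LOS."""
    flagged = []
    for line in summary_text.splitlines():
        fields = line.split()
        # Rows look like: <port> <state> <type> <transceiver> <rx> <vendor> ...
        if len(fields) >= 2 and fields[1] == "INSERT" and "LOS" in fields[2:]:
            flagged.append(fields[0])
    return flagged


if __name__ == "__main__":
    sample = """\
Portname State Type Transceiver RX Vendor Part Number Serial Number
port25 INSERT SFP/SFP+ 10G-Base-LR LOS FS SFP-10GLR-31 F2031892158
port26 INSERT SFP/SFP+ 10G-Base-LR LOS FS SFP-10GLR-31 F2031892157
port27 EMPTY
port28 EMPTY
"""
    print(ports_with_rx_los(sample))
```

LOS on a port can equally point to the fiber or the far-end transceiver, so a flagged port is a starting point for the physical checks rather than proof of a faulty local module.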
Note: In managed mode, it is possible to use the switch-controller CLI in FortiGate to speed up collecting the 'diagnose stp instance list' output from all of the FortiSwitches in the topology in one go and then trace the origin of TCNs. Use the two commands below on the FortiGate for this task as shown:
FortiGate# execute switch-controller diagnose stp instance [Enter]
This gives the list of all switch serial numbers in the topology. Copy it to a text editor such as Notepad++, which shows line numbers, and use those to map the serial numbers to their TCN counters from the output of the next command shown below.
FortiGate# execute switch-controller diagnose stp instance | grep TCN
Copy this output again to a new tab in Notepad++ which will give the line numbers. Now, check where the TCNs are being triggered in the output, and map those line numbers with the previous output collected which has the serial numbers. This gives the potential list of switches where the TCNs are originating from.
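The manual line-number matching can be replaced by a short script run over the saved, unfiltered output of the first command. The sketch below is a rough illustration under two assumptions that should be checked against the real output: that each switch's section starts with a line containing its serial number, and that serial numbers match the pattern in SERIAL_RE (a guess at a 16-character FortiSwitch serial). The sample text is made up.

```python
import re

# Assumption: serial numbers look like 'S' followed by 15 letters/digits.
SERIAL_RE = re.compile(r"\bS[0-9A-Z]{15}\b")
COUNT_RE = re.compile(r"Triggered\s+(\d+).*?Received\s+(\d+)")


def tcn_by_switch(output):
    """Map each serial number to a list of (triggered, received) TCN counts."""
    result = {}
    current = None
    for line in output.splitlines():
        serial = SERIAL_RE.search(line)
        if serial:
            current = serial.group()
        elif current and "TCN" in line:
            counts = COUNT_RE.search(line)
            if counts:
                pair = (int(counts.group(1)), int(counts.group(2)))
                result.setdefault(current, []).append(pair)
    return result


if __name__ == "__main__":
    # Made-up sample; the real section headers may look different.
    sample = """\
S248EPTF18000001:
  TCN Events Triggered 1034 (0d 3h 35m 20s ago),Received 31024(0d 0h 0m 1s ago)
S248EPTF18000002:
  TCN Events Triggered 9812 (0d 0h 0m 2s ago),Received 12(0d 5h 1m 0s ago)
"""
    for serial, counts in tcn_by_switch(sample).items():
        print(serial, counts)
```

Switches whose Triggered count is high and still climbing between two runs are the most likely origin of the TCNs.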
Step 7: STP Root Bridge and Root ports selections.
The root bridge should be one of the FortiSwitches in the top-most layer/tier in the switch topology when in managed mode. Use the 'FortiSwitch# diagnose stp instance list' command to verify which switch has the root switch role, and confirm it is one of the FortiSwitches in the top of the topology (tier-1 if MCLAG is being used). Tune the STP priority (lowest priority value wins the election) as needed to ensure the right switch in the topology is elected as the root bridge, and the rest of the switches should have the corresponding root ports pointing to the root switch.
FortiSwitch# diagnose stp instance list
MST Instance Information, primary-Channel:
Instance ID 0 (CST)
Root MAC abcdabcdabcd, Priority 20480, Path Cost 0, Remaining Hops 20
Regional Root MAC abcdabcdabcd, Priority 20480, Path Cost 0
Active Times Forward Time 15, Max Age 20, Remaining Hops 20
TCN Events Triggered 1034 (0d 3h 35m 20s ago),Received 31024(0d 0h 0m 1s ago)
Port Speed Cost Priority Role State HelloTime Flags
port1 10G 2000 128 DESIGNATED FORWARDING 2 EN ED
<<snippet>>
Verify that the root port on each of the non-root switches in the topology is pointing towards the root bridge correctly and in the 'Forwarding' state, using the same command 'diagnose stp instance list' in each of the switches. In an MCLAG setup, the root port is usually the MLAG uplink (with the name '_FlInK1_MLAG0_') on all of the non-root switches in the topology. As shown in the example below, verify the same thing.
FortiSwitch-Tier-2# diagnose stp instance list
<<Snippet>>
Port Speed Cost Priority Role State HelloTime Flags
port1 10G 2000 128 DESIGNATED FORWARDING 2 EN ED
_FlInK1_ICL0_ 20G 1 128 DESIGNATED FORWARDING 2 EN
Step 8: Review the PDU counters on the FortiSwitch.
Check the PDU counter list output to look for any abnormally high counters for STP and other protocols on any of the ports, or traffic for protocols that are not expected on those ports. If any specific counter has a large number and is increasing very frequently, a sample of packets on the corresponding port can be collected using port mirroring/SPAN to analyze further.
FortiSwitch# diagnose switch pdu-counters list primary CPU counters: LACP packet : 829988 LACP packet : 829988
<< Snippet >>
Step 9: STP debugs for additional troubleshooting.
If the previous steps have not already resulted in identifying the cause of the STP issues, debugs can be used on the switch where the issue is suspected to be originating.
Caution: The STP debugs are very verbose: do not use the full debug level in a production environment (unless guided by Fortinet Support for specific scenarios and during non-production hours). Instead, use STP debugs in brief mode (like level 2) - during a maintenance window and under the guidance of Fortinet support.
FortiSwitch# diagnose debug reset
FortiSwitch# diagnose debug application stpd 2 <- Level 2 is usually sufficient for important STP messages.
FortiSwitch# diagnose debug enable
FortiSwitch# diagnose debug disable <- Disable the debugs after the activity is complete.
Step 10: Common Triggers and recommendations.
Below is a list of common triggers that could cause STP issues and some recommendations. Note that the list is not exhaustive, but is a helpful checklist to review when encountering STP issues.
10.1 Flap-guard:
If frequent port flaps are observed in steps 4 and 6, first review the SFP/cable connections for incorrect cabling and faulty SFPs, which can cause continuous and frequent port flaps. Then use flap-guard on the affected ports to avoid continuous STP reconvergences. More details are available in FortiSwitch Flap Guard configurations.
10.2 Storm control:
FortiSwitch can be configured with storm control to drop the excess traffic when traffic rate increases beyond a threshold (which can be configured based on expected max throughput per port) on the switch ports, and thus reduce the impact on system resources in case of a broadcast storm or loop in the network. More details about this feature are available in FortiSwitch Storm Control configurations.
10.3 Root Guard:
A port that receives a superior BPDU can become the root port. To enforce the intended topology and a consistent perimeter, use Root Guard to prevent certain ports from becoming root ports (i.e., the path to the root bridge). More details about this feature are available in FortiSwitch STP Root Guard configuration.
10.4 BPDU Guard:
In a typical network, the user-facing ports (essentially the edge ports or access ports) should not participate in STP and should not send BPDUs, to maintain a stable and consistent STP topology. To enforce this, BPDU Guard can be configured on these user-facing ports, which causes a port to go down for a short time if BPDUs are received on it. More details are available in FortiSwitch STP BPDU Guard configuration.
10.5 Loop Guard:
Loop Guard can be configured to help stop broadcast storms caused by Layer 2 loops. When it is enabled on the switch ports, it monitors the network for any downstream loops and takes the port out of service to protect the network until the loop is resolved. More details on how loop guard functions, along with configuration examples, are available in FortiSwitch Loop Guard and STP loop detection log examples.
10.6 Root Bridge & Root Ports:
Ensure the root bridge and root ports selection are optimal and as expected in the topology. In a managed FortiSwitch topology, one of the switches in the topmost tier/layer (i.e. which are directly connected to the FortiGate) should ideally be the root bridge. It is recommended to ensure the root bridge selection is triggered based on the configured root priority (lowest root priority value wins), instead of letting MAC address decide the root bridge. This is so that the topology is optimal and consistent, independent of any new switches added to the network whose MAC address could be lower than the current root bridge and hence could take over the role.
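The effect described above follows from how STP compares bridge identifiers: the priority is compared first, and the MAC address only breaks ties, with the lowest value winning in both cases. The Python sketch below illustrates this; the switch names, MAC addresses, and the default priority of 32768 are hypothetical sample values.

```python
def elect_root(bridges):
    """Return the name of the bridge with the lowest (priority, MAC) pair.

    bridges is an iterable of (name, priority, mac) tuples.
    """
    def bridge_id(bridge):
        _, priority, mac = bridge
        return (priority, int(mac.replace(":", ""), 16))

    return min(bridges, key=bridge_id)[0]


if __name__ == "__main__":
    # All priorities left at the same value: the lowest MAC address wins,
    # so a newly added access switch can take over the root role.
    untuned = [
        ("core-1", 32768, "70:4c:a5:00:00:10"),
        ("access-9", 32768, "04:d5:90:00:00:01"),
    ]
    print(elect_root(untuned))

    # Lower priority configured on the core switch: it wins regardless of MAC.
    tuned = [
        ("core-1", 20480, "70:4c:a5:00:00:10"),
        ("access-9", 32768, "04:d5:90:00:00:01"),
    ]
    print(elect_root(tuned))
```

Setting an explicitly lower priority on the intended root, and optionally a slightly higher one on a backup root, keeps the election outcome independent of which hardware is added later.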
In a managed switch topology (i.e., FortiLink on FortiGate) that combines FortiSwitches and other vendors' switches, the topmost tier/layer FortiSwitch (which is directly connected to the FortiGate) should ideally be the root bridge in the topology. Adjust the priority such that the correct switch becomes the root bridge. Also, ensure the root ports on each of the downstream switches are as expected (typically the upstream port).
In an MCLAG setup, ideally, the root port is the MLAG uplink (with the name '_FlInK1_MLAG0_') on all the non-root switches in the topology.
10.7 Automation stitching:
Use FortiGate automation stitching on the FortiGate (if using managed switch mode) to parse the logs for frequent and large TCNs in the topology. When identified, actions can be taken proactively to further troubleshoot. More details on how to use a log entry to create an automation stitch to trigger an alert are here - FortiGate Automation Stitching.
10.8 Upgrade FortiSwitch to supported & latest interim versions:
Keep the FortiSwitch firmware up to date by upgrading to the latest interim release available, since any known STP defect in the older versions will be addressed in the latest interim releases.
10.9 Compatibility issues:
Review the document below for information regarding the supported STP features, verify compatibility between the protocols and limitations. Refer to FortiSwitch - Configuring STP settings.
10.10 Shutdown edge ports not in use:
It is recommended to ensure the edge ports/access layer ports that are not in use be shutdown, and enabled only as necessary. This recommendation along with BPDU guard on the access layer ports helps with reducing STP issues.
10.11 Disable PoE status on ports where it is not needed:
PoE is typically needed only for devices connecting to the access layer ports. Another best practice is to disable PoE on Core and Aggregation layer switches (enable it later as needed) and keep it enabled only on access layer switches. This helps avoid PSE-to-PSE voltage injection issues. More details are available in 'Power Fault: Error Type 36 (Port is off: Voltage injection into the port)' on FortiSwitch.
10.12 MCLAG deployment:
If MCLAG is being used on the FortiSwitches, verify the following STP settings are configured as it is a prerequisite for MCLAG config:
Both of the above settings are enabled by default. Verify that they are not disabled. More details on this requirement are provided in FortiSwitch - Deploying MCLAG Topologies.
Related documents: Fortinet - Switching Reference Architecture Guide FortiSwitch Administration guide - STP |
Copyright 2024 Fortinet, Inc. All Rights Reserved.