FortiGate
achacon
Staff
Article Id 418306
Description

This article describes the different states that an SD-WAN member interface can be in (according to Performance SLAs) and the timers that are used to determine how long a transition takes from Dead to Alive, as well as Out-of-SLA to In-SLA. Notably, these states are used to determine if an SD-WAN member will be used to handle outgoing traffic, and they also determine how much time must pass before the link will be utilized.

 

Important: the calculation of these transition timers differs significantly between FortiOS v7.6.3 and earlier and FortiOS v7.6.4 and later. In particular, the amount of time required to transition from Out-of-SLA to In-SLA changes and is discussed further below.

Scope

FortiGate (v7.6.3 and earlier, v7.6.4 and later), SD-WAN.
Solution

SD-WAN Rules on the FortiGate have two separate sets of states used to determine if the interface is eligible to be used for traffic forwarding: Dead or Alive and Out-of-SLA or In-SLA. Both sets of states are determined using SD-WAN Performance SLAs (aka health checks), but are evaluated using different mechanisms within the health checks:

 

Dead/Alive indicates the state of the SD-WAN member's upstream network connectivity, according to probes sent out from the Performance SLA/health check. If no responses are received multiple times in a row, then the interface is set to the Dead state; otherwise, the interface is considered to be Alive.

 

Out-of-SLA/In-SLA indicates whether an SD-WAN member is exceeding the acceptable performance thresholds set for latency, jitter, and/or packet-loss (which are measured using the probes sent out by the Performance SLA). If these metrics exceed the SLA threshold, then the interface is considered to be Out-of-SLA (i.e., link quality is poorer than expected); otherwise, it is considered to be In-SLA.

 

Important: SD-WAN members can have separate states for each set. For example, an interface can be both Alive and also Out-of-SLA if the upstream health check server is reachable, but the measured latency exceeds the threshold set in the Performance SLA. Additionally:

  • Dead SD-WAN members will not be used for forwarding traffic, only Alive members.
  • Dead members are also generally considered to be Out-of-SLA (i.e., it is not possible to have a Dead + In-SLA member).
  • Lowest Cost (SLA) SD-WAN rules will prefer Alive + In-SLA interfaces over Alive + Out-of-SLA. However, Manual and Best Quality rules do not take SLA state into account (only Dead/Alive status).
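The three rules above can be summarized as a small selection function. This is only an illustrative sketch, not FortiOS logic: the Member class, the usable_members function, and the strategy labels are hypothetical names, and the fallback to Alive + Out-of-SLA members when no member is In-SLA is an assumption read from the word "prefer".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Member:
    """Hypothetical view of one SD-WAN member's two independent states."""
    name: str
    alive: bool
    in_sla: bool


def usable_members(members, strategy):
    """Return the members a rule would consider, per the bullets above."""
    # Dead members are never used, whatever the rule strategy.
    alive = [m for m in members if m.alive]
    if strategy == "lowest-cost-sla":
        # Prefer Alive + In-SLA; assumed fallback to Alive + Out-of-SLA.
        in_sla = [m for m in alive if m.in_sla]
        return in_sla if in_sla else alive
    # Manual and Best Quality only look at Dead/Alive.
    return alive


members = [
    Member("wan1", alive=True, in_sla=False),
    Member("wan2", alive=True, in_sla=True),
    Member("wan3", alive=False, in_sla=False),
]
print([m.name for m in usable_members(members, "lowest-cost-sla")])  # ['wan2']
print([m.name for m in usable_members(members, "manual")])  # ['wan1', 'wan2']
```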

 

Calculating state transition time:

For reference, state transition timers are based on the following crucial parameters, which are configured within the SD-WAN Performance SLA settings:

  • interval - time interval (in milliseconds) in-between sending probes to the health check target server.
  • failtime - number of probe failures in a row that must occur before a server is considered lost.
  • recoverytime - number of successful probe responses that must be received before a server is considered recovered.

 

config system sdwan
    config health-check
        edit <name>
            set server <IP/FQDN>
            set interval <20 - 3600, default = 500 milliseconds>
            set failtime <1 - 3600, default = 5>
            set recoverytime <1 - 3600, default = 5>
        next
    end
end

 

Important: in FortiOS v7.6.3 and earlier, the interval and recoverytime settings impact both the Dead to Alive transition timer (#2) and the Out-of-SLA to In-SLA transition timer (#4). Be cautious of setting these to high values, as they can add significant delays to the amount of time that must pass before an SD-WAN interface fully recovers to a state of Alive + In-SLA.

 

The following state transition timers utilize a combination of the above factors, as described by the following sections:

 

  1. Transition from Alive to Dead.

The following calculation can be used to determine the number of probe failures and the period of time that must pass for an SD-WAN member to be marked as Dead. Note that the failures must be consecutive to trigger this transition:

 

Alive to Dead Timer (in seconds) = failtime * interval/1000

 

  2. Transition back from Dead to Alive.

To transition back from the Dead state to the Alive state, use the following calculation. Like the Alive to Dead transition, the FortiGate must receive multiple successful probe responses in a row to trigger this transition:

 

Dead to Alive Timer (in seconds) = recoverytime * interval/1000
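As a quick way to check values before applying them, the two formulas above can be written as small helper functions. The function names are illustrative only; the example uses the default values listed in the configuration reference (interval 500, failtime 5, recoverytime 5).

```python
def alive_to_dead_seconds(failtime, interval_ms):
    """Seconds of consecutive probe failures before a member is marked Dead."""
    return failtime * interval_ms / 1000


def dead_to_alive_seconds(recoverytime, interval_ms):
    """Seconds of consecutive probe successes before a member is marked Alive."""
    return recoverytime * interval_ms / 1000


# Default settings: interval=500 ms, failtime=5, recoverytime=5
print(alive_to_dead_seconds(5, 500))  # 2.5
print(dead_to_alive_seconds(5, 500))  # 2.5
```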

 

  3. Transition from In-SLA to Out-of-SLA.

To transition from In-SLA to Out-of-SLA, the SD-WAN member interface must a) meet/exceed the configured SLA threshold (i.e., too much latency, jitter, and/or packet-loss), and also b) continue to exceed the SLA threshold for an additional period of time (as opposed to briefly exceeding and then falling below the threshold). Note that this scenario applies to cases where the member interface does not go to the Dead state.

 

First, the FortiGate must calculate a measurement for the latency, jitter, and packet-loss metrics. Each metric is an average based on a sliding window of the most recent health check probes:

  • Latency threshold: Calculated based on last 30 probes (default SLA threshold = 5ms)**.
  • Jitter threshold: Calculated based on last 30 probes (default SLA threshold = 5ms)**.
  • Packet Loss threshold: Calculated based on the last 100 probes (default SLA threshold = 0%).

 

**Adjustable via the probe-count option found in the Performance SLA settings (affects latency and jitter only, see config system sdwan for more information).

 

To exceed the SLA threshold, the FortiGate must receive enough poor results from the health check probes to increase the measurement. For latency and jitter, this can be difficult to calculate exactly since it depends on how significantly the measurements change from baseline, but packet-loss is much simpler to calculate since it is measured simply in terms of successful vs. failed probe responses:

 

Time to reach packet-loss threshold (in seconds) = |packetloss-threshold - CPL| * interval/1000,

where CPL = Current Packet Loss measurement
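The same formula as a short helper, for experimenting with different values. The function name is illustrative, and the sketch assumes, as the formula does, that every probe during the period has the same result (all failures when loss is rising, all successes when it is falling).

```python
def seconds_to_packet_loss_threshold(threshold_pct, current_pct, interval_ms):
    """Time for measured packet-loss to move from current_pct to threshold_pct."""
    # One probe shifts the 100-probe measurement by one percentage point.
    return abs(threshold_pct - current_pct) * interval_ms / 1000


# Rising loss: 0% measured, 5% threshold, 500 ms interval
print(seconds_to_packet_loss_threshold(5, 0, 500))  # 2.5
# Falling loss: 100% measured, 15% threshold, 2000 ms interval
print(seconds_to_packet_loss_threshold(15, 100, 2000))  # 170.0
```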

 

Once the SLA threshold is initially exceeded, it must continue to be exceeded for a period approximately equal to the following calculation:

 

SLA threshold-exceeded timer (in seconds) = failtime * interval/1000

 

Note: The 'Time to reach packet-loss threshold' timer calculates the absolute value of packetloss-threshold minus the current packet-loss measured, so it can be used for both increasing and decreasing packet-loss approaching the set threshold.

 

Additionally, packet-loss percentage is calculated based on the results of the past 100 health check probes, so ‘bad’ probe results must be fully replaced with ‘good’ results for the packet-loss percentage to decrease. To transition from non-zero packet-loss to 0% packet-loss, use the following calculation:

 

Non-zero to 0% packet-loss (in seconds) = 100 * interval/1000

 

For example, if 4 probe failures are received initially, then packet-loss will register at 4% and remain at this level until 96 probe successes have been received, after which packet-loss will tick downward 1% at a time until 100 probe successes have been received in total. Note as well that probes are always being sent regardless of the interface being Alive or Dead, and so measured packet-loss can start to decrease even when the interface is in the Dead state.
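The sliding-window behavior in this example can be reproduced with a short simulation. This is a simplified model rather than the FortiOS implementation: the function name is illustrative, and the handling of the warm-up period (fewer than 100 probes seen) is an assumption.

```python
from collections import deque

WINDOW = 100  # packet-loss is measured over the last 100 probes


def packet_loss_trace(results, window=WINDOW):
    """Return the packet-loss percentage measured after each probe result."""
    recent = deque(maxlen=window)  # oldest result drops out automatically
    trace = []
    for ok in results:
        recent.append(ok)
        lost = sum(1 for r in recent if not r)
        trace.append(100 * lost / len(recent))
    return trace


# 100 good probes, then 4 failures, then 100 good probes
probes = [True] * 100 + [False] * 4 + [True] * 100
trace = packet_loss_trace(probes)

print(trace[103])  # 4.0 -> right after the 4 failures
print(trace[199])  # 4.0 -> still 4% after 96 successes
print(trace[200])  # 3.0 -> the 97th success pushes out the first failure
print(trace[203])  # 0.0 -> back to 0% after 100 successes
```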

 

  4. Transition from Out-of-SLA to In-SLA.

Important: In FortiOS v7.6.3 and earlier, a timer delay exists when transitioning from Out-of-SLA to In-SLA that is separate from, and runs in parallel to, the actual calculation of the SLA metrics. This timer is similar to but fully separate from the Dead to Alive timer described in Section 2 above, and is calculated as follows:

 

Out-of-SLA to In-SLA delay timer (in seconds) = recoverytime * interval/1000

 

This timer is triggered when the member interface is Alive and its SLA metrics are below thresholds, and its purpose is to prevent the SD-WAN member from flapping between SLA states. This timer is not typically an issue when using relatively short values for recoverytime and interval, but setting these values too high can result in excessive delays (see Scenario 2 in the Diagrams and Visualization section below).

 

In FortiOS v7.6.4 and later (Change #1142171), this timer was optimized so that a member that has transitioned from Dead to Alive will go to In-SLA as soon as the metrics fall below the threshold, rather than needing to also wait for the Out-of-SLA to In-SLA delay timer.

 

Demonstration:

The following example will demonstrate the timers involved with an interface transitioning through the following states: Alive -> Dead -> Alive + Out-of-SLA -> Alive + In-SLA.

 

The following Performance SLA configuration will be used as an example to demonstrate how all of the above timers function. Assume that packet-loss goes to 100% during the initial Alive to Dead transition:

 

config health-check
    edit 'Example_SLA'
        set server '8.8.8.8'
        set interval 2000
        set failtime 2
        set recoverytime 60
        config sla
            edit 1
                set link-cost-factor packet-loss
                set packetloss-threshold 15
            next
        end
    next
end

 

  1. Transition from Alive to Dead.

With a failtime of 2 and an interval of 2000, it will take 2 consecutive probe failures in a 4-second-long period for the interface to be considered Dead.

 

Alive to Dead Timer (in seconds) = failtime * interval/1000 = 2 * 2000/1000 = 4 seconds

 

  2. Transition from Dead to Alive + Out-of-SLA.

With a recoverytime of 60 and an interval of 2000, it will take 60 consecutive probe successes in a 120-second-long period for the interface to transition from Dead to Alive + Out-of-SLA:

 

Dead to Alive Timer (in seconds) = recoverytime * interval/1000 = 60 * 2000/1000 = 120 seconds

 

  3. Transition from Alive + Out-of-SLA to Alive + In-SLA.

To transition from Alive + Out-of-SLA to Alive + In-SLA, the SLA metric (in this case, packet-loss) must first fall below the configured threshold of 15% packet-loss. Notably, the packet-loss counter is always being measured (even during the Dead state), and so the measured packet-loss will actually tick downward during the previous Step 2 and could potentially drop below the SLA threshold, depending on the starting conditions and the configured settings.

 

With an interval of 2000ms and a packetloss-threshold of 15 (failtime does not affect this calculation), it will require 85 consecutive probe successes (170 seconds total) for packet-loss to go from 100% down to the 15% threshold:

 

Time to reach packet-loss threshold (in seconds) = |packetloss-threshold - CPL| * interval/1000 = |15 - 100| * 2000/1000 = 85 * 2 = 170 seconds

 

As soon as the measured packet-loss falls below the SLA threshold, the 'Out-of-SLA to In-SLA delay timer' is triggered. As noted in Section 4 above, a recoverytime of 60 and an interval of 2000 results in a flat timer of 120 seconds that must expire before the interface may be marked as In-SLA (v7.6.3 and earlier):

 

Out-of-SLA to In-SLA delay timer (in seconds) = recoverytime * interval/1000 = 60 * 2000/1000 = 120 seconds

 

Diagrams and Visualization:

The following diagram displays the timers in sequence using the same settings from the above demonstration:

 

Scenario 1: interval=2000, failtime=2, recoverytime=60, packetloss-threshold=15%

 

As described above, the interface transitions from Alive to Dead within 2 failed probes (4 seconds), and it recovers from Dead to Alive + Out-of-SLA within roughly 120 seconds. Manual and Best Quality rules may use this interface since it is Alive, but Lowest Cost (SLA) rules would generally not, since it is Out-of-SLA (packet-loss is still not quite below the SLA threshold).

 

Once the measured packet-loss is below the SLA threshold, the 'Out-of-SLA to In-SLA delay timer' is started. This results in a 120-second period where packet-loss is below the SLA threshold, but the member interface is not actually considered as In-SLA from an SD-WAN rules perspective until the timer has elapsed.

 

For comparison, consider this alternative scenario that uses the same settings except for recoverytime, which is increased to a value of 90 instead of 60:

 

Scenario 2: interval=2000, failtime=2, recoverytime increased to 90, packetloss-threshold=15%

 

 

 

With an increased recoverytime, the 'Dead to Alive Timer' and 'Out-of-SLA to In-SLA delay timer' both increase to 180 seconds. Notably, packet-loss in this scenario actually drops below the 15% threshold while the interface is still Dead, and so the 'Out-of-SLA to In-SLA delay timer' is started as soon as the interface transitions from Dead to Alive + Out-of-SLA. This is important to understand because setting recoverytime and interval to excessively large values can result in scenarios where SLA metrics are well below thresholds, and yet the interface may not be usable for Lowest Cost (SLA) rules since it is not considered to be In-SLA yet.
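To compare Scenario 1 and Scenario 2 side by side, the recovery sequence can be approximated with a small calculation. This is a simplified model built only from the formulas in this article, not a reproduction of FortiOS internals: the function and key names are illustrative, times are counted from the first successful probe after the outage, every later probe is assumed to succeed, and measured packet-loss is assumed to start at 100%.

```python
def recovery_timeline(interval_ms, recoverytime, threshold_pct,
                      start_loss_pct=100, legacy_delay=True):
    """Approximate recovery milestones, in seconds after the first good probe.

    legacy_delay=True models v7.6.3 and earlier (extra Out-of-SLA to In-SLA
    delay timer); legacy_delay=False models v7.6.4 and later.
    """
    step = interval_ms / 1000
    alive_at = recoverytime * step  # Dead to Alive timer
    metrics_ok_at = abs(threshold_pct - start_loss_pct) * step
    # The delay timer only starts once the member is Alive AND metrics are OK.
    both_ok_at = max(alive_at, metrics_ok_at)
    delay = recoverytime * step if legacy_delay else 0
    return {
        "alive_at": alive_at,
        "metrics_ok_at": metrics_ok_at,
        "in_sla_at": both_ok_at + delay,
    }


# Scenario 1: interval=2000, recoverytime=60, packetloss-threshold=15
print(recovery_timeline(2000, 60, 15))
# {'alive_at': 120.0, 'metrics_ok_at': 170.0, 'in_sla_at': 290.0}

# Scenario 2: recoverytime increased to 90
print(recovery_timeline(2000, 90, 15))
# {'alive_at': 180.0, 'metrics_ok_at': 170.0, 'in_sla_at': 360.0}

# Scenario 2 settings on v7.6.4 and later (no extra delay timer)
print(recovery_timeline(2000, 90, 15, legacy_delay=False)["in_sla_at"])  # 180.0
```

Under these assumptions, raising recoverytime from 60 to 90 moves the In-SLA point from roughly 290 to 360 seconds on v7.6.3 and earlier, even though the packet-loss measurement itself recovers at the same 170-second mark in both cases.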

 

One more alternative scenario to consider is the case of an SD-WAN member staying Alive but transitioning from In-SLA to Out-of-SLA and then back (this is a scenario where the 'threshold-exceeded timer' comes into play). The following scenario has a significantly different set of settings, with interval=2000, failtime increased to 20, recoverytime set back to 60, and packetloss-threshold decreased to 5%:

 

Scenario 3: interval=2000, failtime increased to 20, recoverytime set back to 60, and packetloss-threshold decreased to 5%

 

Note how measured packet-loss exceeds the 5% threshold for a significant period of time (40 seconds) due to the 'threshold-exceeded timer'. As a reminder, this timer is failtime * interval/1000, and so setting an excessively high failtime can result in delays before an interface is transitioned from In-SLA to Out-of-SLA.

 

Reminder regarding FortiOS v7.6.4 and later:

As a final reminder, FortiOS v7.6.4 and later have removed the 'Out-of-SLA to In-SLA delay timer' when transitioning back from Dead to Alive + Out-of-SLA, so as soon as an interface is both Alive and has measured SLA metrics (latency, jitter, packet-loss) below the configured thresholds, it is immediately transitioned to Alive + In-SLA.

 

Related documents:

Link Health Monitor

Monitoring performance SLA

Technical Tip: Understanding SLA Target Map