| Description |
The issue of 'only one tunnel active/others down' when using SLA ping checks in SD-WAN over IPsec is typically not caused by the tunnel itself. Instead, it stems from how monitoring packets are sent and received, and how the kernel resolves the next-hop for each member.
Symptom:
- The system has two or more IPsec tunnels to the same destination, aggregated in SD-WAN.
- A health-check (SLA) is configured using ping to a remote IP.
- In the GUI/CLI, only one member appears up; the others are marked down, and their SD-WAN routes show as inactive in the routing table.
- Despite this, real traffic failover works, triggered by other probes or complete link loss.
|
| Solution |
Root Cause: SLA requires per-member symmetry. SD-WAN sends health-check probes separately for each member. For a member to be considered 'up', the ICMP echo reply must return through the same member that sent the request. If the reply comes back through a different tunnel, the original member never sees its response and is marked down; this is return-path asymmetry.
Even if the outgoing interface is forced per member, the probe's source IP (if not explicitly set) might be:
- The primary IP of the interface
- A system IP that isn’t routable through that tunnel from the remote site’s perspective.
Result:
The remote site replies via a different tunnel, breaking symmetry.
Incorrect next-hop (gateway) configuration: For SD-WAN members using IPsec interfaces, the gateway should be the remote peer IP of the /30 or /31 subnet, or left undefined. If the gateway is mistakenly set to the local IP of the tunnel (or an IP that resolves via the same member), it creates recursion or a logical loop.
Probes from one member may exit or return through another, or not exit at all due to unresolved gateways. That is why only one member shows as up, while the rest appear down/inactive.
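To confirm which of the two causes applies, the member state and the actual probe path can be inspected from the CLI. The commands below are a sketch that assumes FortiOS 6.4 or later (older releases use 'virtual-wan-link' in place of 'sdwan') and a hypothetical health-check server at 10.100.0.1. The lines starting with # are annotations only and should be left out when pasting.

```
# Show the per-member health-check state (alive/dead, latency, loss).
diagnose sys sdwan health-check

# Show each SD-WAN member with its interface and resolved gateway.
diagnose sys sdwan member

# Check whether the routes through each member are installed.
get router info routing-table all

# Capture the probes; verbosity 4 prints the interface name for every packet.
# 10.100.0.1 is a placeholder for the configured health-check server.
diagnose sniffer packet any 'icmp and host 10.100.0.1' 4
```

If the sniffer shows the echo request leaving on one tunnel and the echo reply arriving on another, the problem is return-path asymmetry. If no request leaves on a given member at all, the gateway of that member is the more likely cause.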
Solution:
- Set a correct source IP per member (set source x.x.x.x).
- Set the gateway to the remote peer IP of the tunnel, or leave it unset.
- set source x.x.x.x: By assigning a stable source IP per member (e.g., a local LAN IP or the local /30 IP of the tunnel), all probes from that member use the same source.
The remote site has clear routing to that IP (via BGP/static routes or policy), so it replies through the same tunnel, the member sees its reply, and stays 'UP'.
Key point:
Without a per-member source, the remote site may choose any tunnel for the reply.
- set gateway x.x.x.x (remote peer) or unset gateway: This avoids recursion. The kernel can resolve the member’s next-hop without pointing to itself, ensuring the probe exits through the correct member. Do not use the local tunnel IP as a gateway; it is not a valid next-hop (it’s the device itself). This causes recursion/looping and makes the route inactive or causes probes to spill over to another member.
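As an illustration of both settings together, the following is a minimal sketch and not a definitive configuration. It assumes FortiOS 6.4 or later ('config system sdwan'), two hypothetical IPsec interfaces named vpn1 and vpn2 with tunnel subnets 10.254.0.0/30 and 10.254.1.0/30 (local .1, remote .2), and a hypothetical health-check named sla_ping that targets 10.100.0.1. All names and addresses are placeholders. The lines starting with # are annotations only and should be left out when pasting.

```
config system sdwan
    config members
        edit 1
            set interface "vpn1"
            # Local tunnel IP of vpn1; the remote site must route it back via vpn1.
            set source 10.254.0.1
            # Remote peer IP of vpn1 (never the local tunnel IP).
            set gateway 10.254.0.2
        next
        edit 2
            set interface "vpn2"
            # Local tunnel IP of vpn2; the remote site must route it back via vpn2.
            set source 10.254.1.1
            # Alternative to setting the remote peer IP: leave the gateway unset.
            unset gateway
        next
    end
    config health-check
        edit "sla_ping"
            set server "10.100.0.1"
            set protocol ping
            set members 1 2
        next
    end
end
```

With a distinct source per member, the remote site can return each reply through the tunnel that owns that source address, provided its own routing points each source back through the matching tunnel.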
|