Solution
Topology:

In this topology, GRE tunnel failover between the FortiGate HA cluster and the AWS Transit Gateway is designed as follows:
- When FGT01 is the primary, the GRE tunnel tgwc is active and routes traffic between the FortiGate and the application subnet 192.168.50.0/24.
- When FGT02 is the primary, the GRE tunnel tgw1 is active and routes traffic between the FortiGate and the application subnet 192.168.50.0/24.
- Server load balancing (a virtual server) is configured on the FortiGate to perform DNAT to the internal/application servers, as sketched below.
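The exact virtual server configuration is not included in the outputs below, so the following is only a minimal sketch reconstructed from the diagnose output (VIP name VS_UBUNTU, external IP 192.168.64.43, real server 192.168.50.177, TCP port 22, health check "Ubuntu"); the external interface and other settings are assumptions:

config firewall vip
    edit "VS_UBUNTU"
        set type server-load-balance
        set extip 192.168.64.43
        set extintf "any"
        set server-type tcp
        set extport 22
        set monitor "Ubuntu"
        config realservers
            edit 1
                set ip 192.168.50.177
                set port 22
            next
        end
    next
end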
The connection worked as expected while FGT02 was the primary unit. The following outputs show the routing table, GRE tunnel status, and virtual server status with FGT02 as the primary:
FGTVM-Connect-2 # get router info routing-table all
Codes: K - kernel, C - connected, S - static, R - RIP, B - BGP
       O - OSPF, IA - OSPF inter area
       N1 - OSPF NSSA external type 1, N2 - OSPF NSSA external type 2
       E1 - OSPF external type 1, E2 - OSPF external type 2
       i - IS-IS, L1 - IS-IS level-1, L2 - IS-IS level-2, ia - IS-IS inter area
       V - BGP VPNv4
       * - candidate default

Routing table for VRF=0
S*      0.0.0.0/0 [10/0] via 10.0.0.65, port1, [1/0]
S       1.0.0.0/8 [10/0] via 10.0.0.81, port2, [1/0]
C       10.0.0.64/28 is directly connected, port1
C       10.0.0.80/28 is directly connected, port2
C       169.254.102.0/29 is directly connected, tgw1
C       169.254.102.1/32 is directly connected, tgw1
C       169.254.120.0/29 is directly connected, tgwc
C       169.254.120.1/32 is directly connected, tgwc
S *>    172.20.1.0/24 [10/0] via IPSec1 tunnel x.x.x.x, [1/0]
S *>    172.20.2.0/24 [10/0] via IPSec2 tunnel y.y.y.y, [1/0]
B       192.168.50.0/24 [20/100] via 169.254.102.2 (recursive is directly connected, tgw1), 00:01:02, [1/0]
B       192.168.100.0/24 [20/100] via 169.254.102.2 (recursive is directly connected, tgw1), 00:01:02, [1/0]
FGTVM-Connect-2 # diagnose sys gre list
IPv4:
vd=0 devname=tgw1 devindex=4 ifindex=31
   saddr=10.0.0.86 daddr=1.0.0.69 rpdb=0 ref=0
   key=0/0 flags=0/0 dscp-copy=0 diffservcode=000000
   RX bytes:30217 (29.5 kb)  TX bytes:14937 (14.5 kb); RX packets:348, TX packets:174, TX carrier_err:0 collisions:0
   npu-info: asic_offload=0, enc/dec=0/0, enc_bk=0/0/0/0, dec_bk=0/0/0/0
   rpdb-ver: ffffffff rpdb-gwy: 0.0.0.0 rpdb-oif: 0
vd=0 devname=tgwc devindex=4 ifindex=15
   saddr=10.0.0.21 daddr=1.0.0.68 rpdb=0 ref=0
   key=0/0 flags=0/0 dscp-copy=0 diffservcode=000000
   RX bytes:200059 (195.3 kb)  TX bytes:55811 (54.5 kb); RX packets:2358, TX packets:667, TX carrier_err:4 collisions:0
   npu-info: asic_offload=0, enc/dec=0/0, enc_bk=0/0/0/0, dec_bk=0/0/0/0
   rpdb-ver: ffffffff rpdb-gwy: 0.0.0.0 rpdb-oif: 0
total tunnel = 2
FGTVM-Connect-2 # diagnose firewall vip realserver list
alloc=2
------------------------------
vf=0 name=VS_UBUNTU/1 class=4 type=0 192.168.64.43:(22-22), protocol=6
        total=1 alive=1 power=1 ptr=635488
        ip=192.168.50.177-192.168.50.177/22 adm_status=0 holddown_interval=300 max_connections=0 weight=1 option=01
        alive=1 total=1 enable=00000001 alive=00000001 power=1
        src_sz=0
        id=0 status=up ks=0 us=0 events=5 bytes=0 rtt=0
However, when FGT01 takes over the primary role, traffic fails and no response is observed from the server:
FGTVM-Connect # diagnose sniffer packet any 'host 192.168.64.43' 4 0 l
Using Original Sniffing Mode
interfaces=[any]
filters=[host 192.168.64.43]
2025-11-07 01:22:06.887152 IPSec1 in 172.20.1.1.55608 -> 192.168.64.43.22: syn 312950751
2025-11-07 01:22:08.634742 IPSec2 in 172.20.2.1.55371 -> 192.168.64.43.22: syn 1710839447
2025-11-07 01:22:09.907403 IPSec1 in 172.20.1.1.55608 -> 192.168.64.43.22: syn 312950751
2025-11-07 01:22:11.647050 IPSec2 in 172.20.2.1.55371 -> 192.168.64.43.22: syn 1710839447
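The flow trace that follows was collected with the standard FortiOS debug flow commands; the exact commands used are not shown in the original capture, so the filter address and trace count below are illustrative:

diagnose debug flow filter addr 192.168.64.43
diagnose debug flow show function-name enable
diagnose debug flow trace start 100
diagnose debug enable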
The flow debug capture shows that traffic from the IPSec tunnel enters the FortiGate without issue; however, the FortiGate does not perform DNAT when FGT01 holds the primary role:
2025-11-07 01:06:20 id=65308 trace_id=98 func=print_pkt_detail line=6005 msg="vd-root:0 received a packet(proto=6, 172.20.1.1:54955->192.168.64.43:22) tun_id=x.x.x.x from IPSec1. flag [S], seq 707309299, ack 0, win 62727"
2025-11-07 01:06:20 id=65308 trace_id=98 func=ipsec_spoofed4 line=243 msg="src ip 172.20.1.1 match selector 0 range 0.0.0.0-255.255.255.255"
2025-11-07 01:06:20 id=65308 trace_id=98 func=init_ip_session_common line=6204 msg="allocate a new session-009af359"
2025-11-07 01:06:20 id=65308 trace_id=98 func=iprope_dnat_check line=5481 msg="in-[IPSec1], out-[]"
2025-11-07 01:06:20 id=65308 trace_id=98 func=iprope_dnat_tree_check line=824 msg="len=1"
2025-11-07 01:06:20 id=65308 trace_id=98 func=__iprope_check_one_dnat_policy line=5346 msg="checking gnum-100000 policy-16064"
2025-11-07 01:06:20 id=65308 trace_id=98 func=iprope_dnat_check line=5506 msg="result: skb_flags-02000008, vid-16064, ret-no-match, act-accept, flag-00000100"
2025-11-07 01:06:20 id=65308 trace_id=98 func=__vf_ip_route_input_rcu line=1989 msg="find a route: flag=80000000 gw-0.0.0.0 via root"
2025-11-07 01:06:20 id=65308 trace_id=98 func=iprope_access_proxy_check line=458 msg="in-[IPSec1], out-[], skb_flags-02000008, vid-16064"
2025-11-07 01:06:20 id=65308 trace_id=98 func=__iprope_check line=2404 msg="gnum-100017, check-ffffffffa002cb97"
...
2025-11-07 01:06:20 id=65308 trace_id=98 func=__iprope_check_one_policy line=2140 msg="checked gnum-10000e policy-4294967295, ret-no-match, act-accept"
...
2025-11-07 01:06:20 id=65308 trace_id=98 func=__iprope_check_one_policy line=2374 msg="policy-4294967295 is matched, act-drop"
2025-11-07 01:06:20 id=65308 trace_id=98 func=__iprope_check line=2421 msg="gnum-10000e check result: ret-matched, act-drop, flag-00000000, flag2-00000000"
2025-11-07 01:06:20 id=65308 trace_id=98 func=iprope_policy_group_check line=4903 msg="after check: ret-matched, act-drop, flag-00000000, flag2-00000000"
2025-11-07 01:06:20 id=65308 trace_id=98 func=__iprope_check line=2404 msg="gnum-10000f, check-ffffffffa002cb97"
...
2025-11-07 01:06:21 id=65308 trace_id=98 func=__iprope_check_one_policy line=2374 msg="policy-4294967295 is matched, act-drop"
2025-11-07 01:06:21 id=65308 trace_id=98 func=__iprope_check line=2421 msg="gnum-10000f check result: ret-matched, act-drop, flag-00000800, flag2-00000000"
2025-11-07 01:06:21 id=65308 trace_id=98 func=iprope_policy_group_check line=4903 msg="after check: ret-matched, act-drop, flag-00000800, flag2-00000000"
2025-11-07 01:06:21 id=65308 trace_id=98 func=fw_local_in_handler line=620 msg="iprope_in_check() check failed on policy 0, drop"
Further investigation shows that the real server is marked as down on the FortiGate when FGT01 takes the primary role:
FGTVM-Connect # diagnose firewall vip realserver list
alloc=2
------------------------------
vf=0 name=VS_UBUNTU/1 class=4 type=0 192.168.64.43:(22-22), protocol=6
        total=1 alive=0 power=0 ptr=536555
        ip=192.168.50.177-192.168.50.177/22 adm_status=0 holddown_interval=300 max_connections=0 weight=1 option=01
        alive=0 total=1 enable=00000001 alive=00000000 power=0
        src_sz=0
        id=0 status=down ks=0 us=0 events=14 bytes=0 rtt=0
Checking the health check (ldb-monitor) configured for the VIP shows that a source IP was explicitly set:
config firewall ldb-monitor
    edit "Ubuntu"
        set type tcp
        set port 22
        set src-ip 10.0.0.86
    next
end
The IP 10.0.0.86 is the local gateway address of FGT02's GRE tunnel. Because the monitor is synchronized across the HA cluster, FGT01 also sources its health-check probes from 10.0.0.86, an address it does not own, so the probes fail and the real server is marked down. Removing the src-ip setting from the ldb-monitor configuration resolves the issue.
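One minimal way to apply the fix, using the monitor name from the configuration above:

config firewall ldb-monitor
    edit "Ubuntu"
        unset src-ip
    next
end

With src-ip unset, the probe is sourced from the egress interface address of whichever unit is currently the primary, and diagnose firewall vip realserver list should again report status=up once the health check succeeds.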
It is also recommended to use a single GRE tunnel interface per unit and to exclude system.gre-tunnel from HA synchronization (via a vdom-exception entry) so that each unit keeps its own local and remote gateways, allowing seamless failover. The resulting per-unit configuration is shown below:
FGTVM-Connect-2 # show system gre-tunnel
config system gre-tunnel
    edit "tgwc"
        set interface "port2"
        set remote-gw 1.0.0.69
        set local-gw 10.0.0.86
    next
end

FGTVM-Connect # show system gre-tunnel
config system gre-tunnel
    edit "tgwc"
        set interface "port2"
        set remote-gw 1.0.0.68
        set local-gw 10.0.0.21
    next
end
FGTVM-Connect # show system vdom-exception
config system vdom-exception
    edit 1
        set object system.interface
    next
    edit 2
        set object router.static
    next
    edit 3
        set object router.bgp
    next
    edit 4
        set object system.gre-tunnel
    next
end
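Objects listed under vdom-exception are excluded from HA synchronization, so the two units can hold different gre-tunnel settings while the cluster still reports as in sync. As an optional sanity check after the change, the HA checksums can be compared on both members:

diagnose sys ha checksum show
diagnose sys ha checksum cluster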