I've got a really strange issue that we've spent a week on and haven't been able to get anywhere.
Here are the specs:
FortiGate 600C running 5.2.2 in a HA Active-Active
Connected to Cisco 3560X switches with LACP aggregate interfaces
We recently switched from Watchguard to Fortigate firewalls in our web environment. In our web stack, we use NAT Reflection (or NAT Hairpin) to simplify DNS management. So, internal servers (CentOS) call out to external VIP addresses that get NAT'd back to servers on the same subnet. I know that this isn't a great thing to do, but it's worked for years and we're working to change this soon.
As soon as we switched over to the Fortigate, we started getting timeouts with requests that follow that path (App Server>VIP>Web Server). What we are seeing from the App server side is that it sends a SYN, but never gets a SYN-ACK back. The Web Server never receives the packet either. I've done a trace on the Fortigate and I'm pretty sure that the Fortigate does receive the packet, but I'm unable to tell in the trace if it's actually responding correctly. The packet seems to get lost somewhere between the app servers and Web server. The issue is random and it does not seem to increase or decrease based on load. It might happen once every hundred requests or so. When it happens, the App server will eventually retry the request and it sometimes hangs again, but it will eventually go through - sometimes up to 90 seconds later.
Here is another thing that we've noticed. We've started to see Output Drops on the switch interfaces that connect to the Fortigate. As far as we know, this was not happening before, but we did not monitor it before, so it's hard to know. We've changed cables to make sure it's not bad cables. We've also swapped to the secondary switch and see the same thing. One other thing to note is that this does not affect traffic coming into the Web servers from external. It only affects traffic that takes the NAT hairpin loop. I have suspicions that it's one of two things. 1) The Fortigate is performing the NAT and then somehow losing the VLAN tag when it puts the packet back out on the network. This might explain why the switch is dropping the packet if it didn't have a VLAN tag and the switch didn't know what to do with it. 2) It might be some type of MTU issue. Our switches and firewalls are configured with the default MTU of 1500.
We've done so many things at this point that we're just about out of ideas. Here is a list of things we've tried. Some of these are based on findings from these forums.
1. Reboot firewalls
2. Shutdown secondary firewall so it's running in standalone
3. Run on secondary firewall only
4. Moved policy to top of list
5. Disable vlanforward on aggregate interface and vlan interface
6. Swapped cables between firewalls and switches
7. Enabled send-deny-packet on specific policy
8. Set tcp-mss-sender and tcp-mss-receiver to 1380 on specific policy
9. Set tcp-mss to 1380 on vlan interface and aggregate interface
I've tried to include every config I can think of below. I would appreciate help if anyone can think of anything.
config system interface
    edit "aggr.webprod.in"
        set vdom "webprod"
        set type aggregate
        set tcp-mss 1380
        set member "port17" "port18"
        set snmp-index 71
    next
config system interface
    edit "vlan.webprod.in"
        set vdom "webprod"
        set ip 172.XXX.XXX.XXX 255.255.0.0
        set allowaccess ping
        set tcp-mss 1380
        set snmp-index 74
        set secondary-IP enable
        set interface "aggr.webprod.in"
        set vlanid 55
        config secondaryip
            edit 1
                set ip 172.XXX.XXX.XXX 255.255.0.0
                set allowaccess ping
            next
            edit 2
                set ip 172.XXX.XXX.XXX 255.255.0.0
                set allowaccess ping
            next
        end
    next
end

config firewall policy
    edit 58
        set srcintf "zone.webint"
        set dstintf "zone.webint"
        set srcaddr "all"
        set dstaddr "vip.http.aaa" "vip.http.bbb" "vip.http.ccc" "vip.https.aaa" "vip.https.bbb" "vip.https.ccc"
        set action accept
        set schedule "always"
        set service "HTTP" "HTTPS" "DNS"
        set logtraffic all
        set match-vip enable
        set tcp-mss-sender 1360
        set tcp-mss-receiver 1360
        set timeout-send-rst enable
        set nat enable
    next
end

config firewall vip
    edit "vip.http.aaa"
        set extip 67.XXX.XXX.21-67.XXX.XXX.23
        set extintf "any"
        set portforward enable
        set mappedip "172.XXX.XXX.1-172.XXX.XXX.3"
        set extport 80
        set mappedport 80
    next
    edit "vip.http.bbb"
        set extip 67.XXX.XXX.25-67.XXX.XXX.27
        set extintf "any"
        set portforward enable
        set mappedip "172.XXX.XXX.1-172.XXX.XXX.3"
        set extport 80
        set mappedport 80
    next
    edit "vip.http.ccc"
        set extip 67.XXX.XXX.28-67.XXX.XXX.30
        set extintf "any"
        set portforward enable
        set mappedip "172.XXX.XXX.1-172.XXX.XXX.3"
        set extport 80
        set mappedport 80
    next
    edit "vip.https.aaa"
        set extip 67.XXX.XXX.21-67.XXX.XXX.23
        set extintf "any"
        set portforward enable
        set mappedip "172.XXX.XXX.1-172.XXX.XXX.3"
        set extport 443
        set mappedport 443
    next
    edit "vip.https.bbb"
        set extip 67.XXX.XXX.25-67.XXX.XXX.27
        set extintf "any"
        set portforward enable
        set mappedip "172.XXX.XXX.1-172.XXX.XXX.3"
        set extport 443
        set mappedport 443
    next
    edit "vip.https.ccc"
        set extip 67.XXX.XXX.25-67.XXX.XXX.27
        set extintf "any"
        set portforward enable
        set mappedip "172.XXX.XXX.1-172.XXX.XXX.3"
        set extport 443
        set mappedport 443
    next
end

#############Cisco Configuration#############
interface Port-channel10
 description ptn-fw101 webprod-int portchannel
 switchport trunk encapsulation dot1q
 switchport trunk allowed vlan 50,51,55
 switchport mode trunk
end

interface GigabitEthernet0/17
 description fw101 port 17 webprod-int
 switchport trunk encapsulation dot1q
 switchport trunk allowed vlan 50,51,55
 switchport mode trunk
 logging event bundle-status
 logging event spanning-tree
 spanning-tree portfast trunk
 spanning-tree bpdufilter enable
 channel-group 10 mode active
end

interface GigabitEthernet0/18
 description fw101 port 17 webprod-int
 switchport trunk encapsulation dot1q
 switchport trunk allowed vlan 50,51,55
 switchport mode trunk
 logging event bundle-status
 logging event spanning-tree
 spanning-tree portfast trunk
 spanning-tree bpdufilter enable
 channel-group 10 mode active
end
Here is a little more detail. We were able to get some additional traces tonight and determine that the firewall receives the packets as soon as the host sends them.
######## This is what a good packet trace looks like
213.037036 vlan.webprod.in in 172.XXX.XXX.4.52661 -> 67.XXX.XXX.22.80: syn 3512656144
213.037299 vlan.webprod.in out 67.XXX.XXX.22.80 -> 172.XXX.XXX.4.52661: syn 218656840 ack 3512656145
213.037300 aggr.webprod.in out 67.XXX.XXX.22.80 -> 172.XXX.XXX.4.52661: syn 218656840 ack 3512656145
213.037301 port18 out 67.XXX.XXX.22.80 -> 172.XXX.XXX.4.52661: syn 218656840 ack 3512656145
213.037525 vlan.webprod.in in 172.XXX.XXX.4.52661 -> 67.XXX.XXX.22.80: ack 218656841
213.037539 vlan.webprod.in in 172.XXX.XXX.4.52661 -> 67.XXX.XXX.22.80: psh 3512656145 ack 218656841
213.045640 vlan.webprod.in in 172.XXX.XXX.4.52661 -> 67.XXX.XXX.22.80: fin 3512656416 ack 218668089
213.045882 vlan.webprod.in out 67.XXX.XXX.22.80 -> 172.XXX.XXX.4.52661: fin 218668089 ack 3512656417
213.045883 aggr.webprod.in out 67.XXX.XXX.22.80 -> 172.XXX.XXX.4.52661: fin 218668089 ack 3512656417
213.045884 port18 out 67.XXX.XXX.22.80 -> 172.XXX.XXX.4.52661: fin 218668089 ack 3512656417
######## This is what a bad packet trace looks like
120.071166 vlan.webprod.in in 172.XXX.XXX.4.52661 -> 67.XXX.XXX.22.80: syn 3512656144
123.069771 vlan.webprod.in in 172.XXX.XXX.4.52661 -> 67.XXX.XXX.22.80: syn 3512656144
129.067483 vlan.webprod.in in 172.XXX.XXX.4.52661 -> 67.XXX.XXX.22.80: syn 3512656144
141.063126 vlan.webprod.in in 172.XXX.XXX.4.52661 -> 67.XXX.XXX.22.80: syn 3512656144
165.054426 vlan.webprod.in in 172.XXX.XXX.4.52661 -> 67.XXX.XXX.22.80: syn 3512656144
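For anyone who wants to scan a longer capture for this pattern, here is a rough Python sketch that picks out repeated inbound SYNs from sniffer text in the format shown above and prints the gaps between them. The regex, the function names (parse_trace, syn_retransmissions) and the BAD_TRACE constant are my own for illustration, not part of any Fortinet tool, and the line format is assumed to match the traces in this post exactly.

```python
import re
from collections import defaultdict

# Assumed line shape, taken from the traces in this thread:
# "<time> <iface> <in|out> <src.port> -> <dst.port>: <flag> <seq> [ack <n>]"
LINE_RE = re.compile(
    r"^(?P<ts>\d+\.\d+)\s+(?P<iface>\S+)\s+(?P<dir>in|out)\s+"
    r"(?P<src>\S+)\s+->\s+(?P<dst>\S+):\s+(?P<flag>\w+)\s+(?P<seq>\d+)"
    r"(?:\s+ack\s+(?P<ack>\d+))?\s*$"
)


def parse_trace(text):
    """Turn sniffer text into a list of dicts; lines that do not match are skipped."""
    records = []
    for line in text.splitlines():
        m = LINE_RE.match(line.strip())
        if m:
            rec = m.groupdict()
            rec["ts"] = float(rec["ts"])
            records.append(rec)
    return records


def syn_retransmissions(records):
    """Group inbound bare SYNs by (src, dst, seq) and return the gaps between repeats."""
    seen = defaultdict(list)
    for r in records:
        if r["flag"] == "syn" and r["ack"] is None and r["dir"] == "in":
            seen[(r["src"], r["dst"], r["seq"])].append(r["ts"])
    return {
        key: [round(b - a, 1) for a, b in zip(times, times[1:])]
        for key, times in seen.items()
        if len(times) > 1
    }


BAD_TRACE = """\
120.071166 vlan.webprod.in in 172.XXX.XXX.4.52661 -> 67.XXX.XXX.22.80: syn 3512656144
123.069771 vlan.webprod.in in 172.XXX.XXX.4.52661 -> 67.XXX.XXX.22.80: syn 3512656144
129.067483 vlan.webprod.in in 172.XXX.XXX.4.52661 -> 67.XXX.XXX.22.80: syn 3512656144
141.063126 vlan.webprod.in in 172.XXX.XXX.4.52661 -> 67.XXX.XXX.22.80: syn 3512656144
165.054426 vlan.webprod.in in 172.XXX.XXX.4.52661 -> 67.XXX.XXX.22.80: syn 3512656144
"""

gaps = syn_retransmissions(parse_trace(BAD_TRACE))
for key, deltas in gaps.items():
    print(key, deltas)
```

On the bad trace above, the gaps come out at roughly 3, 6, 12 and 24 seconds, which looks like the client doubling its SYN retry timer each time. If the next retry followed the same doubling, it would land around 93 seconds after the first SYN, which would be in line with the delays of up to 90 seconds we've seen.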
Hello,

"Randomly doesn't work" suggests that the configuration is not the problem here. However, I would like to know whether you have dual ISPs, and I would like to see the routing table both when the issue occurs and when things are working, using the command "get router info routing-table database". May I know the IP address used in the sniffer filter?

Also, please get the output of the debug flow commands, which show what the Fortigate is doing with a specific request and the reason for dropping it (if it does):

diag debug reset
diag debug disable
diag debug enable
diag debug flow filter saddr x.x.x.x --->> Source address from where the connection is initiated (if you do not have too many connections to the server during the test, I recommend using the filter 'daddr' with the server IP)
diag debug flow filter dport 80
diag debug flow show console enable
diag debug console timestamp enable
diag debug flow trace start 100

NOTE:
- Once the commands are run, try to access the server
- Once you get the output captured, you can disable the debug with the command #diag debug disable
Cheers!
Thanks, I'll try to upload that ASAP.
We discovered one new thing today after sniffing packets. We have 8 application servers that sit behind the Fortigate and they all could be sending many requests up through these VIPs at any given time. We are able to identify the timeout issue easily in Wireshark because when it happens we get a "TCP Port numbers reused" followed by several "TCP Retransmission". What it looks like is that multiple application servers are sending requests around the same time with the same source port.
I could be totally off here, but it seems like the Fortigate is having trouble processing that correctly from a NAT standpoint. This might explain why it only affects Internal>FW>Internal traffic rather than WAN>FW>Internal since traffic from the WAN side would always be coming from a different IP.
We think we finally have this one fixed. We created a Dynamic IP Pool with 100 IP addresses and chose that IP pool on the policy rather than "Use Outgoing Interface Address". We only enabled this IP pool for the policy for Internal>FW>Internal policy and not for WAN>FW>Internal policy.
As soon as we made this change, the timeouts stopped. The only thing I can determine is that there is a bug in the Fortigate where it cannot properly handle this scenario when there are several (we have 8) internal hosts using the VIP. The Watchguard firewalls that we had in place before did not have this problem, and the firewall was the only thing that changed in our setup.
Just to summarize - the issue occurs in a NAT Reflection scenario where there are multiple internal servers sending traffic to a VIP that forwards traffic back to internal servers on the same subnet. Eventually, multiple servers will send a request from the same source port number within a few seconds of each other, and that can cause the second request to time out. When following a trace, we can see that the server sends a SYN packet that appears to reach the FW, but no SYN-ACK is ever returned. We will then see multiple SYN retransmissions until the request finally times out or goes through.
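To make the theory concrete, here is a toy Python model of it. This is only a sketch of my hypothesis, not confirmed Fortigate behavior: it deliberately assumes the source NAT keeps the original source port, which a real firewall would normally not rely on. The function names (post_nat_tuples, collisions) and all IP addresses are made up for illustration; the addresses come from the documentation ranges.

```python
from collections import Counter


def post_nat_tuples(flows, nat_pool):
    """Apply a toy source NAT to (src_ip, src_port, dst_ip, dst_port) flows.

    ASSUMPTION: the source port is preserved and each internal host is pinned
    to one address from nat_pool in round-robin order of first appearance.
    """
    assigned = {}
    translated = []
    for src_ip, src_port, dst_ip, dst_port in flows:
        if src_ip not in assigned:
            assigned[src_ip] = nat_pool[len(assigned) % len(nat_pool)]
        translated.append((assigned[src_ip], src_port, dst_ip, dst_port))
    return translated


def collisions(translated):
    """Return the post-NAT tuples that are shared by more than one flow."""
    counts = Counter(translated)
    return [t for t, n in counts.items() if n > 1]


# Two hypothetical app servers that happen to pick the same source port
# and hit the same web server at about the same time.
flows = [
    ("192.0.2.4", 52661, "192.0.2.1", 80),
    ("192.0.2.5", 52661, "192.0.2.1", 80),
]

# "Use Outgoing Interface Address": every flow shares one NAT address.
single = collisions(post_nat_tuples(flows, ["192.0.2.254"]))

# Dynamic IP pool: the hosts land on different NAT addresses.
pooled = collisions(post_nat_tuples(flows, ["198.51.100.1", "198.51.100.2"]))

print("single address collisions:", single)
print("ip pool collisions:", pooled)
```

With a single NAT address the two flows end up with an identical tuple after translation, so the web server (or the firewall's session table) cannot tell them apart. With the pool they stay distinct, which would be consistent with the timeouts stopping once we enabled the IP pool.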
Did you ever hear back from Fortinet on this? This is still an issue on 5.2.7.