Fortinet Forum
Eric_Lackey
New Contributor III

HTTP request timeouts when going through Virtual IP (NAT Reflection, NAT Hairpin)

I've got a really strange issue that we've spent a week on without getting anywhere.

 

Here are the specs:

FortiGate 600C running 5.2.2 in an HA Active-Active cluster

Connected to Cisco 3560X switches with LACP aggregate interfaces

 

We recently switched from WatchGuard to FortiGate firewalls in our web environment. In our web stack, we use NAT Reflection (or NAT Hairpin) to simplify DNS management. So, internal servers (CentOS) call out to external VIP addresses that get NAT'd back to servers on the same subnet. I know that this isn't a great thing to do, but it's worked for years and we're working to change it soon.

 

As soon as we switched over to the FortiGate, we started getting timeouts on requests that follow that path (App Server > VIP > Web Server). What we see from the App server side is that it sends a SYN but never gets a SYN-ACK back. The Web server never receives the packet either. I've done a trace on the FortiGate and I'm pretty sure the FortiGate does receive the packet, but I'm unable to tell from the trace whether it's actually responding correctly. The packet seems to get lost somewhere between the App servers and the Web server. The issue is random, and its frequency does not seem to increase or decrease with load. It might happen once every hundred requests or so. When it happens, the App server will eventually retry the request; the retry sometimes hangs again, but it will eventually go through - sometimes up to 90 seconds later.

 

Here is another thing we've noticed. We've started to see Output Drops on the switch interfaces that connect to the FortiGate. As far as we know, this was not happening before, but we were not monitoring it then, so it's hard to say. We've changed cables to rule out bad cables, and we've also swapped to the secondary switch and see the same thing. One other thing to note: this does not affect traffic coming into the Web servers from outside. It only affects traffic that takes the NAT hairpin loop. I suspect it's one of two things: 1) The FortiGate is performing the NAT and then somehow losing the VLAN tag when it puts the packet back out on the network. That might explain why the switch drops the packet, if the packet arrived untagged and the switch didn't know what to do with it. 2) It might be some type of MTU issue. Our switches and firewalls are configured with the default MTU of 1500.
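The MTU suspicion can be sanity-checked with quick arithmetic (a sketch; the header sizes are the standard option-less IPv4 and TCP values, and 1380 is the clamp value referenced later in this post):

```python
# Standard Ethernet MTU and the per-segment headers TCP/IP adds.
MTU = 1500
IP_HEADER = 20   # IPv4 header without options
TCP_HEADER = 20  # TCP header without options

# Largest TCP payload a 1500-byte MTU can carry.
mss = MTU - IP_HEADER - TCP_HEADER
print(mss)  # 1460

# Clamping MSS to 1380 leaves extra headroom below the default 1460,
# which is why it is a common test for suspected MTU/encapsulation issues.
headroom = mss - 1380
print(headroom)  # 80
```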

 

 

We've done so many things at this point that we're just about out of ideas. Here is a list of things we've tried; some are based on findings from these forums.

1. Rebooted the firewalls

2. Shut down the secondary firewall so the primary runs standalone

3. Ran on the secondary firewall only

4. Moved the policy to the top of the list

5. Disabled vlanforward on the aggregate interface and VLAN interface

6. Swapped cables between the firewalls and switches

7. Enabled send-deny-packet on the specific policy

8. Set tcp-mss-sender and tcp-mss-receiver to 1380 on the specific policy

9. Set tcp-mss to 1380 on the VLAN interface and aggregate interface
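For anyone retracing steps 5, 7, and 8, the relevant CLI settings look roughly like this (a sketch using the interface and policy names from my config below; exact syntax may vary by firmware version):

config system interface
    edit "aggr.webprod.in"
        set vlanforward disable
    next
end

config firewall policy
    edit 58
        set send-deny-packet enable
        set tcp-mss-sender 1380
        set tcp-mss-receiver 1380
    next
end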

 

I've tried to include every relevant config I can think of below. I'd appreciate any help if anyone can think of anything.

 

config system interface

    edit "aggr.webprod.in"

        set vdom "webprod"

        set type aggregate

        set tcp-mss 1380

        set member "port17" "port18"

        set snmp-index 71

    next

 

 

config system interface

    edit "vlan.webprod.in"

        set vdom "webprod"

        set ip 172.XXX.XXX.XXX 255.255.0.0

        set allowaccess ping

        set tcp-mss 1380

        set snmp-index 74

        set secondary-IP enable

        set interface "aggr.webprod.in"

        set vlanid 55

            config secondaryip

                edit 1

                    set ip 172.XXX.XXX.XXX 255.255.0.0

                    set allowaccess ping

                next

                edit 2

                    set ip 172.XXX.XXX.XXX 255.255.0.0

                    set allowaccess ping

                next

            end

    next

end

 

config firewall policy

    edit 58

        set srcintf "zone.webint"

        set dstintf "zone.webint"

        set srcaddr "all"

        set dstaddr "vip.http.aaa" "vip.http.bbb" "vip.http.ccc" "vip.https.aaa" "vip.https.bbb" "vip.https.ccc"

        set action accept

        set schedule "always"

        set service "HTTP" "HTTPS" "DNS"

        set logtraffic all

        set match-vip enable

        set tcp-mss-sender 1360

        set tcp-mss-receiver 1360

        set timeout-send-rst enable

        set nat enable

    next

end

 

 

 

 

config firewall vip

    edit "vip.http.aaa"

        set extip 67.XXX.XXX.21-67.XXX.XXX.23

        set extintf "any"

        set portforward enable

        set mappedip "172.XXX.XXX.1-172.XXX.XXX.3"

        set extport 80

        set mappedport 80

    next

    edit "vip.http.bbb"

        set extip 67.XXX.XXX.25-67.XXX.XXX.27

        set extintf "any"

        set portforward enable

        set mappedip "172.XXX.XXX.1-172.XXX.XXX.3"

        set extport 80

        set mappedport 80

    next

    edit "vip.http.ccc"

        set extip 67.XXX.XXX.28-67.XXX.XXX.30

        set extintf "any"

        set portforward enable

        set mappedip "172.XXX.XXX.1-172.XXX.XXX.3"

        set extport 80

        set mappedport 80

    next

    edit "vip.https.aaa"

        set extip 67.XXX.XXX.21-67.XXX.XXX.23

        set extintf "any"

        set portforward enable

        set mappedip "172.XXX.XXX.1-172.XXX.XXX.3"

        set extport 443

        set mappedport 443

    next

    edit "vip.https.bbb"

        set extip 67.XXX.XXX.25-67.XXX.XXX.27

        set extintf "any"

        set portforward enable

        set mappedip "172.XXX.XXX.1-172.XXX.XXX.3"

        set extport 443

        set mappedport 443

    next

    edit "vip.https.ccc"

        set extip 67.XXX.XXX.25-67.XXX.XXX.27

        set extintf "any"

        set portforward enable

        set mappedip "172.XXX.XXX.1-172.XXX.XXX.3"

        set extport 443

        set mappedport 443

    next

end

 

 

#############

Cisco Configuration

#############

 

interface Port-channel10

 description ptn-fw101 webprod-int portchannel

 switchport trunk encapsulation dot1q

 switchport trunk allowed vlan 50,51,55

 switchport mode trunk

end

 

interface GigabitEthernet0/17

 description fw101 port 17 webprod-int

 switchport trunk encapsulation dot1q

 switchport trunk allowed vlan 50,51,55

 switchport mode trunk

 logging event bundle-status

 logging event spanning-tree

 spanning-tree portfast trunk

 spanning-tree bpdufilter enable

 channel-group 10 mode active

end

interface GigabitEthernet0/18

 description fw101 port 18 webprod-int

 switchport trunk encapsulation dot1q

 switchport trunk allowed vlan 50,51,55

 switchport mode trunk

 logging event bundle-status

 logging event spanning-tree

 spanning-tree portfast trunk

 spanning-tree bpdufilter enable

 channel-group 10 mode active

end

Eric_Lackey
New Contributor III

Here is a little more detail. We were able to get some additional traces tonight and determined that the firewall is receiving the packets as soon as the host sends them.

 

 

######## This is what a good packet trace looks like

 

213.037036 vlan.webprod.in in 172.XXX.XXX.4.52661 -> 67.XXX.XXX.22.80: syn 3512656144
213.037299 vlan.webprod.in out 67.XXX.XXX.22.80 -> 172.XXX.XXX.4.52661: syn 218656840 ack 3512656145
213.037300 aggr.webprod.in out 67.XXX.XXX.22.80 -> 172.XXX.XXX.4.52661: syn 218656840 ack 3512656145
213.037301 port18 out 67.XXX.XXX.22.80 -> 172.XXX.XXX.4.52661: syn 218656840 ack 3512656145
213.037525 vlan.webprod.in in 172.XXX.XXX.4.52661 -> 67.XXX.XXX.22.80: ack 218656841
213.037539 vlan.webprod.in in 172.XXX.XXX.4.52661 -> 67.XXX.XXX.22.80: psh 3512656145 ack 218656841
213.045640 vlan.webprod.in in 172.XXX.XXX.4.52661 -> 67.XXX.XXX.22.80: fin 3512656416 ack 218668089
213.045882 vlan.webprod.in out 67.XXX.XXX.22.80 -> 172.XXX.XXX.4.52661: fin 218668089 ack 3512656417
213.045883 aggr.webprod.in out 67.XXX.XXX.22.80 -> 172.XXX.XXX.4.52661: fin 218668089 ack 3512656417
213.045884 port18 out 67.XXX.XXX.22.80 -> 172.XXX.XXX.4.52661: fin 218668089 ack 3512656417

 

######## This is what a bad packet trace looks like

 

120.071166 vlan.webprod.in in 172.XXX.XXX.4.52661 -> 67.XXX.XXX.22.80: syn 3512656144
123.069771 vlan.webprod.in in 172.XXX.XXX.4.52661 -> 67.XXX.XXX.22.80: syn 3512656144
129.067483 vlan.webprod.in in 172.XXX.XXX.4.52661 -> 67.XXX.XXX.22.80: syn 3512656144
141.063126 vlan.webprod.in in 172.XXX.XXX.4.52661 -> 67.XXX.XXX.22.80: syn 3512656144
165.054426 vlan.webprod.in in 172.XXX.XXX.4.52661 -> 67.XXX.XXX.22.80: syn 3512656144

vjoshi_FTNT
Staff

Hello,

Since the issue occurs randomly, the configuration should not be the problem here. However, I would like to know whether you have dual ISPs, and to see the routing table both when the issue occurs and when things are working, using the command:

get router info routing-table database

May I know the IP address used in the sniffer filter? Also, please capture the output of the debug flow commands, which show what the FortiGate is doing with any specific request and, if it drops the request, the reason:

diag debug reset
diag debug disable
diag debug enable
diag debug flow filter saddr x.x.x.x     --->> Source address from where the connection is initiated (if you do not have too many connections to the server during the test, I recommend using the filter 'daddr' with the server IP instead)
diag debug flow filter dport 80
diag debug flow show console enable
diag debug console timestamp enable
diag debug flow trace start 100

NOTE:
- Once the commands are run, try to access the server.
- Once you have captured the output, disable debugging with the command:

diag debug disable

Cheers!

Eric_Lackey
New Contributor III

Thanks, I'll try to upload that ASAP.

 

We discovered one new thing today after sniffing packets. We have 8 application servers that sit behind the FortiGate, and at any given time they could all be sending many requests up through these VIPs. We can identify the timeout issue easily in Wireshark because, when it happens, we see a "TCP Port numbers reused" entry followed by several "TCP Retransmission" entries. It looks like multiple application servers are sending requests at around the same time with the same source port.

 

I could be totally off here, but it seems like the FortiGate is having trouble handling that correctly from a NAT standpoint. This might explain why it only affects Internal > FW > Internal traffic rather than WAN > FW > Internal, since traffic from the WAN side would always come from a different IP.
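To illustrate the suspicion (a hypothetical sketch, not the FortiGate's actual session logic): after source NAT to a single interface address, the firewall has to keep each translated 5-tuple unique. If two internal clients happen to pick the same ephemeral source port toward the same VIP, the translated sessions collide unless the firewall also rewrites the port. All addresses below are made up.

```python
# Hypothetical sketch of why hairpin NAT behind a single source IP can collide.
# After SNAT, each session is keyed by (src_ip, src_port, dst_ip, dst_port).

NAT_IP = "172.16.55.254"           # the single outgoing interface address
VIP_BACKEND = ("172.16.55.1", 80)  # server behind the VIP

session_table = set()

def translate(client_ip, client_port):
    """SNAT to the shared interface IP without rewriting the source port."""
    key = (NAT_IP, client_port, *VIP_BACKEND)
    if key in session_table:
        return None  # collision: the second SYN cannot be tracked distinctly
    session_table.add(key)
    return key

# Two app servers happen to choose the same ephemeral port at the same time:
first = translate("172.16.55.10", 52661)
second = translate("172.16.55.11", 52661)
print(first is not None, second is None)  # True True
```

An IP pool gives the firewall many source addresses to choose from, so translated tuples stay unique even when client source ports repeat.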

 

 

Eric_Lackey
New Contributor III

We think we finally have this one fixed. We created a dynamic IP pool with 100 IP addresses and selected that pool on the policy rather than "Use Outgoing Interface Address". We enabled this IP pool only on the Internal > FW > Internal policy, not on the WAN > FW > Internal policy.

 

As soon as we made this change, the timeouts stopped. The only thing I can determine is that there is a bug in the FortiGate where it cannot properly handle this scenario when several internal hosts (we have 8) use the VIP. The WatchGuard firewalls we had in place before did not have this problem, and the firewall was the only thing that changed in our setup.

 

Just to summarize: the issue occurs in a NAT Reflection scenario where multiple internal servers send traffic to a VIP that forwards it back to internal servers on the same subnet. Eventually, multiple servers will send a request from the same source port within a few seconds of each other, and that can cause the second request to time out. Following a trace, we can see the server send a SYN packet that appears to make it through the FW to the other server, but no SYN-ACK is ever returned. We then see multiple SYN retransmissions until the request finally times out.
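For anyone who hits the same thing, the change looks roughly like this (a sketch; the pool name is a placeholder and the address range should match your environment):

config firewall ippool
    edit "pool.webprod.snat"
        set startip 172.XXX.XXX.XXX
        set endip 172.XXX.XXX.XXX
    next
end

config firewall policy
    edit 58
        set nat enable
        set ippool enable
        set poolname "pool.webprod.snat"
    next
end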

 

rdy2go

Did you ever hear back from Fortinet on this? This is still an issue on 5.2.7.