Skip to main content
journeyman
New Member
July 16, 2014
Question

debugging ipsec nat traversal again, was working

  • July 16, 2014
  • 7 replies
  • 27962 views

Having trouble with 2 out of 6 ipsec tunnels, all were working previously. Continues from my previous post debugging ipsec with nat traversal. I have seen enmoc' s blog post on debugging and tried to work through it. Comment - at each step, what does a good result look like? We have a central FGT60C connecting via nat-t ipsec to 6 FGT60C remotes. The tunnels are interface based / routed and running ospf. All remote sites connect via 3G (the tunnel is our backup path to the remote sites; the 3G router is the NAT device). All tunnels are configured from the same template. We previously had an error with the NAT configuration which has been rectified at all remote sites (link above). The central unit failed recently and was replaced. All 6 tunnels came back up. Subsequently 3 tunnels failed and have been continuously attempting to establish. One tunnel came up two nights ago and has been up since. I don' t know any event to explain why it came up. So we currently have 4 tunnels up and 2 tunnels down. - configurations checked, - PSK re-entered anyway (no reason to suspect this is the problem), - confirmed traffic sent and received on UDP:500 and UDP:4500 both ends, - establishment traffic is constant (approx 1kbps in both directions for each of the down tunnels), - packet capture shows most of the traffic is UDP:500 but there is some UDP:4500, - diag vpn ike gateway list shows both tunnels with approx zero age, - syslog shows that both tunnels do occasionally come up for 20 seconds at a time, at random intervals between 2 - 15 minutes apart - diag debug app ike -1 shows a message being sent from remote:500 to central:4500. This was not allowed in the NAT router config. I have now allowed this traffic. Each attempt to establish the tunnel includes the following traffic.

wan2 -- central.4500 -> remote.4500: udp 112
wan2 -- remote.4500 -> central.4500: udp 80
wan2 -- remote.4500 -> central.4500: udp 304
wan2 -- remote.4500 -> central.500: udp 148
wan2 -- remote.4500 -> central.4500: udp 304
wan2 -- remote.4500 -> central.500: udp 148
wan2 -- remote.4500 -> central.500: udp 148
wan2 -- remote.4500 -> central.4500: udp 96
wan2 -- remote.4500 -> central.4500: udp 304
wan2 -- remote.4500 -> central.500: udp 148
wan2 -- remote.4500 -> central.500: udp 148
wan2 -- remote.4500 -> central.500: udp 148
wan2 -- remote.4500 -> central.4500: udp 96
wan2 -- remote.4500 -> central.500: udp 148
wan2 -- remote.4500 -> central.500: udp 148
wan2 -- remote.4500 -> central.500: udp 148
wan2 -- remote.4500 -> central.500: udp 148
wan2 -- remote.4500 -> central.4500: udp 304
wan2 -- remote.4500 -> central.4500: udp 96
wan2 -- remote.4500 -> central.500: udp 148
wan2 -- remote.4500 -> central.500: udp 148
wan2 -- remote.4500 -> central.500: udp 148
wan2 -- remote.4500 -> central.4500: udp 96
This exchange lasts for 20 seconds and corresponds to the tunnel up syslog. Debug at both ends of a tunnel shows identical traffic (nothing is blocked anywhere). The central unit begins the exchange on 4500 but never replies again. I guess the fact that the central never relpies on 4500 is important? The remote tries 4500 > 500 which also seems odd. What should be my next step to fault find this issue, any suggestions please?

    7 replies

    emnoc
    New Member
    July 16, 2014
    To me sounds like your 3g provider is mangling up your ikev1. Try ikev2 in stead. It has built in improvements with NAT-t
    journeyman
    New Member
    July 16, 2014
    Testing on one particular tunnel, I have changed to ike v2 each end and also reset the tunnel at each end. Now I see a proposal match and some evidence of ph2 but no success overall. sanitised debug:
    central  ike 0:central: link fail 8 1.1.1.1->2.2.2.2:500 dpd=1  ike 0:central: DPD fail 8 1.1.1.1->2.2.2.2:500 send failure, resetting ...  ike 0:central: deleting  ike 0:central: flushing  ike 0:central: flushed  ike 0:central: deleted  ike 0:central: created DPD triggered connection: 0x1352ef0 8 1.1.1.1->2.2.2.2:500.  ike 0:central: new connection.  ike 0:central:central-p2: chosen to populate IKE_SA traffic-selectors  ike 0:central: no suitable IKE_SA, queuing CHILD_SA request and initiating IKE_SA negotiation  ike 0:central:2967468: send SA_INIT  ike 0:central:2967468: sent IKE msg (SA_INIT): 1.1.1.1:500->2.2.2.2:500, len=288  ike 0:central:2967468: send RETRANSMIT_SA_INIT  ike 0:central:2967468: sent IKE msg (RETRANSMIT_SA_INIT): 1.1.1.1:500->2.2.2.2:500, len=288    remote  ike 0: comes 1.1.1.1:500->3.3.3.3:500,ifindex=5....  ike 0: IKEv2 exchange=SA_INIT id=eb1505b92255baf8/0000000000000000 len=288  ike 0: in blahblah  ike 0:remote:827457: detected retransmit  ike 0: comes 1.1.1.1:500->3.3.3.3:500,ifindex=5....  ike 0: IKEv2 exchange=SA_INIT id=4f4d37fd23fadde8/0000000000000000 len=288  ike 0: in blahblah  ike 0: found remote 3.3.3.3 5 -> 1.1.1.1:500  ike 0:remote:827458: responder received SA_INIT msg  ike 0:remote:827458: received notify type NAT_DETECTION_SOURCE_IP  ike 0:remote:827458: processing NAT-D payload  ike 0:remote:827458: NAT not detected  ike 0:remote:827458: process NAT-D  ike 0:remote:827458: received notify type NAT_DETECTION_DESTINATION_IP  ike 0:remote:827458: processing NAT-D payload  ike 0:remote:827458: NAT detected: ME  ike 0:remote:827458: process NAT-D  ike 0:remote:827458: incoming proposal:  ike 0:remote:827458: proposal id = 1:  ike 0:remote:827458:   protocol = IKEv2:  ike 0:remote:827458:      encapsulation = IKEv2/none  ike 0:remote:827458:         type=ENCR, val=AES_CBC (key_len = 128)  ike 0:remote:827458:         type=INTEGR, val=AUTH_HMAC_SHA_96  ike 0:remote:827458:         type=PRF, val=PRF_HMAC_SHA  ike 0:remote:827458:         type=DH_GROUP, val=1024.  ike 0:remote:827458: matched proposal id 1  ike 0:remote:827458: responder preparing SA_INIT msg  ike 0:remote:827458: out blahblah  ike 0:remote:827458: sent IKE msg (SA_INIT_RESPONSE): 3.3.3.3:4500->1.1.1.1:500, len=288, id=4f4d37fd23fadde8/dca1155051546e0a  ike 0:remote:827458: IKE SA 4f4d37fd23fadde8/dca1155051546e0a SK_ei 16:CC06340C5F4378F1D03AE8841C5EDE3F  ike 0:remote:827458: IKE SA 4f4d37fd23fadde8/dca1155051546e0a SK_er 16:56268EA6A1FBF0043DD82110161BA453  ike 0:remote:827458: IKE SA 4f4d37fd23fadde8/dca1155051546e0a SK_ai 20:993F2887D41B4E4DBBDF33ED808F2C0E67DC538D  ike 0:remote:827458: IKE SA 4f4d37fd23fadde8/dca1155051546e0a SK_ar 20:6373287AB158D53B68B3A96854E17FBCF514CED8  ike 0:remote:827452: negotiation timeout, deleting  ike 0:remote: schedule auto-negotiate
    Most settings are default btw. Any clues in the above, further suggestions?
    emnoc
    New Member
    July 16, 2014
    You need to diagnose your 3g Nat device including and packet filters and allowances for ESP proto 50. Whenbthge ph2-SAs fails what does packet captures show? E.g diag sniffer packet wan1 " src host x.x.x.x. and proto 50" Do that on each tunnel that fails for the remote and central side
    ede_pfau
    SuperUser
    SuperUser
    July 16, 2014
    I' m a bit irritated - if both FGTs are using NAT traversal, there should be no genuine ESP traffic be used at all, right? NAT-T encapsulates ESP to make it compatible with gateways not supporting/allowing ESP. What I could think of as a possible problem is the MTU of the 3G router. Packets might sometimes be too long, be fragmented or discarded. This would explain why sometimes the SPI is ' malformed' . Other than that, it seems that switching from udp/500 to udp/4500 is the moment when traffic traversal stops. I' d have a look at the router to find out why. (4500 is a high port, 500 a well know port).
    emnoc
    New Member
    July 16, 2014
    Fwiw NAT-t has nothing to do with esp but for ike isamp For MTU issues you can confirm the pmtu via a simple ping and by toggling the DF bit.
    journeyman
    New Member
    July 17, 2014
    There is no ESP traffic anywhere. I ran
    diag sni pac <interface> ' proto 50' 
    at both ends and there was no traffic. Note that this filter sniffs all tunnels at the central. I don' t believe that there is fragmentation. Using ikev2 the packet bytes are mostly about 288 and they match each end. I did not do a fragmentation test. At each end I have run a capture for about 30 minutes
    diag sni pac <interface> ' host <other ip>'  6
    Throughout the capture I did a few pings from the remote end (can' t ping the other way due to the NAT). Approx frame counts are 1000 at the central and 800 at the remote - so there is some frame loss going on. The frames missing seem at first glance to be central > remote, but there is a port forward at the NAT router to handle this, so not yet sure the problem. Ping time varies quite a lot too - 130ms to 3000 ms. Since the FGT capture timestamp is " seconds from the start of the capture" I used a fast ping time to estimate the timestamp delta (about 1s). This has been subtracted out of one file and a merged pcap file created which I' ve started to investigate - but I' m new to comparing in wireshark.
    emnoc
    New Member
    July 17, 2014
    Packet loss is bad High pings are bad Lack of esp datagram is really bad Can you perform checks from mobile device to another within that carrier 3g cloud? The reason why I mention this, I ve did something similar using the SP orange and found high PLOS between mobile device to mobile device along.they suspect it' s a problem with the GGSN but who knows.
    ede_pfau
    SuperUser
    SuperUser
    July 17, 2014
    Fwiw NAT-t has nothing to do with esp but for ike isamp
    Lack of esp datagram is really bad
    emnoc, you' re right (as so often). I' ve never before really watched VPN traffic via ESP but now I did. Seeing no ESP packet data with at least one tunnel up is a very bad sign. Even without payload traffic I see keepalive packets in the ESP trace, from time to time.
    journeyman
    New Member
    July 18, 2014
    I have got it working and now I have 6 happy ipsec tunnels. Yes, packet loss is bad and that is what I have concentrated on. In my experience packets are not lost in our telco cloud - often delayed, but not lost. It' s usually our own gear that loses things. And so it would seem again. The NAT router seemed OK so I looked at our cloud access router at the central end. I had thought it allowed the required outbound traffic implicitly but I made it explicit to be sure. The ikev1 tunnel came up once the router firewall policies were reset, but the ikev2 tunnel did not. I then reset the ikev2 back to ikev1 and it also came up. I have not investigated why ikev2 didn' t work. Regarding ESP, I am somewhat confused. I understand that a working ipsec nat tunnel comprises two layers. The external layer is UDP:4500 which is visible on the physical interface. ESP is the internal layer of the pipe, and is not visible to either the external interface or the ipsec interface? No matter which interface I sniff I still don' t see any ESP packets, which is consistent with the above. Have I got it wrong? Thanks for your posts, they' ve made me really think about what' s going on. I must say I' m pretty pleased with the merged capture file with correlated timestamps - which highlighted the lost packets and led to the solution.
    ede_pfau
    SuperUser
    SuperUser
    July 18, 2014
    Glad it' s working. Actually you don' t see ESP packets neither on the ' internal' nor ' tunnelXYZ' interfaces. I found them using the ' any' interface designator showing that the ' ppp1' interface is reporting them (" diag deb ena; diag deb sni pa any ' proto 50' 4' " ). So the FGT does see ESP traffic but not where you would expect it. As for tracing sniffer output, you can have it with full timestamps: diag deb sniffer packet any ' proto 50' 4 999 a options for the third parameter: <tsformat> format of timestamp a: absolute UTC time, yyyy-mm-dd hh:mm:ss.ms otherwise: relative to the start of sniffing, ss.ms
    journeyman
    New Member
    July 18, 2014
    Thank you, adding UTC time format will be very useful, I do a fair bit of sniffing. I try to avoid sniffing the any interface, and didn' t even think of it. While things were not working I saw traffic from the remote with source port 4500 and destination port 500. As mentioned above I have allowed it, but can that be removed? It seems odd. Thanks again
    journeyman
    New Member
    July 20, 2014
    Just to add to the absolute timestamp comment, some tricks I found. The overall syntax is
    diagnose sniffer packet interface ' filter'  [[[verbosity] count] a]
    The last three parameters are optional but must be present in order if desired. A count of 0 means " sniffing continuously (ctrl C interrupts)" - this is the default.