journeyman
Contributor

debugging ipsec nat traversal again, was working

Having trouble with 2 out of 6 IPsec tunnels; all were working previously. This continues from my previous post, debugging IPsec with NAT traversal. I have seen emnoc's blog post on debugging and tried to work through it. Comment - at each step, what does a good result look like?

We have a central FGT60C connecting via NAT-T IPsec to 6 FGT60C remotes. The tunnels are interface-based/routed and running OSPF. All remote sites connect via 3G (the tunnel is our backup path to the remote sites; the 3G router is the NAT device). All tunnels are configured from the same template. We previously had an error with the NAT configuration, which has been rectified at all remote sites (link above).

The central unit failed recently and was replaced. All 6 tunnels came back up. Subsequently 3 tunnels failed and have been continuously attempting to establish. One tunnel came up two nights ago and has been up since; I don't know of any event to explain why it came up. So we currently have 4 tunnels up and 2 tunnels down.

What I have checked so far:
- configurations checked
- PSK re-entered anyway (no reason to suspect this is the problem)
- confirmed traffic is sent and received on UDP:500 and UDP:4500 at both ends
- establishment traffic is constant (approx. 1 kbps in both directions for each of the down tunnels)
- packet capture shows most of the traffic is UDP:500, but there is some UDP:4500
- diag vpn ike gateway list shows both tunnels with approximately zero age
- syslog shows that both tunnels do occasionally come up for 20 seconds at a time, at random intervals 2 - 15 minutes apart
- diag debug app ike -1 shows a message being sent from remote:500 to central:4500; this was not allowed in the NAT router config and I have now allowed this traffic (a rough sketch of the debug commands is at the end of this post)

Each attempt to establish the tunnel includes the following traffic.

wan2 -- central.4500 -> remote.4500: udp 112
wan2 -- remote.4500 -> central.4500: udp 80
wan2 -- remote.4500 -> central.4500: udp 304
wan2 -- remote.4500 -> central.500: udp 148
wan2 -- remote.4500 -> central.4500: udp 304
wan2 -- remote.4500 -> central.500: udp 148
wan2 -- remote.4500 -> central.500: udp 148
wan2 -- remote.4500 -> central.4500: udp 96
wan2 -- remote.4500 -> central.4500: udp 304
wan2 -- remote.4500 -> central.500: udp 148
wan2 -- remote.4500 -> central.500: udp 148
wan2 -- remote.4500 -> central.500: udp 148
wan2 -- remote.4500 -> central.4500: udp 96
wan2 -- remote.4500 -> central.500: udp 148
wan2 -- remote.4500 -> central.500: udp 148
wan2 -- remote.4500 -> central.500: udp 148
wan2 -- remote.4500 -> central.500: udp 148
wan2 -- remote.4500 -> central.4500: udp 304
wan2 -- remote.4500 -> central.4500: udp 96
wan2 -- remote.4500 -> central.500: udp 148
wan2 -- remote.4500 -> central.500: udp 148
wan2 -- remote.4500 -> central.500: udp 148
wan2 -- remote.4500 -> central.4500: udp 96
This exchange lasts for 20 seconds and corresponds to the tunnel-up syslog message. Debug at both ends of a tunnel shows identical traffic (nothing is blocked anywhere). The central unit begins the exchange on 4500 but never replies again. I guess the fact that the central never replies on 4500 is important? The remote also tries 4500 -> 500, which seems odd. What should be my next step to fault-find this issue? Any suggestions, please?
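
For reference, a rough sketch of the commands used to collect the above (interface name and addresses are placeholders, roughly this sequence on the central unit):

 diag debug reset
 diag debug application ike -1                       (full IKE debug, as referenced above)
 diag debug enable
 diag vpn ike gateway list                           (shows gateway age/state)
 diag sniffer packet wan2 'host <remote-public-ip> and (port 500 or port 4500)' 4
 diag debug disable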

emnoc
Esteemed Contributor III

To me it sounds like your 3G provider is mangling your IKEv1. Try IKEv2 instead; it has built-in improvements for NAT-T.
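
If it helps, switching an interface-based tunnel to IKEv2 on a FortiGate is roughly the following (tunnel name is a placeholder; change it at both ends):

 config vpn ipsec phase1-interface
     edit "<tunnel-name>"
         set ike-version 2
     next
 end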

PCNSE / NSE / StrongSwan
journeyman

Testing on one particular tunnel, I have changed to IKEv2 at each end and also reset the tunnel at each end. Now I see a proposal match and some evidence of phase 2, but no success overall. Sanitised debug:
central
 ike 0:central: link fail 8 1.1.1.1->2.2.2.2:500 dpd=1
 ike 0:central: DPD fail 8 1.1.1.1->2.2.2.2:500 send failure, resetting ...
 ike 0:central: deleting
 ike 0:central: flushing
 ike 0:central: flushed
 ike 0:central: deleted
 ike 0:central: created DPD triggered connection: 0x1352ef0 8 1.1.1.1->2.2.2.2:500.
 ike 0:central: new connection.
 ike 0:central:central-p2: chosen to populate IKE_SA traffic-selectors
 ike 0:central: no suitable IKE_SA, queuing CHILD_SA request and initiating IKE_SA negotiation
 ike 0:central:2967468: send SA_INIT
 ike 0:central:2967468: sent IKE msg (SA_INIT): 1.1.1.1:500->2.2.2.2:500, len=288
 ike 0:central:2967468: send RETRANSMIT_SA_INIT
 ike 0:central:2967468: sent IKE msg (RETRANSMIT_SA_INIT): 1.1.1.1:500->2.2.2.2:500, len=288
 
 remote
 ike 0: comes 1.1.1.1:500->3.3.3.3:500,ifindex=5....
 ike 0: IKEv2 exchange=SA_INIT id=eb1505b92255baf8/0000000000000000 len=288
 ike 0: in blahblah
 ike 0:remote:827457: detected retransmit
 ike 0: comes 1.1.1.1:500->3.3.3.3:500,ifindex=5....
 ike 0: IKEv2 exchange=SA_INIT id=4f4d37fd23fadde8/0000000000000000 len=288
 ike 0: in blahblah
 ike 0: found remote 3.3.3.3 5 -> 1.1.1.1:500
 ike 0:remote:827458: responder received SA_INIT msg
 ike 0:remote:827458: received notify type NAT_DETECTION_SOURCE_IP
 ike 0:remote:827458: processing NAT-D payload
 ike 0:remote:827458: NAT not detected
 ike 0:remote:827458: process NAT-D
 ike 0:remote:827458: received notify type NAT_DETECTION_DESTINATION_IP
 ike 0:remote:827458: processing NAT-D payload
 ike 0:remote:827458: NAT detected: ME
 ike 0:remote:827458: process NAT-D
 ike 0:remote:827458: incoming proposal:
 ike 0:remote:827458: proposal id = 1:
 ike 0:remote:827458:   protocol = IKEv2:
 ike 0:remote:827458:      encapsulation = IKEv2/none
 ike 0:remote:827458:         type=ENCR, val=AES_CBC (key_len = 128)
 ike 0:remote:827458:         type=INTEGR, val=AUTH_HMAC_SHA_96
 ike 0:remote:827458:         type=PRF, val=PRF_HMAC_SHA
 ike 0:remote:827458:         type=DH_GROUP, val=1024.
 ike 0:remote:827458: matched proposal id 1
 ike 0:remote:827458: responder preparing SA_INIT msg
 ike 0:remote:827458: out blahblah
 ike 0:remote:827458: sent IKE msg (SA_INIT_RESPONSE): 3.3.3.3:4500->1.1.1.1:500, len=288, id=4f4d37fd23fadde8/dca1155051546e0a
 ike 0:remote:827458: IKE SA 4f4d37fd23fadde8/dca1155051546e0a SK_ei 16:CC06340C5F4378F1D03AE8841C5EDE3F
 ike 0:remote:827458: IKE SA 4f4d37fd23fadde8/dca1155051546e0a SK_er 16:56268EA6A1FBF0043DD82110161BA453
 ike 0:remote:827458: IKE SA 4f4d37fd23fadde8/dca1155051546e0a SK_ai 20:993F2887D41B4E4DBBDF33ED808F2C0E67DC538D
 ike 0:remote:827458: IKE SA 4f4d37fd23fadde8/dca1155051546e0a SK_ar 20:6373287AB158D53B68B3A96854E17FBCF514CED8
 ike 0:remote:827452: negotiation timeout, deleting
 ike 0:remote: schedule auto-negotiate
Most settings are default, btw. Any clues in the above, or further suggestions?
emnoc
Esteemed Contributor III

You need to diagnose your 3G NAT device, including packet filters and allowances for ESP (protocol 50). When the phase 2 SAs fail, what do packet captures show? E.g. diag sniffer packet wan1 'src host x.x.x.x and proto 50'. Do that for each tunnel that fails, on both the remote and central side.
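
Roughly, on each FortiGate (interface and peer address are placeholders), while the failing tunnel is trying to come up:

 diag sniffer packet wan1 'src host x.x.x.x and proto 50' 4
 diag sniffer packet wan1 'host x.x.x.x and port 4500' 4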

PCNSE / NSE / StrongSwan
ede_pfau
SuperUser

I'm a bit irritated - if both FGTs are using NAT traversal, there should be no genuine ESP traffic at all, right? NAT-T encapsulates ESP to make it compatible with gateways not supporting/allowing ESP. What I could think of as a possible problem is the MTU of the 3G router. Packets might sometimes be too long and be fragmented or discarded. This would explain why the SPI is sometimes 'malformed'. Other than that, it seems that switching from udp/500 to udp/4500 is the moment when traffic stops. I'd have a look at the router to find out why (4500 is a high port, 500 a well-known port).
Ede Kernel panic: Aiee, killing interrupt handler!
emnoc
Esteemed Contributor III

FWIW, NAT-T has nothing to do with ESP; it is for IKE/ISAKMP. For MTU issues you can confirm the PMTU via a simple ping and by toggling the DF bit.
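
A minimal sketch of that PMTU check from the FortiGate CLI (exact option names can vary slightly by firmware; the peer address is a placeholder):

 execute ping-options df-bit yes
 execute ping-options data-size 1400
 execute ping <remote-public-ip>

Step the data-size down until replies come back; the largest size that gets through (plus IP/ICMP headers) approximates the path MTU.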

PCNSE / NSE / StrongSwan
journeyman

There is no ESP traffic anywhere. I ran
 diag sni pac <interface> 'proto 50'
at both ends and there was no traffic. Note that this filter sniffs all tunnels at the central. I don't believe that there is fragmentation. Using IKEv2 the packets are mostly about 288 bytes and they match at each end. I did not do a fragmentation test. At each end I have run a capture for about 30 minutes:
 diag sni pac <interface> 'host <other ip>' 6
Throughout the capture I did a few pings from the remote end (can't ping the other way due to the NAT). Approximate frame counts are 1000 at the central and 800 at the remote - so there is some frame loss going on. The missing frames seem at first glance to be central > remote, but there is a port forward at the NAT router to handle this, so I'm not yet sure that's the problem. Ping time varies quite a lot too - 130 ms to 3000 ms. Since the FGT capture timestamp is "seconds from the start of the capture", I used a fast ping time to estimate the timestamp delta (about 1 s). This has been subtracted out of one file and a merged pcap file created, which I've started to investigate - but I'm new to comparing captures in Wireshark.
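
For anyone following along, roughly the steps used to build the merged capture, assuming the verbose-6 sniffer output was saved to text files and converted with Fortinet's fgt2eth script (file names and the ~1 s offset come from my own estimate above):

 fgt2eth.pl -in central.txt -out central.pcap
 fgt2eth.pl -in remote.txt -out remote.pcap
 editcap -t -1 remote.pcap remote-shifted.pcap       (shift remote timestamps by about one second)
 mergecap -w merged.pcap central.pcap remote-shifted.pcap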
emnoc
Esteemed Contributor III

Packet loss is bad. High pings are bad. Lack of ESP datagrams is really bad. Can you perform checks from one mobile device to another within that carrier's 3G cloud? The reason I mention this: I did something similar with the SP Orange and found high packet loss between mobile devices. They suspect it's a problem with the GGSN, but who knows.

PCNSE / NSE / StrongSwan
ede_pfau
SuperUser

FWIW, NAT-T has nothing to do with ESP; it is for IKE/ISAKMP.
Lack of ESP datagrams is really bad.
emnoc, you're right (as so often). I've never before really watched VPN traffic via ESP, but now I did. Seeing no ESP packet data with at least one tunnel up is a very bad sign. Even without payload traffic I see keepalive packets in the ESP trace from time to time.
Ede Kernel panic: Aiee, killing interrupt handler!
journeyman

I have got it working and now I have 6 happy IPsec tunnels.

Yes, packet loss is bad and that is what I have concentrated on. In my experience packets are not lost in our telco cloud - often delayed, but not lost. It's usually our own gear that loses things, and so it would seem again. The NAT router seemed OK, so I looked at our cloud access router at the central end. I had thought it allowed the required outbound traffic implicitly, but I made it explicit to be sure. The IKEv1 tunnel came up once the router firewall policies were reset, but the IKEv2 tunnel did not. I then reset the IKEv2 tunnel back to IKEv1 and it also came up. I have not investigated why IKEv2 didn't work.

Regarding ESP, I am somewhat confused. I understand that a working IPsec NAT-T tunnel comprises two layers. The external layer is UDP:4500, which is visible on the physical interface. ESP is the internal layer of the pipe, and is not visible on either the external interface or the IPsec interface? No matter which interface I sniff, I still don't see any ESP packets, which is consistent with the above. Have I got it wrong?

Thanks for your posts, they've made me really think about what's going on. I must say I'm pretty pleased with the merged capture file with correlated timestamps, which highlighted the lost packets and led to the solution.
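
For what it's worth, the traffic that had to be explicitly allowed outbound from the central site was just the IKE/NAT-T pair, UDP 500 and UDP 4500. Purely as an illustration (our cloud access router is a different device and is not necessarily configured this way), the equivalent explicit rule expressed in FortiGate CLI terms would look roughly like this, with placeholder interface names and pre-existing address objects:

 config firewall service custom
     edit "IKE-NATT"
         set udp-portrange 500 4500
     next
 end
 config firewall policy
     edit 0
         set srcintf "internal"
         set dstintf "wan1"
         set srcaddr "central-fgt"
         set dstaddr "remote-3g-routers"
         set action accept
         set schedule "always"
         set service "IKE-NATT"
     next
 end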