kallbrandt
Contributor II

VIP/IP-Pools stops working - ARP issue? 800C HA, A-A, 5.2.13

Hello,

An odd error: a lot of services suddenly went offline yesterday evening at a client's datacenter. Almost nothing involving NAT worked. Most of the VIPs were dead, and the logs are empty! No traffic! (Lots of users, web pages etc. Incoming traffic 24/7.) Failing over to the other firewall makes it work for a while. Same with reboots. Editing a VIP, such as changing the public IP and then saving, might make it work for a while. The same goes for IP pools: changing the pool in any way makes it work, for a while. The only outgoing NAT that actually works all the time is the interface address. All virtual addresses are totally unreliable. No strange traffic or load of any kind.

The ISP has no problems with routing, the prefixes are advertised, and we did a failover to the backup router (VRRP/BGP) that's located in another DC. Same problem. Other VDOMs also have internet access and SNAT/DNAT, and they work. Other equipment (VPN concentrator etc.) works flawlessly, so I think the ISP side of things is OK. Switches are OK.

execute clear system arp table

This did actually work a few times.

Any ideas, gentlemen? A bit lost with this one...

(Will open a high-priority case with TAC.)

Richie

NSE7

14 REPLIES
ericli_FTNT
Staff

Hi Richie, a device failure that leaves no logs behind is never good.

Did you double-check the log settings? Do you have a central logging device such as FortiAnalyzer deployed? Your case is critical for us. Please keep us updated. Thanks!

kallbrandt

Yes, FortiAnalyzer is deployed. We have logs going back some 120 days, but nothing for the VIPs when they go offline. That's why we thought this might be an ISP issue with ARP in the on-premise router. It certainly looks like no traffic is reaching the Fortigate.

Outgoing NATed traffic shows up as timeouts. Very weird that it seems to work when you change the IP pool. And it is random: an IP pool that worked earlier might be dead the next time you try. Although high numbers in the public /24 we use seem to work better than the low ones. How about that?!

Richie

NSE7

kallbrandt

Opened a case with TAC. The customer is going to get a bunch of new Fortigates soonish (the 800C cluster is closing in on 5 years), but it would be grand if we could keep the old ones alive for another 4 months or so...

Richie

NSE7

kallbrandt

Update:

If I set all VIPs to bidirectional NAT (set src-nat-vip enable), they start to work.

And if I map the non-working IP pools to the interface (set arp-intf xxx), they start to work.

So it is an ARP issue of some sort.
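
For readers following along, the IP pool part of this workaround might look roughly like the sketch below. The pool name "example-pool" and the interface "port1" are placeholders, not values from this setup, and on a multi-VDOM unit the commands would be entered inside the relevant VDOM.

```
config firewall ippool
    edit "example-pool"
        # "example-pool" and "port1" are placeholders for the real pool and WAN-facing interface
        set arp-reply enable
        set arp-intf "port1"
    next
end
```

With arp-intf set, the unit should answer ARP requests for the pool addresses only on that interface instead of on any interface.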

Richie

NSE7

Antonio_Milanese

Hello Richie,

I can feel your pain. Data plane issues are really nasty, especially when you control only one side of the devices involved!

Anyway, from your description it seems to me that the problem is in MAC table (re)learning and/or ARP gone SNAFU, blackholing your traffic. Maybe:

1) On the "WAN edge" (switches and indeed the ISP routers), something does not relearn the FGT interface VMAC for the VIPs outside of the initial GARPs. So when the MAC entry in the CAM table ages out (typically 300 s), or the ARP entry ages out (?), you have a blackhole until a new FGT ARP response or a GARP (triggered by updating the VIP, a failover or a reboot) properly repopulates the MAC and/or ARP tables.
2) The FGT itself is not replying to ARP requests for the VIPs, so on the upstream devices the VIP MAC and ARP entries simply age out.

This could be for the most disparate reasons:

1) At the "WAN edge": wrong ARP inspection/check behaviour, continuous premature CAM flushes due to STP TCNs, BUM flooding, proxy ARP, etc.
2) At the FGT: "internal errors" ^_^ that prevent correct ARP responses for the VIPs.

From your first post I infer that you have a DCI, maybe with stretched VLANs, so here we have another source of potential issues related to the DCI tech (VPLS, VXLAN, OTV, EVPN).

My humble suggestions are:

1) Take a snapshot (interface packet capture, show arp, show mac) on both sides (i.e. FGT, switches and, if the ISP is cooperative, the edge routers) when things are working and when they are not, and compare the two.
2) Try to arp-ping the VIPs when things are not working, to see if the problem is on the FGT side.
3) See whether gratuitous-arp-interval != 0 solves/mitigates the problem on the FGT.

Good work and best regards,
Antonio
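
On the FortiGate side, suggestions 1) and 3) could be sketched roughly as below. This is only an illustration: "port1", "example-vip" and the 30-second interval are placeholders rather than values from this thread, and on a multi-VDOM unit the commands would be run inside the affected VDOM.

```
# Snapshot of the FortiGate ARP table; take one while working and one while broken
get system arp

# Watch ARP requests and replies on the WAN-facing interface ("port1" is a placeholder)
diagnose sniffer packet port1 'arp' 4

# Send periodic gratuitous ARPs for one VIP ("example-vip" and 30 are placeholders)
config firewall vip
    edit "example-vip"
        set gratuitous-arp-interval 30
    next
end
```

If the sniffer shows upstream ARP requests for a VIP address with no reply from the unit, that points at the FGT. If no requests arrive at all, the upstream tables are the more likely suspect.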

kallbrandt

Thank you for your response!

Good suggestions!

The DCI is just regular VLANs, no overlays of any kind. Behaviour is exactly the same on the Fortigates in both DCs.

One thing we haven't done is restart the core switches that both firewalls are connected to. The core switches are in a virtual-chassis setup, so they behave as one unit. They are only L2 towards the ISP, though. All logs in the core look good, and the rest of the VDOMs are working as they should.

Will try changing the gratuitous ARP setting on a few VIPs to see if it changes anything.

Again, thank you!

Richie

NSE7

Antonio_Milanese

Hello Richie,

I read your update too late. So forcing the ARP reply binding to a specific interface seems to resolve the issue. The most interesting thing is "set src-nat-vip enable", since it is not directly related to ARP request/response, and it has me scratching my head.

AFAIK, on the FGT, ARP replies for VIPs are by default sent on the interface the request arrived on and are not really enforced by "set extintf", and the IP pool default behaviour is to reply on whichever interface the request comes from. This is handy with hairpinning but easily misleading when you have multiple WANs and a "dumb shared edge segment", or when using SD-WAN on 5.6.x, where I learned the hard way to use "set associated-interface".

But on 5.2 the only things that come to my mind (well, to my Evernote issues notebook =) are:

1) an FGT bug triggered by hairpinning: http://kb.fortinet.com/kb....do?externalID=FD37124
2) or, for some reason, your FGT answers VIP/IP pool ARP requests coming from different interfaces/VLANs, or the upstream devices are learning MAC/ARP entries from interfaces/VLANs other than the expected one, meaning that there is a subtle "BUM flooding leak" under the covers!

Are you using Cisco gear in the core, with VSS and/or vPC at the WAN edge? From time to time I've seen all sorts of strange ARP/MAC bugs (flapping) with VSS/vPC when coupled with non-Cisco LAGs :\

Just out of curiosity: if you revert "set src-nat-vip enable" on the affected VIPs and use "set srcintf-filter" instead, do they still work?

Regards,
Antonio

kallbrandt

Yes, agreed, src-nat-vip shouldn't really be related to an ARP issue.

No Cisco equipment here, only Alcatel-Lucent 6900/6860 in the core.

Will set a few VIPs/pools back to their original settings during the night and be in very early to test.

Must check first-hand what's going on in the "internet-vlan" with Wireshark.

Again, thank you for your input. Highly appreciated!

Richie

NSE7

piacas
New Contributor III

Let me know what you find out. I had something similar last night when swapping an ASA for a new VDOM on A/A 1500Ds. Everything seemed to work for about 10 minutes, then traffic from hosts not on the same IP segment as the inside interface stopped reaching the internet.

Could ping everything on the inside except the FGT inside IP. The ARP table on the cores looked fine... just couldn't ping the FGT IP. Ended up disconnecting the inside/outside interfaces and putting the ASA back.

Opened a ticket, waiting to hear back.
