VIP/IP-Pools stops working - ARP issue? 800C HA, A-A, 5.2.13
An odd error - a lot of services suddenly went offline yesterday evening at a client's datacenter. Almost nothing involving NAT worked. Most of the VIPs were dead - and the logs are empty! No traffic! (Lots of users, webpages etc., incoming traffic 24/7.) Failing over to the other FW makes it work for a while. Same with reboots. Editing the VIP, e.g. changing the public IP and then saving, might make it work for a while. The same with IP pools - changing the pool in any way makes it work, for a while. The only outgoing NAT that works reliably is the interface address. All virtual addresses are totally unreliable. No strange traffic or load of any kind.
The ISP has no problems with routing, the prefixes are advertised, and we did a failover to the backup router (VRRP/BGP) located in another DC - same problem. Other VDOMs have internet access and SNAT/DNAT too, and they work. Other equipment (VPN concentrator etc.) works flawlessly, so I think the ISP side of things is OK. Switches are OK.
Yes, FortiAnalyzer is deployed, with logs going some 120 days back. But there is nothing for the VIPs when they go offline. That's why we thought this might be an ISP issue with ARP in the on-premise router. It certainly looks like no traffic is reaching the Fortigate.
Outgoing NATed traffic shows up as timeouts. Very weird that it seems to work when you change the IP pool. And it is random - an IP pool that worked earlier might be dead the next time you try. Although the high numbers in the public /24 we use seem to work better than the low ones. How about that?!
Opened a case w. TAC. Customer is going to get a bunch of new Fortigates soonish (the 800c cluster is closing in on 5 years), but it would be grand if we could keep the old ones alive for some 4 months more...
I can feel your pain.. data-plane issues are really nasty, especially when you can only control the devices on one side!
Anyway, from your description it seems to me the problem is in MAC table (re)learning and/or ARP gone snafu, blackholing your traffic, maybe:
1) at the "wan edge" (switches and, indeed, the ISP routers): something does not relearn the FGT interface VMAC for the VIPs beyond the initial GARPs, so when the MAC entry in the CAM ages out (typically 300 s) or the ARP entry expires, you have a blackhole until a new FGT ARP response or a GARP (triggered by updating the VIP, a failover, or a reboot) repopulates the MAC and/or ARP tables
2) on the FGT itself: it stops replying to ARP requests for the VIPs, so on the upstream devices the VIP MAC and ARP entries simply age out
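If you can reach the "wan edge" switches, hypothesis 1 is quick to check by comparing the aging timers against what's actually in the tables; a sketch assuming Cisco-style IOS gear (the VMAC and VIP address are placeholders, and your platform's commands may differ):

```
! Assumption: Cisco-style edge switch; adjust for your platform
show mac address-table aging-time            ! CAM aging, typically 300 s
show mac address-table address <FGT-VMAC>    ! is the FGT VMAC still learned, on the right port?
show ip arp <VIP-public-IP>                  ! is the ARP entry present, and how old?
```

If the MAC entry is gone while the ARP entry is still alive (or vice versa), you know which table is blackholing the traffic.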
This could happen for the most disparate reasons:
1) at the "wan edge": wrong ARP inspection/checking behaviour, continuous premature CAM flushes due to STP TCNs, BUM flooding, proxy ARP, etc.
2) on the FGT: "internal errors" ^_^ that are preventing correct ARP responses for the VIPs
From your first post I can infer that you have a DCI, maybe with stretched VLANs, so there we have another source of potential issues related to the DCI tech (VPLS, VXLAN, OTV, EVPN).
My humble suggestions are:
1) take a snapshot (interface packet capture, show arp, show mac) on both sides (i.e. FGT and switches, and maybe on the edge routers if the ISP is cooperative) when things are working and when they are not, and compare the two
2) try arp-pinging the VIPs when things are not working, to see if the problem is on the FGT side
3) see whether gratuitous-arp-interval != 0 solves/mitigates the problem on the FGT
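On the FGT side, suggestions 1) and 3) could look roughly like this - a sketch from memory of the 5.2-era CLI, so verify the exact syntax on your build; <vip-name> and port1 are placeholders:

```
# Suggestion 1: snapshot while working vs. broken (inside the affected VDOM)
diagnose sniffer packet port1 'arp' 4    # watch ARP requests/replies on the WAN port
get system arp                           # the FGT's own ARP table

# Suggestion 3: periodic GARPs for the VIP
config firewall vip
    edit <vip-name>
        set gratuitous-arp-interval 60   # seconds; 0 (the default) disables it
    next
end
```

If the sniffer shows ARP requests for the VIP arriving but no replies leaving, the problem is on the FGT; if no requests even arrive, look upstream.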
Good work and best regards,
DCI is just regular VLANs, no overlays of any kind. The behaviour is exactly the same on the Fortigates in both DCs.
One thing we haven't done is restarting the core switches both firewalls are connected to. The core switches are in a virtual-chassis setup, so they behave as one unit. Only L2 towards the ISP though. And all logs in core are looking good. And the rest of the vdoms are working as they should.
Will try to change the gratuitous arp setting on a few vips to see if it changes anything.
I read your update too late..
so forcing the ARP reply binding to a specific interface seems to resolve the issue..
the most interesting thing is "set src-nat-vip enable", since it's not directly related to ARP request/response, and it makes me scratch my head:
AFAIK, on the FGT, ARP replies for VIPs are by default sent out the interface the request came in on and are not really enforced by "set extif", while the IP pool default behaviour is to reply on every interface requests come in from; this is handy for hairpinning but easily misleading when you have multiple WANs and a "dumb shared edge segment", or when using SD-WAN on 5.6.x, where I learned the hard way to use "set associated-interface"..
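For reference, the interface-binding knobs being discussed, as I remember them from the FGT CLI (option names may vary by release, and <pool-name>/port1 are placeholders):

```
config firewall ippool
    edit <pool-name>
        set arp-reply enable    # answer ARP for the pool addresses (the default)
        set arp-intf port1      # ..but only on this interface, instead of all of them
    next
end
```

Pinning the reply interface like this removes the "replies on every interface" ambiguity, at the cost of breaking hairpin scenarios through other interfaces.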
but on 5.2 the only thing that comes to my mind (well, to my Evernote issues notebook =) is an FGT bug triggered by hairpinning
..or for some reason your FGT is answering VIP/IP-pool ARP requests coming in from different interfaces/VLANs, or the upstream devices are learning the MAC/ARP from a different interface/VLAN than the expected one, meaning there is a subtle "BUM flooding leak" under the covers!
Are you using Cisco gear in the core with VSS and/or vPC at the WAN edge? From time to time I've seen all sorts of strange ARP/MAC bugs (flapping) with VSS/vPC when coupled with non-Cisco LAGs :\
Just out of curiosity: if, on the affected VIPs, you revert "set src-nat-vip enable" and use "set srcintf-filter" instead, do they still work?
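Something like this on one of the affected VIPs, as a sketch (I'm quoting the option names from your post; double-check them against the 5.2.13 CLI reference, since srcintf-filter may not exist on every build, and <vip-name>/port1 are placeholders):

```
config firewall vip
    edit <vip-name>
        set src-nat-vip disable    # revert the workaround
        set srcintf-filter port1   # restrict the VIP to this interface instead
    next
end
```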
Let me know what you find out. I had something similar last night when swapping an ASA to a new VDOM on an A-A pair of 1500Ds. All seemed to work for about 10 minutes, then traffic not on the same inside-interface IP segment stopped reaching the internet.
Could ping everything on the inside, but not the FGT inside IP. The ARP table on the cores looked fine..... we just couldn't ping the FGT IP. Ended up disconnecting the inside/outside interfaces and putting the ASA back.