Deploying FortiWeb VMs in HA - Our findings (might be useful, would appreciate feedback)
I'm in the process of rolling out 2 FortiWeb VMs in HA to replace some aging Barracuda Load Balancers. Unfortunately the documentation is a little vague and dispersed, and there is no clear-cut "How To" or "Cookbook" for deploying the VMs in HA compared to other Fortinet products. So I thought I'd share some details that other new customers might find useful.
We're still in the process of configuring, so apologies if some of the terminology or facts are incorrect. I'd really appreciate any feedback on corrections, improvements, problems, or anything that might cause a security concern.
How HA works on FortiWeb - FortiWeb HA is only Active-Passive, and each unit must initially be configured separately, each with its own IP. Once you configure HA, the Secondary unit becomes subordinate to the Main unit: it loses the config you gave it, including its IP, so it can only be managed through the Main unit. This differs from some other manufacturers such as Barracuda, where each device retains the IP you gave it.
Configure HA first or with a clean config - We initially deployed a single VM for a PoC and configured it up; we then decided to move it to production and simply add a second VM and place the pair into HA. When we set up HA, the 2 VMs had a conflict, and the interfaces and VIPs started flip-flopping. After getting no joy from Fortinet support, we factory reset the 2 VMs, and with a clean config HA worked immediately. This probably could have been resolved by tweaking the config, but you can save yourself some time by just setting up HA at the very beginning.
Configure your Virtual Network - We tested various settings, but it turns out the default VMware configuration should be sufficient, though we found we had to set MAC Address Changes to "Accept" on the "VM Network" and the port groups that handle the VIPs. We tried disabling this as an experiment, and the IPs and FortiWebs became unreachable. (Any feedback on this please - does anybody know if this is required, or if we're doing something wrong?)
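If you prefer the ESXi shell to the vSphere client, the same security policy can be set with esxcli. This is just a sketch - "vSwitch0" and the port group name are examples, substitute your own:

```
# Let the FortiWeb HA virtual MAC move between VMs (standard vSwitch).
# "vSwitch0" and "VM Network" are example names - substitute your own.
esxcli network vswitch standard policy security set \
  --vswitch-name=vSwitch0 --allow-mac-change=true --allow-forged-transmits=true

# The same policy can also be overridden per port group:
esxcli network vswitch standard portgroup policy security set \
  --portgroup-name="VM Network" --allow-mac-change=true --allow-forged-transmits=true
```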
Flapping on a member rejoining - We found that when a member was rebooted or was rejoining the group, the IPs appeared to start flapping, which caused long periods (3-5 minutes) of interruption and intermittent ping losses. This appears to be caused by a combination of settings, which I'm still trying to fathom the best values for. What we found was:
- Monitor ALL ports that are handling VIPs and traffic (not just the uplink port); we found unmonitored ports took longer to stabilize.
- Increase the "ARP Packet Numbers"; the default is 3, and we increased this to 10 and then 16 (ensure you understand the consequences of this, though).
- Stop the returning member from attempting to take control so quickly by increasing the time it waits to 60 seconds. This needs to be done through the CLI by running:

    config system ha
      set boot-time 60
    end
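The other two tweaks are also settable under the same CLI tree. A sketch from memory of the FortiGate-style CLI - the exact option names (`monitor`, `arps`) may differ between FortiWeb firmware versions, so check your CLI reference first:

```
config system ha
  set monitor port1 port2   # monitor every traffic-carrying port, not just the uplink
  set arps 16               # gratuitous ARPs sent on failover (default 3)
end
```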
Through doing this we've managed to get HA stable: the flip-flopping was reduced from the mentioned 3-5 minutes of intermittent disruption down to a couple of lost pings.
We discovered a configuration issue on our firewalls: they were not source-NATing outbound traffic from the FortiWebs correctly. I can't see why this would cause flapping on internal subnets, but after fixing it we undid the settings above, and failover has been painless since. I'll leave the notes in, though, as they may be useful to somebody else.
For the flapping, can you share some details of your topology? That is not normal.
It might be a good thing for them to add to the troubleshooting docs, too. Contact firstname.lastname@example.org directly, ask for the FortiWeb writer, and mention that you couldn't find a solution for flapping in the docs.
A lot of it is available, but it is quite scattered, and there are no specific instructions for a VMware environment, which is subtly different (for example, having to configure promiscuous mode or allow MAC address changes).
The topology is flat/simple; the environment is just a few ESXi hosts sitting behind firewalls. Hopefully I can describe it as follows:
Firewall (physical Palo Alto firewalls, soon to be replaced with FortiGate 200Ds):
- 1 uplink port to Web1
- 1 internal port, used as follows:
  - Assigned 10.100.101.1 - dedicated subnet for FortiWeb traffic; public IPs are NAT'd to it, e.g. to 10.100.101.60, which is a VIP on the FortiWeb.
  - Assigned 10.100.100.1 - server subnet (used as the gateway for non-FortiWeb traffic)

FortiWeb VMs in HA:
- 1 routing rule carrying 0.0.0.0/0 to 10.100.101.1
- Port 1, assigned 10.100.101.5 - receives traffic from the firewall
- Port 2, assigned 10.100.100.5 - passes traffic to and from the web servers
- Port 10 - assigned as heartbeat
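For completeness, the default route in the list above would be a single static route on the FortiWeb; roughly like this (exact syntax may vary by firmware version):

```
config router static
  edit 1
    set dst 0.0.0.0/0
    set gateway 10.100.101.1
  next
end
```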
If that's not enough info, let me know and I'll put a quick diagram together.
Shortly after rebooting the Secondary unit, we'd see pings start to drop to both Port 1 (10.100.101.5) and Port 2 (10.100.100.5) of the FortiWeb. It would be intermittent for about 30 seconds, then we'd see a long period of 1-2 minutes with no response, then it would become intermittent again and begin to stabilize.
The heartbeat was initially carried on the main VM Network; then we set up a separate network and things got a bit better, along with tweaking the settings mentioned above.
Good choice to put your heartbeat on its own network. (Ideally it's a direct link; in VM environments, the close equivalent is a dedicated vSwitch between the two VMs, which maybe you hinted at with promiscuous mode...?) Allowing MAC changes is a consequence of the mechanism in the help's diagram: to make the network fail over, the virtual MAC (VMAC) transfers to the secondary. So if the network doesn't recognize that the VMAC moved, failover won't work. That's why the VIP became unreachable.
The failover should be near-instantaneous, though - not flapping/dropping for 3-5 minutes. That ARP count is really high, too. Is your firewall directly attached to the FortiWeb? Or is there something in between that is slow to "learn"...?
Also, is it required to restore the same FortiWeb as primary when it rejoins? It's fewer failovers if you don't: just let them elect their own primary. When a member rejoins, let it become the secondary - restoring it to primary causes an extra failover, which is not ideal. Then you shouldn't need the 60-second wait either.
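If FortiWeb follows the usual Fortinet HA convention, that behaviour is controlled by the override setting; assuming the option exists under `config system ha` on your firmware, it would look like:

```
config system ha
  set override disable   # rejoining member stays secondary; no extra failover
end
```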
The failover is near instant; I've had absolutely no issues with that. The flapping only occurs when one of the FortiWebs is rejoining the HA pair after a reboot. I tried it with Override both on and off, but it really had little impact.
Our environment is very slightly more complex than I mentioned earlier, but not by much:
- 2x Firewalls (Active/Passive)
- 2x Layer 2 switches (linked for redundancy)
- 6x ESXi hosts, each with 2 uplink/production ports, 1 going to each switch (for resiliency)
So it is possible the heartbeat between the Main and Secondary has to pass through 2 switches. Not a significant distance - would you recommend reducing the ARP count?
You did the right thing by adjusting the boot time, but it may be worth looking into *why* this needed adjusting - it could indicate an underlying issue.
It may be interesting to run a packet capture during election to see what is happening re: ARP...? Do you have many packet collisions on that network, for example?
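One low-effort way to capture this is tcpdump from any Linux VM attached to the same port group (eth0 is an example interface name), watching the source MACs to see the VMAC move during the election:

```
# -n: no name resolution, -e: print MAC addresses so the VMAC handover is visible
tcpdump -n -e -i eth0 arp
```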
It's just a bit odd that it needed that many gratuitous ARPs. When the physical IP or vserver changes, the first ARP should trigger failover. It's a bit of a race condition - there are maybe a few milliseconds between the ARP notification and the table updates - so maybe a couple of packets get dropped, but thanks to TCP's retransmission that should be transparent to users; you'd just maybe lose a couple of ping or UDP packets, since ICMP/UDP don't feature reliable transport. 3 ARPs help to guarantee the switchover in case ARP #1 or #2 had a collision, but it's rare to need more than 6 unless the subnet has signal noise or other network issues.
Is there FTP or SSH to the back-end servers too? Or just HTTP/HTTPS and ping? If you don't mind, could you please PM the support ticket #?
Sorry for the delay in getting back to you - you were spot on. Failover should be instant, and there should be no need to change the ARP count or boot-time.
I found another issue where the FortiWebs were unable to validate their licensing because they were losing routing to the Internet, which I tracked down to a problem with source-NATing on our firewalls.
Bizarrely, once I fixed the source-NATing, the VMs suddenly started behaving as you described. I can't understand why this happened - there is absolutely no NATing internally; it's only applicable to traffic leaving our network.
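For what it's worth, when the 200Ds go in, the outbound policy on a FortiGate makes the source NAT explicit. A sketch with assumed interface and address-object names:

```
config firewall policy
  edit 0
    set srcintf "internal"          # assumed interface names
    set dstintf "wan1"
    set srcaddr "FortiWeb-subnet"   # assumed address object for 10.100.101.0/24
    set dstaddr "all"
    set action accept
    set schedule "always"
    set service "ALL"
    set nat enable                  # source-NAT outbound traffic to the egress IP
  next
end
```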
> I tracked it down to a problem with source-NATing on our firewalls.
That may be the cause. And with HA, make sure you have 2 licenses: one for each VM.
If it looks like a VM license is stolen (for example, if many validation requests for the same license are coming from multiple machines and multiple IPs), then validation can fail. You'll see the same behavior on FortiAnalyzer etc.; on FAZ the IP binding is stricter, which you'll see in the Support portal.
FortiWeb VM will check for license validation every 30 minutes or so, so the Internet connection needs to be reliable - it will retry, but does not tolerate indefinite failures. (The trial license doesn't require Internet access, but it lasts only 15 days and doesn't support HA...)
Remember to buy FortiGuard service for each machine. Otherwise, after failover, many features won't work, such as FortiGuard Antivirus, IP Reputation, and Security Services.