Fortinet Forum
ede_pfau
Esteemed Contributor III

HA over VLAN to remote FGT?

hello all,

 

I'm planning to place the slave unit of a FortiGate HA cluster in a remote location. There is a leased line (layer 2) for the HA connection. Can anybody confirm that I can run the HA traffic across a VLAN between the access switches on each side of the line?

 

I know that HA traffic uses a non-standard ethertype, and I've tested that HA traffic is transferred unchanged over that line. But now there will be VRRP traffic between 2 routers on this line as well, and I'd like to isolate the HA traffic on a VLAN of its own.

 

There is the option to enable authentication and encryption of the HA traffic, but this will cost performance, though I guess it would isolate the traffic.

 

I appreciate any advice, esp. from someone who has already separated a HA cluster geographically.


Ede

"Kernel panic: Aiee, killing interrupt handler!"
1 Solution
MrSinners

"Latency is important" did not fully get my point across: a latency of 15 ms is more than good enough, and even 100 ms would do. Depending on the customer's setup and the quality of the leased line, a situation could occur in which some heartbeat packets are not sent out quickly enough, or some are missed by the other node, and an active-active split-brain situation occurs, which causes all traffic to be dropped. This could happen because of:

- congestion on the leased line

- other provider issues or maintenance

- being targeted by a DDoS attack

- a higher amount of incoming/outgoing traffic than expected

- inspecting more traffic than anticipated or the unit can handle, causing high CPU load which might prevent handling of the HB packets

- the number of sessions being synced between the units, and whether connectionless sessions (UDP and ICMP) are synced as well

 

When one of these occurs, some traffic will be affected but not all of it. When HB packets are missed and a split-brain situation happens, however, pretty much all traffic stops until the nodes see each other again and the cluster is restored. The chances of this actually happening are very low. The thing to look out for is "HB interface lost" messages in the System/HA logging. Depending on the cause of these issues, different solutions might apply. However, if you want the cluster to be more lenient about missed HB packets, you can fine-tune the following settings in the "config system ha" configuration:

hb-lost-threshold <threshold_integer>   default value = 6 (5 HB packets may be missed; at the 6th missed HB packet the interface is marked as "lost")

hb-interval <interval_integer>   default value = 2 (in units of 100 ms, which makes it 200 ms)
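
To illustrate how the threshold behaves, here is a minimal Python sketch. It assumes that misses are counted consecutively and that a received heartbeat resets the counter; that detail is an assumption on my part. The function name is made up for illustration, and this is not how FortiOS is actually implemented.

```python
def first_lost_index(received, lost_threshold=6):
    """Return the index of the heartbeat slot at which the peer would be
    declared lost, or None if the threshold is never reached.

    received: iterable of booleans, one per heartbeat interval
              (True = heartbeat seen, False = heartbeat missed).
    Assumption: only consecutive misses count; any received heartbeat
    resets the counter.
    """
    missed = 0
    for i, ok in enumerate(received):
        if ok:
            missed = 0
        else:
            missed += 1
            if missed >= lost_threshold:
                return i
    return None


# Five misses in a row, then a heartbeat arrives: the link stays up.
print(first_lost_index([True] + [False] * 5 + [True]))  # None
# Six misses in a row: declared lost at the sixth missed slot.
print(first_lost_index([True] + [False] * 6))           # 6
```

With the default of 6, a burst of up to five missed heartbeats is tolerated in this model.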

 

We can calculate the time after which the FortiGate marks an HB interface as lost by multiplying these values: 6 x 200 ms = 1200 ms (1.2 seconds). Depending on timing, this can be slightly lower or higher. Only change these values after investigating the "HB interface lost" messages and once you are certain this is the right thing to do, as the messages can be caused by other factors (e.g. the patch cable to the switch could be broken).
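
The same arithmetic as a small Python helper, in case you want to try other values before touching the config. The function name is hypothetical, and the second pair of values is only an example, not a recommendation.

```python
def hb_lost_time_ms(hb_interval=2, hb_lost_threshold=6):
    """Approximate time in milliseconds until an HB interface is marked lost.

    hb_interval is given in units of 100 ms, as in "config system ha".
    """
    return hb_interval * 100 * hb_lost_threshold


print(hb_lost_time_ms())       # 1200 (defaults: 6 x 200 ms)
print(hb_lost_time_ms(4, 10))  # 4000 (example values only)
```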

 

More information at http://kb.fortinet.com/kb/documentLink.do?externalID=10043

GusTech

Which HA mode are you running, and what is the bandwidth of the VLAN link? I'm also planning to do this.

Fortigate <3

MikePruett
Valued Contributor

It doesn't take much. The main thing is latency. I have a client with a dedicated fiber link, run over a few miles, that handles HA, and it works beautifully.

GusTech

MikePruett wrote:

It doesn't take much. The main thing is latency. I have a client with a dedicated fiber link, run over a few miles, that handles HA, and it works beautifully.

Just wondering. I have around 1 ms latency and am going to run A-P.

But if someone is running this in full A-A with UTM, I wonder how much bandwidth is required. I have never done any testing on it.

Fortigate <3

ede_pfau
Esteemed Contributor III

Put this into service yesterday. It's an A-P cluster running v4.3.18 over a dedicated Colt "LAN Link" with 100 Mbps. Latency is around 2 ms with no traffic, up to 60 ms with a lot of traffic. Distance (by air) around 2-3 miles.

 

First off, it works. I meant the connection to carry only 3 VLANs: 2 for the HA links (redundant) and one for the router-to-router VRRP traffic. As it turned out, any traffic is carried across that link, not only the planned VLAN traffic. I didn't see this coming...

 

VRRP setup: one route to the next higher hop is tracked, plus the link status to the firewall (router LAN port -> FGT wan port). VRRP failover is working correctly. But since the FGT failover is not synchronized with the VRRP status, it happened that the main router failed over to its backup while the main FGT kept running. Then all internet-bound traffic traversed the router-to-router link, which of course was easily saturated.

 

Then I set up a ping target (gwdetect) for the FGTs, tracking the next hop router. Ultimately, device and link failover is nice but rare; connectivity loss occurs more often.

 

That was a catch-22: when the main access was cut (for testing), the router failed over, but the FGT could still reach the next hop because the ping now went out across the router-to-router link.

Then we took the next hop router down.

Now both FGTs, tracking the same target, assumed 'line down' and both stopped processing traffic. Not good.

Took me a while to understand why I could ping the FGTs but no traffic was passing. Ping server settings are synched in HA clusters.
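
To make the two test cases easier to follow, here is a rough Python model of the ping-target reachability. The topology (node names, a single shared next hop reachable from both routers) is my simplified assumption of the setup described above, not the exact network, and all names are made up.

```python
from collections import deque

# Hypothetical, simplified topology: each FGT hangs off its local router,
# the routers are joined by the router-to-router link, and both routers
# have their own access to the same next hop (the ping target).
LINKS = frozenset({
    ("fgt_main", "rtr_main"),
    ("fgt_backup", "rtr_backup"),
    ("rtr_main", "rtr_backup"),   # router-to-router link over the leased line
    ("rtr_main", "next_hop"),     # main access
    ("rtr_backup", "next_hop"),   # backup access
})


def reachable(src, dst, links, down_nodes=frozenset()):
    """Breadth-first search: can src reach dst over the given links?"""
    if src in down_nodes or dst in down_nodes:
        return False
    seen, queue = {src}, deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            return True
        for a, b in links:
            if node in (a, b):
                peer = b if node == a else a
                if peer not in seen and peer not in down_nodes:
                    seen.add(peer)
                    queue.append(peer)
    return False


def ping_results(links, down_nodes=frozenset()):
    """Ping-target result as seen from each FGT."""
    return {fgt: reachable(fgt, "next_hop", links, down_nodes)
            for fgt in ("fgt_main", "fgt_backup")}


# Main access cut: the main FGT still reaches the target via the cross link.
print(ping_results(LINKS - {("rtr_main", "next_hop")}))
# {'fgt_main': True, 'fgt_backup': True}

# Next hop down: both FGTs see the target as dead at the same time.
print(ping_results(LINKS, down_nodes={"next_hop"}))
# {'fgt_main': False, 'fgt_backup': False}
```

In this model the ping target never distinguishes the two units: either both see it or neither does, which matches what I observed.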

 

To fix this, my first thought is to block ICMP on the router-to-router link. The Cisco routers do not seem to be capable of this. A small FGT in Transparent Mode will do it easily. But it doesn't feel right...

 

So I'm grateful for any suggestions on how to handle this without additional hardware.


Ede

"Kernel panic: Aiee, killing interrupt handler!"
MrSinners

The software version is rather old. Any reason not to use 5.2.7 or 5.2.8?

 

The reason the slave FG is using the same ping server result as the previous master is most likely the setting "set pingserver-slave-force-reset disable", which should be set to "enable". I'm not sure whether this is present in 4.3.18.

 

Can you clarify what you mean by "all traffic" going over the router-to-router link? Do you mean that all normal traffic between the active router and the active FG is also sent over the Colt link to the non-active components? As you have to connect all interfaces to both FG nodes equally, and since the VRRP routing address is used to route traffic, it's not that unexpected that some traffic goes over it, and in a worst-case scenario all traffic might be routed over it.