HA cluster "split brain" when downstream Cisco stack reloads?

scheuri · 05-11-2023

Hi all

I already opened a support ticket for this - however, I'd like to have some input (maybe others had the same issue?).

Situation:

We deploy FortiGate units (60F and 100F), currently running 6.4.9, as HA clusters.

The (single) cluster link is physical and direct (without any switches, etc.).

Downstream there is a Cisco stack with two Cisco switches as members.

FortiGate cluster node A is connected to Cisco stack member A, and FortiGate cluster node B is connected to Cisco stack member B.
Each connection is a single copper cable (RJ45) on internal1/port1 (so no LACP/aggregation).

Problem:
When the Cisco stack reboots/reloads, the FortiGate cluster fails over to the other member (or already ends up in split brain at that point). When the Cisco stack comes up again, most of the time the cluster ends up in split brain rather than a clean failover.
By split brain I mean that the currently active node seems to hand over the primary role, but the secondary doesn't take it (it seems to want to hand it back within the same second or so). Both nodes end up being neither really primary nor secondary, and the network goes down for the customer. One of the FortiGate nodes needs to be rebooted in order to fix it.

Question:
Are we really the only ones to experience this issue?

I think this might be a bug (I can't really prove it, though), while Fortinet support says it's more of a design issue. That makes me wonder: are we the first to adopt this design?

Additional information (edit)

When "internal1" (where the downstream stack is connected on the FGTs) is removed from the monitored interfaces, it works and there is no split brain.
The HA cable is always directly connected and does NOT go over the switches - this has been checked several times (by myself and others).
There are several locations affected with this kind of setup.
In the HA settings we have only one HA member port (port "B" as in bravo). We are using the FortiGate 60F in 90% of the affected locations, and only one location (so far) has a FortiGate 100F (where we use TWO HA member ports, HA1 and HA2).
The monitored interfaces are "wan1" (where the ISP router is connected) and "internal1" (where the Cisco stack is connected).
In the HA config, override is disabled; one node has priority 200 and the other has priority 100.
We also suspected that an HA uptime difference of less than the default 300 seconds (5 minutes) might be an issue, but we tested this as well (with a difference of more than 10 minutes) and the outcome did not change.
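
For context, here is a minimal Python sketch of how the primary election is commonly described for a FortiGate HA cluster with override disabled: most monitored interfaces up first, then HA uptime (only if the difference exceeds the 300-second margin mentioned above), then priority, then serial number. This is an illustrative model under that assumption, not Fortinet's implementation. The names `Node` and `elect_primary` and the serial values are hypothetical.

```python
from dataclasses import dataclass, replace

# Default HA uptime margin in seconds, as mentioned in the post above.
UPTIME_MARGIN_S = 300


@dataclass(frozen=True)
class Node:
    name: str
    monitored_up: int   # number of monitored interfaces with link up
    ha_uptime_s: int    # HA uptime in seconds
    priority: int       # configured HA priority
    serial: str         # hypothetical serial number, used as last tie-breaker


def elect_primary(a: Node, b: Node) -> Node:
    """Pick the primary, assuming override is disabled."""
    # 1. The node with more monitored interfaces up wins.
    if a.monitored_up != b.monitored_up:
        return a if a.monitored_up > b.monitored_up else b
    # 2. HA uptime counts only if the difference exceeds the margin.
    if abs(a.ha_uptime_s - b.ha_uptime_s) > UPTIME_MARGIN_S:
        return a if a.ha_uptime_s > b.ha_uptime_s else b
    # 3. Higher configured priority wins.
    if a.priority != b.priority:
        return a if a.priority > b.priority else b
    # 4. Last resort: higher serial number wins.
    return a if a.serial > b.serial else b


if __name__ == "__main__":
    node_a = Node("A", monitored_up=2, ha_uptime_s=5000, priority=200, serial="SN-A")
    node_b = Node("B", monitored_up=2, ha_uptime_s=4940, priority=100, serial="SN-B")

    # Both monitored links up, uptime difference within the margin:
    # the priority decides.
    print(elect_primary(node_a, node_b).name)

    # Stack member A goes down first, so node A loses internal1:
    # the monitored-interface count decides before priority is considered.
    node_a_degraded = replace(node_a, monitored_up=1)
    print(elect_primary(node_a_degraded, node_b).name)
```

In this model, the link state of "internal1" is evaluated before uptime and priority, so any difference in when the two stack members drop or restore their links changes the election result. That matches the observation that removing "internal1" from the monitored interfaces avoids the role changes, but it does not by itself explain the split brain.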

Edit 2:

I just found: https://community.fortinet.com/t5/FortiGate/Technical-Tip-High-Availability-basic-deployment-design/...

We are in scenario 1 (without LACP); this has been confirmed by our Cisco guys who run the downstream stack.

Thanks for your input

gfleming · 05-11-2023

What ports are you using for HA on the 60F?

Cheers,
Graham
