aadenola
Staff
Article Id 214842
Description

This article describes what a split-brain scenario is in an HA setup and common causes.

Scope

FortiGate, High Availability.

Solution

'Split-brain' is the term for a state in which the FortiGates in an HA cluster cannot communicate with each other over the heartbeat interface, so each FortiGate assumes that it is the primary.

In a split-brain scenario, every unit acts as primary and therefore uses the same MAC addresses, which causes an outage in the network.

 

Common symptoms of split-brain:
  • After logging in to each FortiGate, 'get sys ha status' in the CLI or the System -> HA tab in the GUI shows only one unit.
  • Sessions cannot be established through the FortiGate. Traffic is dropped.
  • Administrative connections to the FortiGate cluster work only intermittently: sometimes the traffic reaches one FortiGate, other times it reaches the other.
 

To avoid a split-brain scenario:

  • In a two-member HA configuration, use back-to-back links (direct connection) for the heartbeat interface instead of connecting through a switch.
  • Use redundant HA heartbeat interfaces.
  • In a configuration where members are in different locations, ensure that the heartbeat interval and lost-heartbeat threshold together allow more time than the worst-case latency on the links. See this article for more details on tuning these thresholds: Technical Tip: Changing the HA heartbeat timers to prevent false failover
  • Starting from v7.6.0, a backup heartbeat interface can be configured. The backup heartbeat interface is a dedicated interface used only when a secondary unit does not receive heartbeats from the primary through the regular heartbeat interfaces. Refer to the following article for a detailed explanation: Technical Tip: Mitigation of split brain issue when heartbeat interface is down.

config system ha
    set backup-hbdev <interface list>
end
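
For the case of members in different locations, the time a unit waits before declaring its peer lost can be estimated as the heartbeat interval multiplied by the lost-heartbeat threshold. The following Python sketch illustrates that arithmetic only. It assumes the 'hb-interval' setting is expressed in units of 100 ms and that 'hb-lost-threshold' is a count of consecutive missed heartbeats; confirm both against the documentation for the firmware in use. The function names are hypothetical and are not part of FortiOS.

```python
def detection_time_ms(hb_interval, hb_lost_threshold, unit_ms=100):
    """Estimate how long a unit waits before declaring a peer lost.

    hb_interval       -- heartbeat interval, in multiples of unit_ms (assumed 100 ms)
    hb_lost_threshold -- number of consecutive missed heartbeats tolerated
    """
    return hb_interval * unit_ms * hb_lost_threshold


def tolerates_latency(hb_interval, hb_lost_threshold, worst_case_latency_ms, unit_ms=100):
    """Return True if the detection window is longer than the worst-case link latency."""
    window = detection_time_ms(hb_interval, hb_lost_threshold, unit_ms)
    return window > worst_case_latency_ms


if __name__ == "__main__":
    # Example values only: interval 2 (200 ms) and threshold 6.
    print(detection_time_ms(2, 6))
    print(tolerates_latency(2, 6, worst_case_latency_ms=1500))
```

If the second call returns False for the latency measured on the inter-site link, the timers are too aggressive for that link and a false failover or split-brain becomes more likely.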

 

Common causes of split-brain:

  1. Split-brain is usually caused by a complete loss of the heartbeat link or links. This can be a physical connectivity issue, or, less commonly, something blocking the heartbeat packets between the HA members.
  2. Congestion and latency in the heartbeat links that exceed the heartbeat lost intervals and thresholds.
  3. The secondary unit may be down or unable to boot.

 

Congestion on the heartbeat link can occur when the same link is also used for session synchronization. To keep heartbeat latency low, it is recommended to use a separate link/interface for session sync.

See this KB article for more info: Technical Tip: HA session-sync-dev configuration.

  • If the management PC used to access the FortiGates (via GUI or SSH) is located behind a switch connected to the cluster, the duplicated MAC addresses of a split-brain cause intermittent or failed administrative connectivity. The switch cannot reliably forward traffic to the correct FortiGate because it learns the same MAC address on multiple ports.

As a result:

  • GUI and SSH access to the FortiGate cluster becomes erratic or fails altogether.
  • Sessions may intermittently succeed or time out, depending on which FortiGate receives the traffic at a given time.
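
On the switch side, this usually shows up as the same MAC address moving back and forth between the ports that lead to each FortiGate. The Python sketch below counts such moves from a time-ordered list of (MAC, port) learn events. The event list, the MAC address, and the port names are hypothetical sample data; how the events are collected from a particular switch is outside the scope of this sketch.

```python
from collections import defaultdict


def count_mac_moves(events):
    """Count how often each MAC address moved to a different switch port.

    events -- iterable of (mac, port) tuples in the order the switch learned them.
    Returns a dict {mac: number_of_moves} containing only MACs that moved.
    """
    last_port = {}
    moves = defaultdict(int)
    for mac, port in events:
        mac = mac.lower()
        if mac in last_port and last_port[mac] != port:
            moves[mac] += 1
        last_port[mac] = port
    return dict(moves)


if __name__ == "__main__":
    # Hypothetical learn events: one MAC alternates between two ports.
    sample = [
        ("00:09:0f:09:00:01", "port-a"),
        ("00:09:0f:09:00:01", "port-b"),
        ("00:09:0f:09:00:01", "port-a"),
        ("aa:bb:cc:dd:ee:ff", "port-c"),
    ]
    print(count_mac_moves(sample))
```

A MAC address with a steadily growing move count between the two FortiGate-facing ports is consistent with both units answering as primary.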

 

Below are the troubleshooting steps:

  1. Identify the heartbeat port and confirm that it is up.
    show system ha

 

diagnose hardware deviceinfo nic xxx  <----- Where xxx is the port name.

 


 

  2. Verify that the heartbeat ports are sending and receiving heartbeat packets.

    diagnose sniffer packet any "ether proto 0x8890" 4 <----- NAT/Route Mode Heartbeat.
    diagnose sniffer packet any "ether proto 0x8891" 4 <----- Transparent Mode Heartbeat.
    diagnose sniffer packet any "ether proto 0x8893" 4 <----- Configuration synchronization.

To stop the sniffer, use CTRL+C.

Verify that the HA configuration matches between the HA members: settings such as HA mode, group-name, group-id, and password must be the same on every member.
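
One way to compare these settings is to capture the 'show system ha' output from each unit and diff the relevant lines. The Python sketch below is a minimal illustration, assuming the output consists of 'set <option> <value>' lines; the function names and the sample values are hypothetical. The password is deliberately not compared, because the stored value is shown in encrypted form and is assumed here not to be directly comparable between units. Options left at their default may not appear in the output, in which case both sides are treated as unset.

```python
def parse_ha_settings(show_output):
    """Turn 'set <option> <value>' lines into a dict {option: raw value string}."""
    settings = {}
    for line in show_output.splitlines():
        parts = line.strip().split(None, 2)
        if len(parts) == 3 and parts[0] == "set":
            settings[parts[1]] = parts[2]
    return settings


# Options that must be identical on all members (password excluded, see above).
MUST_MATCH = ("mode", "group-id", "group-name")


def ha_mismatches(unit_a_output, unit_b_output, keys=MUST_MATCH):
    """Return {option: (value_on_a, value_on_b)} for options that differ."""
    a = parse_ha_settings(unit_a_output)
    b = parse_ha_settings(unit_b_output)
    return {k: (a.get(k), b.get(k)) for k in keys if a.get(k) != b.get(k)}


if __name__ == "__main__":
    # Hypothetical captures from two units.
    unit_a = 'config system ha\n    set group-id 10\n    set group-name "site"\n    set mode a-p\nend'
    unit_b = 'config system ha\n    set group-id 11\n    set group-name "site"\n    set mode a-p\nend'
    print(ha_mismatches(unit_a, unit_b))
```

An empty result means the compared options agree; per-unit options such as priority are expected to differ and are not checked.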

 

Assuming that packets are seen going both ways in the previous step, the following debug, run on each unit, may provide more information on why the units are not able to communicate:

diagnose debug reset
diagnose debug app hatalk -1
diagnose debug enable

 

To stop debugging:

diagnose debug disable
diagnose debug reset

 

  3. Confirm that both units run the same firmware version by running this command on each unit:

get system status
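
The version lines from the two units can be compared by eye, or with a small script when many clusters are checked. The Python sketch below is an illustration only. It assumes the output contains a line beginning with 'Version:' that includes a string of the form 'v<major>.<minor>.<patch>,build<number>'; the sample lines in the script are hypothetical, and the pattern should be adjusted if the actual output differs.

```python
import re

# Assumed pattern of the version string, for example "v7.2.0,build1157".
VERSION_RE = re.compile(r"v(\d+)\.(\d+)\.(\d+),build(\d+)")


def firmware_version(status_output):
    """Return (major, minor, patch, build) from the 'Version:' line, or None."""
    for line in status_output.splitlines():
        if line.strip().startswith("Version:"):
            match = VERSION_RE.search(line)
            if match:
                return tuple(int(group) for group in match.groups())
    return None


def same_firmware(status_a, status_b):
    """Return True only if both outputs contain the same parsed version."""
    version_a = firmware_version(status_a)
    version_b = firmware_version(status_b)
    return version_a is not None and version_a == version_b


if __name__ == "__main__":
    # Hypothetical 'Version:' lines from two units.
    unit_a = "Version: FortiGate-VM64 v7.2.0,build1157,220331 (GA)"
    unit_b = "Version: FortiGate-VM64 v7.2.1,build1254,220803 (GA)"
    print(firmware_version(unit_a), firmware_version(unit_b))
    print(same_firmware(unit_a, unit_b))
```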

 

  4. Check whether the heartbeat interface is flapping by running this command on each unit:

diagnose sys ha history read

 

Primary:
diagnose sys ha history read
version=1.1
HA state change time: 2022-06-16 12:55:36
message_count=8/512
<2022-06-16 12:55:36> FGVMEVIJGWSKGW55 is elected as the cluster primary of 1 member
<2022-06-16 12:55:36> member FGVMEV_FDLRD6Y15 lost heartbeat on hbdev port2
<2022-06-16 12:55:36> heartbeats from FGVMEV_FDLRD6Y15 are lost on all hbdev
<2022-06-16 12:55:32> hbdev port2 link status changed: 1->0
 
Secondary:
diagnose sys ha history read
version=1.1
HA state change time: 2022-06-16 12:55:36
message_count=6/512
<2022-06-16 12:55:36> member FGVMEVIJGWSKGW55 lost heartbeat on hbdev port2
<2022-06-16 12:55:36> FGVMEV_FDLRD6Y15 is elected as the cluster primary of 1 member
<2022-06-16 12:55:36> heartbeats from FGVMEVIJGWSKGW55 are lost on all hbdev
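
In the output above, each unit records that heartbeats from its peer were lost on all hbdev and that it elected itself primary of a 1-member cluster, which is the signature of a split-brain. The Python sketch below extracts those entries from the history text of both units. It assumes the message wording matches the sample output shown in this article; the function names are hypothetical, and timestamps are ignored, so old entries in the history can also trigger a match.

```python
import re

ENTRY_RE = re.compile(r"^<(?P<time>[^>]+)>\s+(?P<msg>.*)$")
ELECTED_RE = re.compile(r"^(?P<sn>\S+) is elected as the cluster primary of (?P<n>\d+) member")
ALL_LOST_RE = re.compile(r"^heartbeats from (?P<sn>\S+) are lost on all hbdev")
LINK_RE = re.compile(r"^hbdev (?P<port>\S+) link status changed: (?P<old>\d+)->(?P<new>\d+)")


def summarize_ha_history(text):
    """Collect the split-brain related entries from one unit's HA history."""
    summary = {"elected_alone": [], "peers_lost": [], "link_down": []}
    for line in text.splitlines():
        entry = ENTRY_RE.match(line.strip())
        if not entry:
            continue  # header lines such as version= or message_count=
        msg = entry.group("msg")
        elected = ELECTED_RE.match(msg)
        lost = ALL_LOST_RE.match(msg)
        link = LINK_RE.match(msg)
        if elected and elected.group("n") == "1":
            summary["elected_alone"].append(elected.group("sn"))
        elif lost:
            summary["peers_lost"].append(lost.group("sn"))
        elif link and link.group("new") == "0":
            summary["link_down"].append(link.group("port"))
    return summary


def looks_like_split_brain(history_a, history_b):
    """True if both units elected themselves primary of a 1-member cluster."""
    a = set(summarize_ha_history(history_a)["elected_alone"])
    b = set(summarize_ha_history(history_b)["elected_alone"])
    return bool(a) and bool(b) and a.isdisjoint(b)
```

Passing the two captures shown above to looks_like_split_brain() is expected to return True, and the summary for the primary also lists port2 under link_down, which points to a physical link problem on that heartbeat interface.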
 