aadenola
Staff
Article Id 214842
Description

This article describes what a split-brain scenario is in an HA setup and the common causes.

Scope

FortiGate, High Availability.
Solution

'Split-brain' is the term for the condition in which the FortiGates in an HA cluster cannot communicate with each other over the heartbeat interface, causing each FortiGate to assume that it is the Primary.

In a split-brain scenario, each unit uses the same cluster MAC addresses on its interfaces. The connected switches then see the same addresses on more than one port, which causes an outage in the network.

 

Common symptoms of split-brain:
  • When logging into each individual FortiGate, 'get sys ha status' in the CLI or the System -> HA tab in the GUI only shows one unit.
  • Sessions cannot be established through the FortiGate. Traffic is dropped.
  • When trying to connect to the FortiGate cluster via administrative access, connections work intermittently. Sometimes traffic will hit one FortiGate. Other times it will hit the other.
 

To avoid a split-brain scenario:

  • In a two-member HA configuration, use back-to-back links (direct connection) for the heartbeat interface instead of connecting through a switch.
  • Use redundant HA heartbeat interfaces.
  • In a configuration where members are in different locations, ensure the heartbeat lost intervals and thresholds are longer than the possible latency in the links. See this article for more details on tuning these thresholds: Technical Tip: Changing the HA heartbeat timers to prevent false failover 
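
The last two recommendations can be sketched as a configuration fragment. This is a minimal example under stated assumptions, not a recommended setting: the port names (port3, port4), the priority value 50, and the timer values are illustrative only. It assumes hb-interval is expressed in units of 100 ms, so the values below would give a window of roughly 4 x 100 ms x 12 = 4.8 seconds before heartbeats are considered lost. The <----- annotations follow the convention used in this article and are not part of the commands.

```text
config system ha
    set hbdev "port3" 50 "port4" 50    <----- Two heartbeat interfaces (assumed port names), each with priority 50.
    set hb-interval 4                  <----- Assumed value; time between heartbeat packets, in 100 ms units.
    set hb-lost-threshold 12           <----- Assumed value; missed heartbeats before the peer is considered lost.
end
```

The appropriate values depend on the latency actually measured on the heartbeat links, so the numbers above should be treated as placeholders.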

 

Common causes of split-brain:

  1. Split-brain is usually caused by complete loss of the heartbeat link or links. This can be a physical connectivity issue, or less commonly, something blocking the heartbeat packets between the HA members.
  2. Congestion and latency in the heartbeat links that exceed the heartbeat lost intervals and thresholds.

 

Using the same link for both the heartbeat and session synchronization can congest the heartbeat link. To keep heartbeat latency low, it is recommended to use a separate link/interface for session sync.

See this article for more info:  Technical Tip: HA session-sync-dev configuration - Fortinet Community  
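
As an illustration of moving session sync off the heartbeat link, the fragment below is a sketch that assumes a spare interface named port5 (a hypothetical name) is cabled between the members and that session pickup is in use. The <----- annotations are not part of the commands.

```text
config system ha
    set session-pickup enable          <----- Session sync only applies when session pickup is enabled.
    set session-sync-dev "port5"       <----- Assumed interface name; dedicated link for session sync.
end
```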

Below are the troubleshooting steps:

 

  1. Identify the heartbeat port and confirm if it is up.

show system ha


 

diagnose hardware deviceinfo nic xxx  <----- Where xxx is the port name.

 


 

  2. Verify that the heartbeat ports are sending and receiving heartbeat packets.

    diagnose sniffer packet any "ether proto 0x8890" 4 <----- NAT/Route Mode Heartbeat.
    diagnose sniffer packet any "ether proto 0x8891" 4 <----- Transparent Mode Heartbeat.
    diagnose sniffer packet any "ether proto 0x8893" 4 <----- Configuration synchronization.

To stop the sniffer use CTRL+C.

Verify that the HA configuration matches between the HA members. Settings such as HA mode, group-name, group-id, and password must be the same on both units.
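
For reference, these settings appear under 'config system ha' in the output of 'show system ha'. The fragment below is only a sketch of what to compare on each unit: the group-id and group-name values are made-up examples, and the password is typically displayed in encrypted form, so it may not be directly comparable between units. If in doubt, re-enter the same password on both members. The <----- annotations are not part of the output.

```text
config system ha
    set group-id 10                       <----- Example value; must match on all members.
    set group-name "HA-cluster"           <----- Example value; must match on all members.
    set mode a-p                          <----- a-p (active-passive) or a-a (active-active); must match.
    set password ENC <encrypted string>   <----- Placeholder; shown encrypted in the output.
end
```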

 

If packets are seen going in both directions in the previous step, run the following debug on each unit for more information on why the units cannot communicate:

diagnose debug reset
diagnose debug application hatalk -1
diagnose debug enable

 

To stop the debug:

diagnose debug disable
diagnose debug reset

 

  3. Ensure the firmware version of both units matches by running this command:

    get system status

  4. Verify whether the heartbeat interface is flapping by running this command on each unit:

     

    dia sys ha history read

     

Primary:
 
dia sys ha history read
version=1.1
HA state change time: 2022-06-16 12:55:36
message_count=8/512
<2022-06-16 12:55:36> FGVMEVIJGWSKGW55 is elected as the cluster primary of 1 member
<2022-06-16 12:55:36> member FGVMEV_FDLRD6Y15 lost heartbeat on hbdev port2
<2022-06-16 12:55:36> heartbeats from FGVMEV_FDLRD6Y15 are lost on all hbdev
<2022-06-16 12:55:32> hbdev port2 link status changed: 1->0
 
Secondary:
 
dia sys ha history read
version=1.1
HA state change time: 2022-06-16 12:55:36
message_count=6/512
<2022-06-16 12:55:36> member FGVMEVIJGWSKGW55 lost heartbeat on hbdev port2
<2022-06-16 12:55:36> FGVMEV_FDLRD6Y15 is elected as the cluster primary of 1 member
<2022-06-16 12:55:36> heartbeats from FGVMEVIJGWSKGW55 are lost on all hbdev
 
Related document: