Technical Note: Error message 'worker_failure=1/1' or 'worker_failure=2/2' on SLBC cluster

Sabk_FTNT · 08-22-2017

Description

This article describes an issue that can occur on SLBC clusters in HA mode and explains how to troubleshoot and solve it.

If the FortiController with chassis ID 1 and the FortiController with chassis ID 2 are swapped (FortiController 2 placed in chassis 1 and FortiController 1 in chassis 2), communication between the FortiGate blades and the FortiControllers will not work correctly until the FortiGate blades are rebooted.

Exchanging FortiController blades is not an action performed under normal operation but it might be done for troubleshooting purposes.

Scope

SLBC cluster in HA mode.

2 chassis with at least 1 FortiGate blade in each.

Solution

One symptom is visible on the FortiController: the command 'diag sys ha status' reports "worker_failure=1/1" or "worker_failure=2/2". This means that the FortiController cannot communicate correctly with the FortiGate blades.

FTCtrl-1# diag sys ha status
mode: a-p
minimize chassis failover: 1
FTCtrl-1(FT503Cxxxxxxxx22), Slave(priority=1), ip=172.254.128.10, uptime=76.52, chassis=2(1)
    slot: 1
    sync: conf_sync=1, elbc_sync=0
    session: total=0, session_sync=out of sync
    state: gateway_die=0, worker_failure=1/1, lag=(total/good/down/bad-score)=2/2/0/0,
           intf_state=(port up)=0, force-state(0:none)
    hbdevs: local_interface=        b1 best=yes
            local_interface=        b2 best=no

FTCtrl-2(FT503Cxxxxxxxx33), Master(priority=0), ip=172.254.128.9, uptime=407781.23, chassis=1(1)
    slot: 1
    sync: conf_sync=1, elbc_sync=1, conn=3(connected)
    session: total=2034, session_sync=in sync
    state: gateway_die=0, worker_failure=0/1, lag=(total/good/down/bad-score)=2/2/0/0,
           intf_state=(port up)=0, force-state(0:none)
    hbdevs: local_interface=        b1 last_hb_time= 188.09   status=alive
            local_interface=        b2 last_hb_time= 188.09   status=alive
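
The check described above can be automated on saved CLI output. The following is a minimal sketch, assuming the text layout matches the sample shown in this article; the function name find_worker_failures and the reading of "worker_failure=X/Y" as failed/total workers are illustrative assumptions, not part of any Fortinet tool.

```python
import re

# Member header line, e.g. "FTCtrl-1(FT503C...), Slave(priority=1), ..."
MEMBER_RE = re.compile(r"^([^\s(]+)\(([^)]*)\),\s*(Master|Slave)\(")
# "worker_failure=1/1": read here as failed/total (an assumption).
WORKER_FAILURE_RE = re.compile(r"worker_failure=(\d+)/(\d+)")


def find_worker_failures(ha_status_text):
    """Return (member, role, failed, total) for each cluster member
    whose workers are all reported as failed."""
    results = []
    member, role = None, None
    for line in ha_status_text.splitlines():
        header = MEMBER_RE.match(line.strip())
        if header:
            member, role = header.group(1), header.group(3)
            continue
        match = WORKER_FAILURE_RE.search(line)
        if match and member is not None:
            failed, total = int(match.group(1)), int(match.group(2))
            if total > 0 and failed == total:
                results.append((member, role, failed, total))
    return results


# Trimmed copy of the output shown in this article.
SAMPLE_HA_STATUS = """\
FTCtrl-1(FT503Cxxxxxxxx22), Slave(priority=1), ip=172.254.128.10, uptime=76.52, chassis=2(1)
    state: gateway_die=0, worker_failure=1/1, lag=(total/good/down/bad-score)=2/2/0/0,
FTCtrl-2(FT503Cxxxxxxxx33), Master(priority=0), ip=172.254.128.9, uptime=407781.23, chassis=1(1)
    state: gateway_die=0, worker_failure=0/1, lag=(total/good/down/bad-score)=2/2/0/0,
"""

failures = find_worker_failures(SAMPLE_HA_STATUS)
```

With the sample above, only FTCtrl-1 (the slave in chassis 2) is flagged, which matches the manual reading of the output.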

Another symptom is that the load balance status reports "Waiting for data heartbeat" for the affected blade:

FTCtrl-1# get load-balance status
ELBC Master Blade: N/A
Confsync Master Blade: N/A
Blades:
     Working: 0 [ 0 Active 0 Standby]
     Ready:    0 [ 0 Active 0 Standby]
     Dead:     1 [ 1 Active 0 Standby]
    Total:     1 [ 1 Active 0 Standby]

     Slot 3: Status:Dead      Function:Active
       Link:      Base: Up          Fabric: Up
       Heartbeat: Management: Good   Data: Failed
       Status Message:"Waiting for data heartbeat."
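
When several blades are installed, scanning this output by eye becomes tedious. The sketch below, which assumes the per-slot layout shown above, lists the slots reported as Dead together with their status message. The name dead_slots is hypothetical and chosen only for this example.

```python
import re

# Per-slot summary line, e.g. "Slot 3: Status:Dead      Function:Active"
SLOT_RE = re.compile(r"Slot\s+(\d+):\s+Status:(\w+)\s+Function:(\w+)")
# Status message line, e.g. 'Status Message:"Waiting for data heartbeat."'
MESSAGE_RE = re.compile(r'Status Message:"([^"]*)"')


def dead_slots(load_balance_text):
    """Return a list of dicts describing every slot whose status is Dead."""
    slots = []
    current = None
    for line in load_balance_text.splitlines():
        slot = SLOT_RE.search(line)
        if slot:
            current = {
                "slot": int(slot.group(1)),
                "status": slot.group(2),
                "function": slot.group(3),
                "message": None,
            }
            slots.append(current)
            continue
        message = MESSAGE_RE.search(line)
        if message and current is not None:
            current["message"] = message.group(1)
    return [s for s in slots if s["status"] == "Dead"]


# Per-slot part of the output shown in this article.
SAMPLE_LB_STATUS = """\
     Slot 3: Status:Dead      Function:Active
       Link:      Base: Up          Fabric: Up
       Heartbeat: Management: Good   Data: Failed
       Status Message:"Waiting for data heartbeat."
"""

dead = dead_slots(SAMPLE_LB_STATUS)
```

For the sample, the result contains a single entry for slot 3 with the message "Waiting for data heartbeat.".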

The FortiController and the FortiGate blades communicate over an internal elbc-base-ctrl IP address, which is assigned by the FortiController:

IP 10.147.xxx.3 is assigned to FortiGate in slot 3 by FortiController with chassis ID 1
IP 10.147.xxx.19 is assigned to FortiGate in slot 3 by FortiController with chassis ID 2

The last octet corresponds to the slot value ("my slot") reported by the command 'diag test application chlbd 1'. In this example, the blade in slot 3 of chassis 2 reports slot 19:

FGT-1 (global) # diag test application chlbd 1

my service group id=1
my chassis=2
active channel=1
best active channel=1
master chassis=no
Other chassis is master=yes
my slot=19
master slot=3
other chassis master slot=3
chassis master slot=19
active slot mask=00080008(1.3,2.3)
chassis active slot mask=00080000(2.3)
update_timer is running
last_rx of update msg is 40 ago
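
The numbers in this example are consistent with a simple rule: chassis 2 values are the physical slot number plus 16. The following sketch encodes that rule and applies it to the slot masks. The rule is inferred only from the values shown in this article (.3 and .19 for slot 3, masks 00080008 and 00080000) and is an assumption, not documented behavior. Both function names are hypothetical.

```python
def expected_last_octet(chassis_id, slot):
    """Last octet of the elbc-base-ctrl address for a blade.

    Assumption inferred from the article's example: chassis 1 uses the
    slot number as-is, chassis 2 uses slot + 16.
    """
    if chassis_id not in (1, 2):
        raise ValueError("chassis_id must be 1 or 2")
    return slot + 16 * (chassis_id - 1)


def decode_slot_mask(mask_hex):
    """Decode a mask such as '00080008' into (chassis, slot) pairs.

    Assumption: bit N set means logical slot N is active, with bits
    0-15 belonging to chassis 1 and bits 16-31 to chassis 2.
    """
    mask = int(mask_hex, 16)
    pairs = []
    for bit in range(32):
        if mask & (1 << bit):
            pairs.append((bit // 16 + 1, bit % 16))
    return pairs


octet_chassis_1 = expected_last_octet(1, 3)   # slot 3 in chassis 1
octet_chassis_2 = expected_last_octet(2, 3)   # slot 3 in chassis 2
active = decode_slot_mask("00080008")          # "active slot mask" above
```

Under these assumptions the two octets are 3 and 19, and the mask decodes to chassis 1 slot 3 and chassis 2 slot 3, matching the "(1.3,2.3)" annotation in the output.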

To check which address is actually assigned to a FortiGate:

FGT-1 (elbc-mgmt) # diag ip add list | grep 10.147
IP=10.147.187.19->10.147.187.19/255.255.255.0 index=94 devname=elbc-base-ctrl

If the FortiControllers have been exchanged between the two chassis, the FortiGate will have two elbc-base-ctrl IP addresses, one for chassis ID 1 and one for chassis ID 2:

FGT-1 (elbc-mgmt) # diag ip add list | grep 10.147

IP=10.147.187.19->10.147.187.19/255.255.255.0 index=94 devname=elbc-base-ctrl
IP=10.147.187.3->10.147.187.3/255.255.255.0 index=94 devname=elbc-base-ctrl

These duplicate elbc-base-ctrl IP addresses will prevent normal SLBC cluster operation.
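
The duplicate-address condition can be detected from saved 'diag ip add list' output. The following is a minimal sketch, assuming each address line follows the "IP=...->... index=... devname=..." layout shown above; the function names are hypothetical.

```python
import re
from collections import defaultdict

# Address line, e.g.
# "IP=10.147.187.19->10.147.187.19/255.255.255.0 index=94 devname=elbc-base-ctrl"
ADDRESS_RE = re.compile(
    r"IP=(\d+\.\d+\.\d+\.\d+)->\S+\s+index=(\d+)\s+devname=(\S+)"
)


def addresses_by_interface(ip_list_text):
    """Map each interface name to the list of IP addresses found on it."""
    result = defaultdict(list)
    for line in ip_list_text.splitlines():
        match = ADDRESS_RE.search(line)
        if match:
            result[match.group(3)].append(match.group(1))
    return dict(result)


def has_duplicate_elbc_address(ip_list_text, devname="elbc-base-ctrl"):
    """True when more than one address is bound to the given interface."""
    return len(addresses_by_interface(ip_list_text).get(devname, [])) > 1


# Output copied from the two cases shown in this article.
HEALTHY = (
    "IP=10.147.187.19->10.147.187.19/255.255.255.0 index=94 devname=elbc-base-ctrl\n"
)
AFFECTED = HEALTHY + (
    "IP=10.147.187.3->10.147.187.3/255.255.255.0 index=94 devname=elbc-base-ctrl\n"
)

healthy_flag = has_duplicate_elbc_address(HEALTHY)
affected_flag = has_duplicate_elbc_address(AFFECTED)
```

The healthy sample yields False and the affected sample yields True. A True result indicates the condition described in this article and does not by itself identify which address is stale.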

To clear this condition, reboot the FortiGate blades in both the master and slave chassis.

Summary of commands used

FortiController:

diag sys ha status
get load-balance status

FortiGate:

config vdom
edit elbc-mgmt
diag ip add list | grep 10.147
end
config global
diag test application chlbd 1
