FortiGate 601F in HA Secondary all RJ45 ports are either green or off

JamesB · ‎12-12-2022

Setup | Brand new pair of 601F's running in HA a-p mode. Running on version 7.0.9 but also saw this issue on 7.0.7. Minimal straightforward configuration - single root VDOM, etc.

Issue | Seems to be triggered by HA changes or reboots of the secondary. All of the RJ45 ports on the device are lit up even though nothing is plugged in, or they all show as down even though some should be up.

Details:
We had the pair running in HA using port1 and port2 as the HA heartbeat ports (instead of the native HA port). Everything appeared fine until we upgraded the firmware (7.0.7). The upgrade failed because the secondary lost communication with the primary: once the secondary rebooted with the fresh firmware, port1 and port2 never came back online, so HA wouldn't come up.

On a physical level, all RJ45 port link lights were green on the secondary (even though nothing was connected).
We entered the following command:
# get system interface physical
The CLI shows those same ports as down.

We decided to break the HA, upgrade the primary by itself, and then try to re-establish HA. After a third reboot of the secondary, and after manually putting both devices on the same firmware, the box started functioning as designed for no apparent reason. We checked the logs, but nothing in them identified why the ports had malfunctioned. We decided to leave it and, if it happened again, to get support involved while it was happening.

A few weeks later another upgrade (7.0.9) came along and we pushed it. The exact same thing happened. This time we got support on the line, but they couldn't find anything wrong with the box via the CLI and kept telling us the issue was that the ports we had for HA were not plugged in. They didn't believe us that the ports were plugged in until we got someone onsite to take a picture, which showed all status lights were off.

The secondary box was RMA'd.

New box arrived and we upgraded it to match the other unit (7.0.9). Set up HA and the boxes synced with no issues. It ran for a week with no issues.

Then we came in today and found the HA broken. It isn't just out of sync: the other box isn't even showing up in the GUI or via the CLI (get sys ha status). So we checked, and again all RJ45 ports' link lights were on (this is the new box that just arrived last week from the RMA). We rebooted, and then all ports were off.

Last thing we tried was using the actual HA port. We moved the cable over to the native HA port and configured the CLI to match - same symptoms happen but now we can see both boxes in the ha status tab as the real HA port works.

Has anyone dealt with this before?

(Screenshot: all ports shown in red are actually connected and should be up.)

Update 1:

We discovered that if we reload the box while all ports are disconnected, it fixes itself.
Actual steps:

1. Unplug all ports

2. Reload the device

3. Once the device is completely back up and running, plug in the HA link and wait for the primary to recognize it

4. Plug in the rest of the ports.

This tells me something in the HA control plane between the primary and the secondary is erroring out the secondary box.

Update 2:

I am able to recreate the issue simply by rebooting the secondary box with ports plugged in.

When the system loads the following error shows up:

Initializing firewall...
System is starting...
[__bsearch_index:355] entry 0x934dbc0:0x7fc28b634964 duplicated action=find-dup, vdom=root, node=system.physical-switch.port.name, key=
Cannot append to table, switch_port_append,419
[__bsearch_index:355] entry 0x934dc70:0x7fc28b634964 duplicated action=find-dup, vdom=root, node=system.physical-switch.port.name, key=

The output of 'diagnose debug config-error-log read' is full of failed commands and parse errors.

Some examples:

>>> "next" @ global.system.interface.VLAN250:failed command (error 1)

>>> "set" "monitor" "Internal" "WAN1" @ global.system.ha:value parse error (error -651)
>>> "set" "interface" "VLAN707" @ root.system.dhcp.server.3:value parse error (error -3)
>>> "next" @ root.system.dhcp.server.3:failed command (error 1)

>>> "next" @ root.firewall.address.VLAN16 address:failed command (error 1)

>>> "set" "srcaddr" "VLAN705 address" @ root.firewall.policy.75:value parse error (error -3)
>>> "set" "srcaddr" "VLAN701 address" @ root.firewall.policy.69:value parse error (error -3)
>>> "set" "srcaddr" "VLAN705 address" @ root.firewall.policy.76:value parse error (error -3)

Again if I unplug all cables and reload none of these errors show up and the box functions as designed.
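
For anyone who wants to triage a long config-error-log dump, here is a minimal Python sketch that tallies the entries by error type. The line pattern is inferred only from the sample lines quoted above, so treat it as an assumption; other firmware builds may format the log differently. The function names (parse_error_line, summarize) are made up for this example.

```python
import re
from collections import Counter

# Pattern inferred from the sample lines in this post (an assumption, not a
# documented format): >>> "tok" "tok" @ config.path:message (error N)
LINE_RE = re.compile(
    r'^>>>\s+(?P<cmd>.+?)\s+@\s+(?P<path>.+?):(?P<msg>[^:]+?)'
    r'\s+\(error\s+(?P<code>-?\d+)\)\s*$'
)

def parse_error_line(line):
    """Return a dict for one log line, or None if it does not match."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    return {
        "tokens": re.findall(r'"([^"]*)"', m.group("cmd")),
        "path": m.group("path"),
        "message": m.group("msg"),
        "code": int(m.group("code")),
    }

def summarize(lines):
    """Count parsed entries by (message, error code); skip unparsed lines."""
    counts = Counter()
    for line in lines:
        entry = parse_error_line(line)
        if entry is not None:
            counts[(entry["message"], entry["code"])] += 1
    return counts

SAMPLE = [
    '>>> "next" @ global.system.interface.VLAN250:failed command (error 1)',
    '>>> "set" "monitor" "Internal" "WAN1" @ global.system.ha:value parse error (error -651)',
    '>>> "set" "interface" "VLAN707" @ root.system.dhcp.server.3:value parse error (error -3)',
    '>>> "next" @ root.firewall.address.VLAN16 address:failed command (error 1)',
]

if __name__ == "__main__":
    for (message, code), n in summarize(SAMPLE).most_common():
        print(f"{n:3d}  {message} (error {code})")
```

In the samples above, the parse errors are on lines that reference interfaces (VLAN707, Internal, WAN1) or address objects, which is consistent with the interfaces failing to load first and everything that depends on them failing afterwards.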

Update 3:

Issue has been further isolated to only surfacing when x1-x4 are utilized while the devices are in HA and the ports are in a LAG. The remote device has a separate group for each device so LACP-SLAVE-HA Disabled should not be needed.

We have migrated the links to x5-x8 (no config changes just physical connection changes) and the issue can no longer be reproduced.
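
If you want to check whether your own config has LAGs on the affected ports, something like the following Python sketch could scan a saved copy of the interface config. The sample config text is hypothetical (the LAG names are invented), and the parser only assumes the usual edit / set type aggregate / set member / next layout, so adjust it to what your own 'show system interface' output looks like.

```python
import re

# Ports this thread found to be affected.
AFFECTED_PORTS = {"x1", "x2", "x3", "x4"}

def find_affected_lags(config_text):
    """Map each aggregate interface name to its members among x1-x4."""
    results = {}
    name, is_aggregate, members = None, False, []
    for raw in config_text.splitlines():
        line = raw.strip()
        if line.startswith("edit "):
            # Start of a new interface entry: reset per-entry state.
            name = line[5:].strip().strip('"')
            is_aggregate, members = False, []
        elif line == "set type aggregate":
            is_aggregate = True
        elif line.startswith("set member "):
            members = re.findall(r'"([^"]+)"', line)
        elif line == "next" and name is not None:
            hit = sorted(AFFECTED_PORTS.intersection(members))
            if is_aggregate and hit:
                results[name] = hit
            name = None
    return results

# Hypothetical sample; interface names are made up for illustration.
SAMPLE_CONFIG = '''
config system interface
    edit "LAG-A"
        set type aggregate
        set member "x1" "x2"
    next
    edit "LAG-B"
        set type aggregate
        set member "x5" "x6"
    next
    edit "port1"
        set type physical
    next
end
'''

if __name__ == "__main__":
    for lag, ports in find_affected_lags(SAMPLE_CONFIG).items():
        print(f"{lag}: uses {', '.join(ports)}")
```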

JamesB · ‎01-19-2023

Fortinet was able to figure this out in their lab:

Problem Description:

If traffic is flowing while the device boots, the CPSS can hang, show high CPU usage, and/or prevent the platform from booting properly, because the SerDes is passing traffic it should not be passing.

The workaround is to prevent traffic from going over the links while the device is booting (i.e. unplug the links on x1-x4).

They are looking to have a fix in 7.0.10, which has a planned release date of February 2023.

You can tell if you are hitting the bug by the following:

Watch the console when the device is booting up

System is starting... <<<<<------------------- This will take much longer than normal after which errors like the following will appear

[__bsearch_index:355] entry 0x80c8a40:0x7fc69f9a5964 duplicated action=find-dup, vdom=root, node=system.physical-switch.port.name, key=

Cannot append to table, switch_port_append,419

[__bsearch_index:355] entry 0x80ca670:0x7fc69f9a5964 duplicated action=find-dup, vdom=root, node=system.physical-switch.port.name, key=

Cannot append to table, switch_port_append,419

After that, your link lights will be either all on or all off, and in both cases every link shows as down.
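
If you capture the console output to a file during boot, a rough Python sketch like this can flag the signature. It only looks for the two message types quoted in this thread; the function name and the rule that either message makes the boot "suspect" are my own assumptions, not anything from Fortinet.

```python
import re

# Signatures taken from the console output quoted in this thread.
DUP_RE = re.compile(
    r"\[__bsearch_index:\d+\].*duplicated.*"
    r"node=system\.physical-switch\.port\.name"
)
APPEND_RE = re.compile(r"Cannot append to table, switch_port_append")

def scan_console_log(text):
    """Count the two boot-time error messages in captured console text."""
    duplicates = 0
    append_failures = 0
    for line in text.splitlines():
        if DUP_RE.search(line):
            duplicates += 1
        if APPEND_RE.search(line):
            append_failures += 1
    return {
        "duplicate_entries": duplicates,
        "append_failures": append_failures,
        # Assumption: seeing either message at all is worth investigating.
        "suspect": duplicates > 0 or append_failures > 0,
    }

SAMPLE_BOOT = """Initializing firewall...
System is starting...
[__bsearch_index:355] entry 0x80c8a40:0x7fc69f9a5964 duplicated action=find-dup, vdom=root, node=system.physical-switch.port.name, key=
Cannot append to table, switch_port_append,419
[__bsearch_index:355] entry 0x80ca670:0x7fc69f9a5964 duplicated action=find-dup, vdom=root, node=system.physical-switch.port.name, key=
Cannot append to table, switch_port_append,419
"""

if __name__ == "__main__":
    print(scan_console_log(SAMPLE_BOOT))
```

This does not measure how long "System is starting..." takes, so the slow boot still has to be noticed by watching the console.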

Hope this helps.

gfleming · ‎12-12-2022

Strange behaviour indeed. Have you verified the integrity of the cable being used for the HA connection; perhaps it is bad? Also it's a good idea to use two HA ports to help with HA issues when the main connection is not working properly.

When the issue happens and all the port LEDs are on, what do you see in the output of these debug commands?

diagnose sys ha status
diagnose sys ha checksum
diagnose sys ha heartbeat

Cheers,
Graham

JamesB · ‎12-12-2022

Hey Graham,

So I actually don't think anything is wrong with the direct HA link. I think HA is triggering a bug in the system causing the controller of the ports to error out. (sorry new to FortiNet so I don't know all the components in a FortiGate yet).

We originally had HA configured as Port1 and Port2. As none of the RJ45 ports work HA won't work on those ports. We noticed the MGMT port wasn't having any issues so we figured whatever the problem is the MGMT port is in a separate plane. Moved the HA to the built in HA port (using one of the same cables) and the HA linked but can't sync due to none of the RJ45 or fiber ports working properly.

gfleming · ‎12-12-2022

OK, if you're sure the cable is good then move on to the debug commands I posted above. And by the way, there's nothing stopping you from using the regular ports for HA. In fact it is encouraged, as having just one HA link is not best practice.

Another hint would be to see the output of your current HA config:

show system ha

Cheers,
Graham

toxicshot · ‎12-13-2022

We already had HA running on two regular ports. Which is where we are running into the issue because the regular RJ45 ports are malfunctioning on reboot/upgrade.

We had the pair running in HA utilizing ports port1 and port2 as the ha heartbeat ports (instead of the native HA port).

If we use regular ports it goes like this:

Secondary Reboots > HA breaks > Secondary RJ45 ports are all lit up and not recognized > HA can't form because RJ45 ports are not recognized

If we use the native HA port it goes like this:

Secondary reboots > HA breaks > Secondary RJ45 light up/not recognized > HA forms initially but can't sync due to ports on secondary not being recognized

We've checked the physical cabling, tried different ports for HA, the secondary has already been RMA'd once. As @JamesB said above, if you reload the device as if it was its own system it loads up just fine and will join HA again.

gfleming · ‎12-13-2022

OK you keep repeating your problems and symptoms but we can't help you if you don't provide diagnostic output or config output that has been requested. Not much more to say on this one until you can provide the requested details.

Cheers,
Graham

intersys · ‎01-22-2023

We are using the 601F without HA and ran into the same issue as above. Our device was replaced by RMA.

The same issue occurred on the new device.

intersys · ‎01-22-2023

We were using x1, x3 and x4. We moved the links to x5-x8 and the issue is solved now.
