JamesB
New Contributor II

FortiGate 601F in HA Secondary all RJ45 ports are either green or off

Setup | Brand-new pair of 601Fs running in HA active-passive (a-p) mode. Running version 7.0.9, but we also saw this issue on 7.0.7. Minimal, straightforward configuration: single root VDOM, etc.

 

Issue | Seems to be triggered by HA changes or reboots of the secondary. Either all of the RJ45 ports on the device are lit up even though nothing is plugged in, or they all show as down even though some should be up.

 

Details:
We had the pair running in HA using port1 and port2 as the HA heartbeat ports (instead of the native HA port). Everything appeared fine until we upgraded the firmware (to 7.0.7). The upgrade failed because the secondary lost communication with the primary: once the secondary rebooted with the new firmware, port1 and port2 never came back online, so HA wouldn't come up.

 

 

On a physical level, the link lights on all RJ45 ports were green on the secondary (even though nothing was connected to them).
We entered the following command:
# get system interface physical
The CLI shows those same ports as down.

RJ45portsallon.png

 

We decided to break the HA, upgrade the primary by itself, and then try to re-establish HA. After a third reboot of the secondary, and after manually putting both devices on the same firmware, the box randomly started functioning as designed. We checked the logs, but nothing in them identified why the ports had malfunctioned. We decided to leave it and, if it happened again, get support involved while it was happening.

 

A few weeks later another upgrade (7.0.9) came along and we pushed it. The exact same thing happened. This time we got support on the line, but they weren't able to find anything wrong with the box via the CLI and kept telling us the issue was due to the HA ports not being plugged in. They didn't believe us that the cables were plugged in until we got someone onsite to take a picture showing that all status lights were off.

 

The secondary box was RMA'd.

 

New box arrives. We upgraded it to match the other unit (7.0.9) and set up HA; the boxes synced with no issues and ran for a week with no issues.

 

Then, randomly, we come in today and the HA is broken. It isn't just out of sync: the other box isn't even showing up in the GUI or via the CLI (get sys ha status). So we check, and again all of the RJ45 ports' link lights are on (this is the new box that arrived just last week from the RMA). We reboot, and then all ports are off.

 

The last thing we tried was using the actual HA port. We moved the cable over to the native HA port and configured the CLI to match. The same symptoms occur, but now we can see both boxes in the HA status tab because the native HA port works.

Has anyone dealt with this before?

 

All ports in red are actually connected and should be up

RJ45Portsalloff.png

UsingBuiltinHAport.png

 

 

Update 1:

We discovered that reloading the box while all ports are disconnected fixes the problem.
Actual steps:

1. Unplug all ports.

2. Reload the device.

3. Once the device is completely back up and running, plug in the HA link and wait for the primary to recognize it.

4. Plug in the rest of the ports.

JamesB_1-1670875182060.png

 

This tells me something in the HA control plane between the primary and the secondary is erroring out the secondary box.

 

 

Update 2:

 

I am able to recreate the issue simply by rebooting the secondary box with the ports plugged in.

When the system loads the following error shows up:

Initializing firewall...
System is starting...
[__bsearch_index:355] entry 0x934dbc0:0x7fc28b634964 duplicated action=find-dup, vdom=root, node=system.physical-switch.port.name, key=
Cannot append to table, switch_port_append,419
[__bsearch_index:355] entry 0x934dc70:0x7fc28b634964 duplicated action=find-dup, vdom=root, node=system.physical-switch.port.name, key=

 

'diagnose debug config-error-log read' is full of failed commands and parse errors

 

some examples:

>>> "next" @ global.system.interface.VLAN250:failed command (error 1)

>>> "set" "monitor" "Internal" "WAN1" @ global.system.ha:value parse error (error -651)
>>> "set" "interface" "VLAN707" @ root.system.dhcp.server.3:value parse error (error -3)
>>> "next" @ root.system.dhcp.server.3:failed command (error 1)

>>> "next" @ root.firewall.address.VLAN16 address:failed command (error 1)

>>> "set" "srcaddr" "VLAN705 address" @ root.firewall.policy.75:value parse error (error -3)
>>> "set" "srcaddr" "VLAN701 address" @ root.firewall.policy.69:value parse error (error -3)
>>> "set" "srcaddr" "VLAN705 address" @ root.firewall.policy.76:value parse error (error -3)

 

Again if I unplug all cables and reload none of these errors show up and the box functions as designed.
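
For anyone comparing a good boot against a bad one, the config-error-log entries quoted above can be summarized with a short script. This is only a sketch: the line layout is inferred from the handful of examples in this post, not from any documented format, and the function names (parse_config_errors, count_by_kind) are made up for illustration.

```python
import re
from collections import Counter

# Layout inferred from the examples in this thread (an assumption, not a documented format):
#   >>> "tok" "tok" ... @ <config path>:<error kind> (error <code>)
LINE_RE = re.compile(
    r'^>>>\s+(?P<cmd>.+?)\s+@\s+(?P<path>.+):(?P<kind>[^:(]+?)\s+\(error\s+(?P<code>-?\d+)\)\s*$'
)

# Lines copied from the post above.
SAMPLE = """
>>> "next" @ global.system.interface.VLAN250:failed command (error 1)
>>> "set" "monitor" "Internal" "WAN1" @ global.system.ha:value parse error (error -651)
>>> "set" "interface" "VLAN707" @ root.system.dhcp.server.3:value parse error (error -3)
>>> "next" @ root.firewall.address.VLAN16 address:failed command (error 1)
"""


def parse_config_errors(text):
    """Turn pasted 'diagnose debug config-error-log read' output into a list of dicts."""
    entries = []
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # skip blank lines and anything that does not fit the assumed layout
        entries.append({
            "tokens": re.findall(r'"([^"]*)"', match.group("cmd")),
            "path": match.group("path"),
            "kind": match.group("kind"),
            "code": int(match.group("code")),
        })
    return entries


def count_by_kind(entries):
    """Count entries per error kind, e.g. 'failed command' vs 'value parse error'."""
    return Counter(entry["kind"] for entry in entries)


if __name__ == "__main__":
    parsed = parse_config_errors(SAMPLE)
    for kind, count in count_by_kind(parsed).items():
        print(f"{kind}: {count}")
```

Running the same summary against the log from a cable-unplugged boot should come back empty if the errors really are tied to this issue.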

 

 

 

Update 3:

 

The issue has been further isolated: it only surfaces when x1-x4 are used while the devices are in HA and the ports are in a LAG. The remote device has a separate group for each device, so LACP-SLAVE-HA Disabled should not be needed.

 

We have migrated the links to x5-x8 (no config changes just physical connection changes) and the issue can no longer be reproduced.
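
If you want to check whether your own config has LAG members on the ports we isolated, something like the following can be run against a saved copy of the interface config. Treat it as a rough sketch: the sample config and interface names (LAG-CORE, LAG-DMZ) are invented for illustration, the x1-x4 list comes only from what we observed on the 601F in this thread, and the helper find_suspect_aggregates is not part of any Fortinet tooling.

```python
import re

# Ports this thread found to be affected on the 601F (an observation, not an official list).
SUSPECT_PORTS = {"x1", "x2", "x3", "x4"}

# Hypothetical excerpt in the style of 'show system interface' output.
SAMPLE_CONFIG = """
config system interface
    edit "LAG-CORE"
        set type aggregate
        set member "x1" "x2"
    next
    edit "LAG-DMZ"
        set type aggregate
        set member "x5" "x6"
    next
    edit "port1"
        set type physical
    next
end
"""


def find_suspect_aggregates(config_text, suspect_ports=SUSPECT_PORTS):
    """Return {interface: [suspect members]} for interfaces whose 'set member' line uses suspect ports."""
    suspects = {}
    current = None
    for raw in config_text.splitlines():
        line = raw.strip()
        edit = re.match(r'^edit\s+"?(.+?)"?$', line)
        if edit:
            current = edit.group(1)  # entering a new interface block
        elif line == "next":
            current = None  # leaving the interface block
        elif current and line.startswith("set member "):
            members = re.findall(r'"([^"]+)"', line)
            hits = [m for m in members if m in suspect_ports]
            if hits:
                suspects[current] = hits
    return suspects


if __name__ == "__main__":
    for name, ports in find_suspect_aggregates(SAMPLE_CONFIG).items():
        print(f"{name}: uses {', '.join(ports)}")
```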

1 Solution
JamesB
New Contributor II

Fortinet was able to figure this out in their lab:

 

Problem Description:

 

The CPSS will hang, show high CPU usage, and/or prevent proper bootup of the platform if traffic is flowing, because the SerDes is passing traffic that it should not be passing.

 

The workaround is to prevent traffic from going over the links while the device is booting (i.e., unplug links x1-x4).

 

They are looking to have a fix in 7.0.10, which has a planned release date of February 2023.

 

You can tell if you are hitting the bug by the following:

 

Watch the console when the device is booting up

 

System is starting... <<<<<------------------- This will take much longer than normal after which errors like the following will appear

 

[__bsearch_index:355] entry 0x80c8a40:0x7fc69f9a5964 duplicated action=find-dup, vdom=root, node=system.physical-switch.port.name, key=

 

Cannot append to table, switch_port_append,419

 

[__bsearch_index:355] entry 0x80ca670:0x7fc69f9a5964 duplicated action=find-dup, vdom=root, node=system.physical-switch.port.name, key=

 

Cannot append to table, switch_port_append,419

 

After that, all of your link lights will be either on or off, and in both scenarios the unit shows all links as down.
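
If you capture the serial console to a file during boot, the signature described above can be checked for with a few lines of Python. This is a minimal sketch: the two patterns are taken from the console output in this thread, while the rule that both must appear before calling it a match is my own assumption, and scan_boot_log is just an illustrative name.

```python
import re

# Patterns taken from the console output quoted in this thread.
DUP_RE = re.compile(
    r"\[__bsearch_index:\d+\].*duplicated action=find-dup.*node=system\.physical-switch\.port\.name"
)
APPEND_RE = re.compile(r"Cannot append to table, switch_port_append")

# Console lines copied from the post above.
SAMPLE_LOG = """
Initializing firewall...
System is starting...
[__bsearch_index:355] entry 0x80c8a40:0x7fc69f9a5964 duplicated action=find-dup, vdom=root, node=system.physical-switch.port.name, key=
Cannot append to table, switch_port_append,419
[__bsearch_index:355] entry 0x80ca670:0x7fc69f9a5964 duplicated action=find-dup, vdom=root, node=system.physical-switch.port.name, key=
Cannot append to table, switch_port_append,419
"""


def scan_boot_log(text):
    """Count the two error lines and report whether both kinds are present."""
    lines = text.splitlines()
    duplicates = sum(1 for line in lines if DUP_RE.search(line))
    append_failures = sum(1 for line in lines if APPEND_RE.search(line))
    return {
        "duplicate_port_entries": duplicates,
        "append_failures": append_failures,
        # Assumption: treat the log as matching only when both error types show up.
        "matches_signature": duplicates > 0 and append_failures > 0,
    }


if __name__ == "__main__":
    print(scan_boot_log(SAMPLE_LOG))
```

The script cannot see the long pause at "System is starting...", so it only covers the error-message half of the symptom.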

 

Hope this helps.

 


8 Replies
gfleming
Staff

Strange behaviour indeed. Have you verified the integrity of the cable being used for the HA connection; perhaps it is bad? Also it's a good idea to use two HA ports to help with HA issues when the main connection is not working properly.

 

When the issue happens and all the port LEDs are on, what do you see in the output of the following debug commands?

 

diagnose sys ha status
diagnose sys ha checksum
diagnose sys ha heartbeat


Cheers,
Graham
JamesB
New Contributor II

Hey Graham,

 

So I actually don't think anything is wrong with the direct HA link. I think HA is triggering a bug in the system that causes the controller of the ports to error out. (Sorry, I'm new to Fortinet, so I don't know all the components in a FortiGate yet.)

We originally had HA configured on port1 and port2. Since none of the RJ45 ports work, HA won't work on those ports. We noticed the MGMT port wasn't having any issues, so we figured that whatever the problem is, the MGMT port is in a separate plane. We moved HA to the built-in HA port (using one of the same cables) and HA linked, but it can't sync because none of the RJ45 or fiber ports are working properly.

 

gfleming

OK, if you're sure the cable is good, then move on to the debug commands I posted above. And by the way, there's nothing stopping you from using the regular ports for HA. In fact it is encouraged, as having just one HA link is not best practice.

 

Another hint would be to see the output of your current HA config:

 

show system ha

Cheers,
Graham
toxicshot

We already had HA running on two regular ports, which is where we ran into the issue: the regular RJ45 ports malfunction on reboot/upgrade.

 


From the original post: "We had the pair running in HA utilizing ports port1 and port2 as the ha heartbeat ports (instead of the native HA port)."

If we use regular ports it goes like this:

Secondary Reboots > HA breaks > Secondary RJ45 ports are all lit up and not recognized > HA can't form because RJ45 ports are not recognized

 

If we use the native HA port it goes like this:

Secondary reboots > HA breaks > Secondary RJ45 light up/not recognized > HA forms initially but can't sync due to ports on secondary not being recognized

 

We've checked the physical cabling, tried different ports for HA, the secondary has already been RMA'd once. As @JamesB said above, if you reload the device as if it was its own system it loads up just fine and will join HA again.

 

 

 


 

gfleming

OK you keep repeating your problems and symptoms but we can't help you if you don't provide diagnostic output or config output that has been requested. Not much more to say on this one until you can provide the requested details.

Cheers,
Graham

intersys

We are using the 601F without HA and hit the same issue as described above. Our device was replaced via RMA.

The same issue occurred on the new device.

intersys

We were using x1, x3, and x4. We moved the links to x5-x8 and the issue is now solved.
