FortiGate 601F in HA Secondary all RJ45 ports are either green or off
Setup | Brand new pair of 601Fs running in HA a-p mode. Running version 7.0.9, but we also saw this issue on 7.0.7. Minimal, straightforward configuration: single root VDOM, etc.
Issue | Seems to be triggered by HA changes or reboots of the secondary. Either all of the RJ45 ports on the device are lit up even though nothing is plugged in, or they all show as down even though some should be up.
Details: We had the pair running in HA using port1 and port2 as the HA heartbeat ports (instead of the native HA port). Everything appeared fine until we upgraded the firmware (7.0.7). The upgrade failed because the secondary lost communication with the primary: once the secondary rebooted with the new firmware, port1 and port2 never came back online, so HA wouldn't come up.
On a physical level, all RJ45 port link lights were green on the secondary (even though they were not connected). We entered the following command: #'get system interface physical'. The CLI showed those same ports as down.
We decided to break the HA, upgrade the primary by itself, then try to re-establish HA. After a third reboot of the secondary and manually putting both devices on the same firmware, the box randomly started functioning as designed. We checked the logs, but there was no way to identify why the ports had malfunctioned, so we decided to leave it; if it happened again, we would get support involved while it was happening.
A few weeks later another upgrade (7.0.9) came along and we pushed it. The exact same thing happened. This time we got support on the line, but they weren't able to find anything wrong with the box via the CLI and kept telling us the issue was due to the HA ports not being plugged in. They didn't believe us that they were plugged in until we got someone onsite to take a picture, which showed all status lights were off.
The secondary box was RMA'd.
New box arrives - upgraded it to match the other unit (7.0.9). Set up HA, boxes synced with no issues. Runs for a week with no issues.
Then randomly we come in today and the HA is broken. It isn't just out of sync; the other box isn't even showing up in the GUI or via the CLI (get sys ha status). So we check, and again all RJ45 ports' link lights are on (this is the new box that just arrived last week from the RMA). We reboot and then all ports are off.
The last thing we tried was using the actual HA port. We moved the cable over to the native HA port and configured the CLI to match. The same symptoms happen, but now we can see both boxes in the HA status tab because the real HA port works.
Has anyone dealt with this before?
All ports shown in red are actually connected and should be up.
We discovered that if we reload the box while all ports are disconnected, it fixes itself. Actual steps:
1. Unplug all ports
2. Reload the device
3. Once the device is completely back up and running, plug in the HA link and wait for the primary to recognize it
4. Plug in the rest of the ports.
This tells me something in the HA control plane between the primary and the secondary is erroring out the secondary box.
I am able to recreate the issue simply by rebooting the secondary box with the ports plugged in.
When the system loads the following error shows up:
Initializing firewall...
System is starting...
[__bsearch_index:355] entry 0x934dbc0:0x7fc28b634964 duplicated
action=find-dup, vdom=root, node=system.physical-switch.port.name, key=
Cannot append to table, switch_port_append,419
[__bsearch_index:355] entry 0x934dc70:0x7fc28b634964 duplicated
action=find-dup, vdom=root, node=system.physical-switch.port.name, key=
'diagnose debug config-error-log read' is full of failed commands and parse errors
Again, if I unplug all cables and reload, none of these errors show up and the box functions as designed.
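If you capture the console output during boot (for example through a serial session log), a small script can make it easier to compare a "bad" boot against a "clean" one. The following is only a rough sketch: the function name summarize_duplicates and the regular expression are my own, written against the two error lines quoted above, and other FortiOS builds may print these messages differently.

```python
import re
from collections import Counter

# Pattern written against the console lines quoted in this thread, e.g.
# "[__bsearch_index:355] entry 0x...:0x... duplicated action=find-dup, vdom=root, node=..., key="
# Assumption: the message layout is the same on other boots of the same build.
DUP_RE = re.compile(
    r"\[__bsearch_index:\d+\] entry \S+ duplicated\s+"
    r"action=(?P<action>[\w-]+), vdom=(?P<vdom>[^,]+), node=(?P<node>[\w.\-]+)"
)


def summarize_duplicates(console_text):
    """Count duplicate-entry errors per (vdom, node) in captured console text."""
    return Counter(
        (m.group("vdom"), m.group("node")) for m in DUP_RE.finditer(console_text)
    )


# Sample taken from the boot output posted above.
SAMPLE = (
    "Initializing firewall... System is starting... "
    "[__bsearch_index:355] entry 0x934dbc0:0x7fc28b634964 duplicated "
    "action=find-dup, vdom=root, node=system.physical-switch.port.name, key= "
    "Cannot append to table, switch_port_append,419 "
    "[__bsearch_index:355] entry 0x934dc70:0x7fc28b634964 duplicated "
    "action=find-dup, vdom=root, node=system.physical-switch.port.name, key="
)

if __name__ == "__main__":
    for (vdom, node), count in summarize_duplicates(SAMPLE).items():
        print(f"{count} duplicate entries: vdom={vdom} node={node}")
    print("append failures:", SAMPLE.count("Cannot append to table"))
```

Running it over a log from a boot with all cables unplugged should, if the observation above holds, report no duplicate entries at all.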
The issue has been further isolated: it only surfaces when x1-x4 are used in a LAG while the devices are in HA. The remote device has a separate group for each device, so LACP-SLAVE-HA Disabled should not be needed.
We have migrated the links to x5-x8 (no config changes just physical connection changes) and the issue can no longer be reproduced.
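For anyone who wants to check whether their own pair is exposed to the same combination, here is a minimal sketch that scans a saved "config system interface" section and lists aggregate interfaces with members in x1-x4. The port range comes only from the observation in this thread, not from any Fortinet statement, and the interface names in the sample (lag-core, lag-dmz) are made up for illustration.

```python
import re

# Assumption based solely on this thread: x1-x4 were the ports that triggered the issue.
SUSPECT_PORTS = {"x1", "x2", "x3", "x4"}


def aggregates_using_suspect_ports(config_text):
    """Return {aggregate_name: sorted suspect members} from an interface config dump."""
    hits = {}
    name, is_aggregate, members = None, False, []
    for raw in config_text.splitlines():
        line = raw.strip()
        if line.startswith("edit "):
            # Start of a new interface entry; reset per-entry state.
            name = line[5:].strip().strip('"')
            is_aggregate, members = False, []
        elif line == "set type aggregate":
            is_aggregate = True
        elif line.startswith("set member "):
            members = re.findall(r'"([^"]+)"', line)
        elif line == "next" and name is not None:
            # End of the entry; record it if it is a LAG using a suspect port.
            suspect = sorted(SUSPECT_PORTS.intersection(members))
            if is_aggregate and suspect:
                hits[name] = suspect
            name = None
    return hits


# Hypothetical config excerpt; the interface names are illustrative only.
SAMPLE_CONFIG = """
config system interface
    edit "lag-core"
        set type aggregate
        set member "x1" "x2"
    next
    edit "lag-dmz"
        set type aggregate
        set member "x5" "x6"
    next
    edit "port1"
        set type physical
    next
end
"""

if __name__ == "__main__":
    print(aggregates_using_suspect_ports(SAMPLE_CONFIG))
```

With the sample above, only lag-core is flagged, which matches the situation before the links were moved to x5-x8.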
Strange behaviour indeed. Have you verified the integrity of the cable being used for the HA connection; perhaps it is bad? Also it's a good idea to use two HA ports to help with HA issues when the main connection is not working properly.
When the issue happens and all the port LEDs are on, what kind of things do you see in debug?
diagnose sys ha status
diagnose sys ha checksum
diagnose sys ha heartbeat
So I actually don't think anything is wrong with the direct HA link. I think HA is triggering a bug in the system that causes the controller of the ports to error out. (Sorry, I'm new to Fortinet, so I don't know all the components in a FortiGate yet.)
We originally had HA configured on port1 and port2. As none of the RJ45 ports work, HA won't work on those ports. We noticed the MGMT port wasn't having any issues, so we figured that whatever the problem is, the MGMT port is in a separate plane. We moved the HA to the built-in HA port (using one of the same cables) and the HA linked, but it can't sync because none of the RJ45 or fiber ports are working properly.
OK, if you're sure the cable is good, then move on to the debug commands I posted above. And by the way, there's nothing stopping you from using the regular ports for HA. In fact it is encouraged, as having just one HA link is not best practice.
Another hint would be to see the output of your current HA config:
We already had HA running on two regular ports, which is where we are running into the issue, because the regular RJ45 ports malfunction on reboot/upgrade.
We had the pair running in HA utilizing ports port1 and port2 as the ha heartbeat ports (instead of the native HA port).
If we use regular ports it goes like this:
Secondary Reboots > HA breaks > Secondary RJ45 ports are all lit up and not recognized > HA can't form because RJ45 ports are not recognized
If we use the native HA port it goes like this:
Secondary reboots > HA breaks > Secondary RJ45 light up/not recognized > HA forms initially but can't sync due to ports on secondary not being recognized
We've checked the physical cabling and tried different ports for HA, and the secondary has already been RMA'd once. As @JamesB said above, if you reload the device as if it were its own system, it loads up just fine and will join HA again.
OK, you keep repeating your problems and symptoms, but we can't help you if you don't provide the diagnostic output or config output that has been requested. Not much more to say on this one until you can provide the requested details.