Jirka1
Explorer II
March 7, 2026
Question

FortiGate 200F HA cluster instability after upgrade to FortiOS 7.6.6

  • 8 replies
  • 2308 views

 

Hello,

I would like to share an issue we are currently experiencing with a FortiGate HA cluster after upgrading to FortiOS 7.6.6. I'm also interested to know if anyone else has encountered similar behavior.

Environment:

  • 2× FortiGate 200F

  • HA cluster originally running in A-A mode

  • 550 firewall policies

  • 40k sessions/s during peak

  • UTM enabled (AV, IPS, Web Filtering, WAF)

  • Full SSL inspection

  • 500 users

  • 20 site-to-site IPsec IKEv2 tunnels

  • 50 IPsec dial-up users (IKEv2 + SAML via Entra ID + MFA)

  • FortiClient EMS 7.4.5

  • FortiAnalyzer 7.6.6

  • 3 Gbit internet connectivity

Before the upgrade:
The cluster was running on FortiOS 7.4.11 and had been completely stable for a long time.

Upgrade timeline:

  • Upgraded from 7.4.11 to 7.6.6 on Saturday (21 Feb 2026).

  • Everything initially appeared to be working correctly.

  • On Monday around 11:00, when normal traffic load started, the first cluster failure occurred.

Observed symptoms:
The failures appear to start with HA communication issues (HA ports dropping). Shortly after that the cluster becomes unstable and crashes.

Crash logs repeatedly show errors such as:

EXT2-fs (sda1): previous I/O error to superblock detected
application hasync signal 11 (Segmentation fault) received

After the first crash we performed extensive troubleshooting including:

  • full TAC diagnostics

  • formatting both units

  • reinstalling firmware via TFTP

  • restoring configuration

Despite these steps the problem persisted.

RMA:

Fortinet approved an RMA and we replaced one of the units. The replacement device was upgraded to 7.6.6, added back to the cluster, and synchronization completed successfully.

However, the cluster failed again the next morning under load.

 

Additional testing:
To continue testing we:

  • reintroduced the cluster in A-A mode

  • reduced the number of WAD processes to 2

  • installed IPS engine version 7.1176

The cluster still crashed again later.

Current situation:
At the moment the only way to keep the environment stable is to break the HA cluster and run a single standalone unit.

Interestingly, the system behaves normally when traffic load is low (for example during evenings or weekends). The crashes consistently appear during business hours when the traffic load increases.


Questions:

  1. Has anyone experienced similar HA instability on FortiOS 7.6.6 with 200F/201F devices?

  2. Has anyone seen hasync segmentation faults?

We have been working with TAC on this issue for almost two weeks, but so far there has been no clear conclusion and we are mostly being asked to repeatedly provide additional debug outputs.


Thanks.
Jirka

8 replies

BillH_FTNT
Staff
March 8, 2026

Hi @Jirka1 

Could you please share the logs with me? I would like to replicate the issue in my lab.

It would be a great help if you could provide the following:

  • The crash log from the console (also known as comlog or console log)
  • The output of dia debug crashlog read
  • TAC report
  • System and event logs (before, during, and after the issue)
  • The full configuration, or if that is not possible, please share the HA configuration
  • The ticket number, if you have one

My name is Bill, and I am from Fortinet. If possible, please send the logs/configuration to my email: bhoang@fortinet.com.

Thank you.

 

Bill

Jirka1
Explorer II
March 8, 2026

Hi Bill,

you already replied to me on Reddit. The ticket number is #11623283.

 
Jirka
 
Jirka1
Explorer II
March 12, 2026

Update / additional findings:

After further testing we discovered that the crashes may not be related to HA.

The unit crashes even when running as a standalone device under normal production load.

We captured the console output immediately after the crash and the kernel log repeatedly shows memory allocation failures affecting multiple FortiGate processes.

Examples from the console crash output:

page allocation failure: order:1, mode:0x20

Processes involved:
- wad
- forticron
- miglogd
- cw_acd / cw_acd_helper
- lldp-manager
- kswapd0

Many of the traces originate from network buffer allocation:

__alloc_skb
dev_alloc_skb
np6xlite_hif_xmit [filter4]
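
For readers unfamiliar with the kernel message quoted above: in Linux, `order:N` means the allocator was asked for 2^N physically contiguous pages. The following is a minimal Python sketch that pulls those fields out of a console log line. The 4 KiB page size and the function name are assumptions for illustration; the console output in this thread does not state the page size.

```python
import re

# Assumption: 4 KiB pages. The log excerpt in this thread does not confirm this.
PAGE_SIZE_KIB = 4

# Matches the kernel message format quoted in the post above.
PATTERN = re.compile(r"page allocation failure: order:(\d+), mode:(0x[0-9a-fA-F]+)")

def parse_alloc_failure(line):
    """Return (order, contiguous_pages, size_kib, mode) or None if the line does not match."""
    m = PATTERN.search(line)
    if m is None:
        return None
    order = int(m.group(1))
    pages = 2 ** order  # an order-N request needs 2**N contiguous pages
    return order, pages, pages * PAGE_SIZE_KIB, int(m.group(2), 16)

print(parse_alloc_failure("page allocation failure: order:1, mode:0x20"))
```

Under that assumption, an order:1 failure means the kernel could not find even two adjacent free pages, which would fit fragmentation or exhaustion rather than one oversized request.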


The system eventually enters heavy memory pressure and starts failing page allocations across multiple processes.

Mem-Info snapshot at the time of crash shows extremely low free memory in the Normal zone:

Normal free: ~11 MB
min: ~7 MB
low: ~9 MB

After this state the system becomes unstable and crashes.

The failures consistently appear when the traffic load increases (around 30-40k sessions/s peak). Under low traffic conditions the device can run for days without problems.

This seems to indicate that the issue is triggered by memory pressure related to packet buffer allocation.
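
The Mem-Info figures above can be read against the kernel's zone watermarks: when free memory falls below `low`, background reclaim (kswapd) is woken, and below `min` ordinary allocations have to reclaim memory themselves. A rough Python sketch using the approximate values from this post; the function name and the state labels are illustrative only.

```python
def zone_headroom(free_mb, min_mb, low_mb):
    """Classify a zone's free memory against its min/low watermarks (values in MB)."""
    if free_mb <= min_mb:
        state = "at or below min"
    elif free_mb <= low_mb:
        state = "between min and low"
    else:
        state = "above low"
    # Headroom above each watermark; negative means the watermark is already breached.
    return state, free_mb - low_mb, free_mb - min_mb

# Approximate figures quoted from the Mem-Info snapshot in this post.
state, above_low, above_min = zone_headroom(free_mb=11, min_mb=7, low_mb=9)
print(state, above_low, above_min)
```

With the quoted values the Normal zone sits only about 2 MB above `low`, so a short burst of buffer allocations could plausibly push it through both watermarks.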


Jirka

BillH_FTNT
Staff
March 12, 2026

Hi Jirka1

 

Our Engineering team is currently investigating your issue. We will provide updates through the support ticket, or I will share updates here as soon as new information becomes available. Thank you.
Bill
Kangming
Staff
March 12, 2026

Hi Jirka1,

I don't think the FortiGate 200F, with 8 CPUs and 8 GB of RAM, has the performance to handle the following:

  • UTM enabled (AV, IPS, Web Filtering, WAF)

  • Full SSL inspection

  • 3 Gbit internet connectivity

>>The cluster was running on FortiOS 7.4.11 and had been completely stable for a long time.

>>7.6.6  On Monday around 11:00, when normal traffic load started, the first cluster failure occurred.

 

Compared to 7.4.11GA, has the traffic increased significantly?

 

Is there a historical record of this traffic that compares 7.4 and 7.6?

 

Even with 7.2/7.4, the FGT200F in our QA lab cannot handle 3 Gbit/s of UTM traffic with full SSL inspection.

We need more traffic/memory related data to confirm the memory issue.

The following information will be very helpful. For example, run the commands once when memory usage first reaches 60% and again when it reaches 70%. Comparing the outputs from these runs makes it much easier to see where memory is being consumed.

 

# diagnose debug report
# execute tac report

# get system performance status

# diagnose sys session full-stat

# diagnose hardware sysinfo memory
# diagnose hardware sysinfo slab

# diagnose sys top 1 50 3

# diagnose sys top-mem 50

# diagnose sys profile report

# diagnose netlink interface packet-rate

# fnsysctl df -k
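
The "collect once at 60%, again at 70%" approach described above amounts to triggering a collection each time memory usage crosses a threshold upward. A minimal Python sketch of just that trigger logic follows. The thresholds come from the example above; the function name and the sample readings are hypothetical, and actually running the commands on the device is deliberately left out.

```python
def crossed_thresholds(previous_pct, current_pct, thresholds=(60, 70)):
    """Return the thresholds crossed upward between two consecutive memory readings."""
    return [t for t in thresholds if previous_pct < t <= current_pct]

# Hypothetical memory-usage readings (percent), polled at a fixed interval.
readings = [55.0, 58.5, 61.2, 64.0, 71.3, 69.0, 72.0]

events = []
for prev, cur in zip(readings, readings[1:]):
    for t in crossed_thresholds(prev, cur):
        # This is the point where the diagnostic commands would be collected.
        events.append((cur, t))

print(events)
```

Triggering on upward crossings rather than on every reading above a threshold keeps the number of collected snapshots small while still capturing each climb.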


There is an even better way to collect memory data: an autoscript can run continuously and collect memory information, ensuring that no memory changes are missed.


If we can run the following automated script, we will be able to find the root cause of the memory exhaustion immediately.

|--Monitoring_script_for_memory_entered_conserve_mode_2025_12_19_memory_autoscript.zip

 

Jirka1
Explorer II
March 12, 2026

Hi @Kangming,

let me clarify two things:

- 3 Gbit internet connectivity does not mean 3 Gbit of traffic. The average daily traffic is around 300 Mbit/s. Please see the attached graph from our monitoring.

WAN traffic:

fgt_wan.png


Performance (CPU, Memory, Sessions):

fgt-perfo.png

- No, there has been no increase in traffic volume or change in the type of traffic. Everything has remained constant.


I have honestly run out of patience with this issue and have rolled everything back to FortiOS 7.4.11. I will monitor the system and see how it behaves.

Regarding the script you shared — thank you, I appreciate it.

It is just a bit unfortunate that during the three weeks of working with TAC we were never asked to run this type of diagnostics. So far we have mostly been sending and explaining the same things over and over :(


Jirka

Kangming
Staff
March 12, 2026

Hi @Jirka1

>>>- 3 Gbit internet connectivity does not mean 3 Gbit of traffic. The average daily traffic is around 300 Mbit/s. Please see the attached graph from our monitoring.

 

Thank you for your clarification. This information is crucial and prevents any misunderstanding.

 

The accurate peak traffic is 358 Mbit/s.

 

This is the significance of collecting memory/performance commands during periods of high memory usage.

 

### get system performance status

 

|-- primary-tac-report.txt

Memory: 8186784k total, 5251432k used (64.1%), 2430104k free (29.7%), 505248k freeable (6.2%)

 

|-- secondary-tac-report.txt

Memory: 8170400k total, 4479124k used (54.8%), 3184172k free (39.0%), 507104k freeable (6.2%)
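
To compare lines like the two above across many snapshots, the `Memory:` line can be parsed mechanically. A small Python sketch, assuming the line format is exactly as quoted in these two excerpts; the function and key names are made up for illustration.

```python
import re

# Pattern for the "Memory:" line as it appears in the two TAC report excerpts above.
MEM_RE = re.compile(
    r"Memory:\s*(\d+)k total,\s*(\d+)k used \([\d.]+%\),"
    r"\s*(\d+)k free \([\d.]+%\),\s*(\d+)k freeable \([\d.]+%\)"
)

def parse_memory_line(line):
    """Return total/used/free/freeable (in the report's 'k' units) plus recomputed percentages."""
    m = MEM_RE.search(line)
    if m is None:
        raise ValueError("not a recognised Memory: line")
    total, used, free, freeable = (int(g) for g in m.groups())
    return {
        "total_k": total,
        "used_k": used,
        "free_k": free,
        "freeable_k": freeable,
        "used_pct": round(100 * used / total, 1),
        "free_pct": round(100 * free / total, 1),
    }

primary = parse_memory_line(
    "Memory: 8186784k total, 5251432k used (64.1%), 2430104k free (29.7%), 505248k freeable (6.2%)"
)
secondary = parse_memory_line(
    "Memory: 8170400k total, 4479124k used (54.8%), 3184172k free (39.0%), 507104k freeable (6.2%)"
)

# Difference in used memory between the two units, in the report's 'k' units.
print(primary["used_pct"], secondary["used_pct"], primary["used_k"] - secondary["used_k"])
```

Recomputing the percentages from the raw counters is a cheap sanity check that a pasted line was not truncated or mangled.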

 

This is the information we see in the files uploaded to the backend. The traffic and memory usage in them are not high, so we cannot determine exactly what happened at the time memory usage was high.

 

It appears that the memory was used up in a very short time, without even entering memory conservation mode.

 

We cannot confirm which process suddenly consumed all the memory. 

 

|-- Performance (CPU, Memory, Sessions):

>>>No, there has been no increase in traffic volume or change in the type of traffic. Everything has remained constant.

 

These 2 pieces of information are also very important.

 

It would be even better if we could get accurate traffic data for a single day (only one day is needed, but with exact timestamps), so we could clearly see when the daily traffic peak begins, what the exact peak traffic volume is, the corresponding memory and CPU consumption, and the concurrent connections (CC) and connections per second (CPS).

 

The image shows that memory usage consistently stays above 50%, reaching 60% to 70% during peak periods. This is the performance on V7.4.11GA. 

 

On average, CPU usage is between 40% and 50%.

 

The overall traffic is approximately between 50 Mbit/s and 350 Mbit/s.

 

One predictable behavior is that with UTM enabled, memory consumption in version 7.6 will be approximately 5% to 7% higher than in version 7.4, due to the addition of new features and changes to the IPS engine.

 

Therefore, baseline memory usage is expected to increase from 50% to around 55%. The maximum expected memory consumption is approximately 75%, assuming everything goes smoothly.
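
The arithmetic behind these estimates can be written out as a short Python sketch. It treats the "5% to 7% higher" figure as percentage points added to the 7.4.11 readings, which is an assumption, though one consistent with the 50% to 55% example above. The function name is illustrative.

```python
def projected_range(baseline_pct, overhead_points=(5, 7)):
    """Add the stated 7.6 overhead (in percentage points) to a 7.4.11 memory reading."""
    low, high = overhead_points
    return baseline_pct + low, baseline_pct + high

# Readings taken from the 7.4.11 graph as described in this reply.
for label, pct in (("baseline", 50), ("peak, lower bound", 60), ("peak, upper bound", 70)):
    print(label, projected_range(pct))
```

On those numbers the 70% peak would land between 75% and 77% on 7.6, which lines up with the "approximately 75%" maximum mentioned above.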

 

It is obvious that one of the processes is malfunctioning and consuming excessive memory. At this time, we are unable to identify the specific process or the root cause from the logs and information provided. This is the clue we need to find so that we can assign the problem to the appropriate development team, such as the IPS team or the AV team. We need to know precisely whether this is expected behavior or whether a certain process has a bug causing abnormal memory consumption.

 

Of course, if the information from the customer side does not yield specific clues, we will try to reproduce the issue ourselves. We will run a comparative lab test between builds 7.4 and 7.6, using the same configuration file and simulating IMXID traffic as closely as possible, to see whether the problem can be reproduced in our lab environment.

Thank you for your understanding. We will handle it ASAP.

 
 
 
 
dalmiroy2k
Visitor III
March 12, 2026

I know it's not the solution you are looking for, but for now I would roll back to 7.4.11 immediately, unless there is a 7.6.x feature you really need.

I agree with Kangming that the FG-200F may not be enough for those settings, at least not on 7.6.

johnlloyd_13
Explorer III
March 19, 2026

hi,

seems scary to upgrade to 7.6.6.

OP's HA seemed stable on 7.4 until it was upgraded to 7.6.6.

it's frustrating how poorly TAC handles these kinds of critical bugs or issues.

 

it32
New Member
May 4, 2026

@Jirka1 Do you have any updates about your problem?

Did you roll back to 7.4.11?

Jirka1
Explorer II
May 4, 2026

Hey ​@it32 
unfortunately not.
And yes, we had to go back to 7.4.11, where everything works perfectly.
The last information I have from TAC is about 14 days old: they are trying to reproduce our problem.

Jirka

it32
New Member
May 4, 2026

@Jirka1 Thank you very much for your prompt reply. We have various models from 100F to 600E and we are very sceptical about which version to move to. Currently we are on 7.2.11 and 7.2.13 with no major problems. We upgraded a VM to 7.4.11 and it broke IKEv2 with FortiClient 2FA.

I hope they will find a solution.. 

Last question: any particular reason for going to 7.6.6?

 

Jirka1
Explorer II
May 4, 2026

​@it32 
The main reason for testing the move to 7.6.6 was the native Microsoft Purview connector for TLP labels, the integration with DLP sensors, and a detailed AI sensor.

Jirka