johnwcalder
New Contributor

FortiWeb-1000E 6.4.1 secondary HA unit loses network connection intermittently

After upgrading from a previous firmware version to 6.4.1, the secondary unit in the FortiWeb HA cluster becomes intermittently unavailable. If I block all ports except the management port, I have no trouble logging in, but obviously that means no HA. Could some sort of bug or misconfiguration be causing the secondary unit to reboot or reset its network services? Is there a way to check physical uptime?

johnwcalder
New Contributor

After inspecting some logs on the FortiAnalyzer, I found that the unit seems to join the HA cluster for 4 minutes, then spends about 8-10 minutes offline before joining again.
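To check whether the join/leave cycle is really periodic, the timestamps can be tabulated with a small script. This is only a rough sketch: it assumes the join and leave events have already been copied out of the logs by hand into (timestamp, event) pairs, and the function name `flap_durations` and the "join"/"leave" labels are made up for illustration, not a FortiAnalyzer export format.

```python
from datetime import datetime


def flap_durations(events):
    """Return (online_minutes, offline_minutes) for an HA member.

    `events` is a time-ordered list of (ISO timestamp, kind) tuples, where
    kind is "join" or "leave". Both the layout and the labels are
    assumptions for this sketch.
    """
    online, offline = [], []
    prev_time, prev_kind = None, None
    for stamp, kind in events:
        if kind == prev_kind:
            continue  # ignore repeated events of the same kind
        current = datetime.fromisoformat(stamp)
        if prev_time is not None:
            minutes = (current - prev_time).total_seconds() / 60
            # Time after a "join" counts as online, after a "leave" as offline.
            (online if prev_kind == "join" else offline).append(minutes)
        prev_time, prev_kind = current, kind
    return online, offline
```

Feeding it the events from a few hours of logs shows at a glance whether the online and offline periods stay constant, which would point to a repeating reload rather than random link drops.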

ddsouza_FTNT
Staff

@johnwcalder, I can think of two possible reasons.

1) HA sync problem: Check the status of the secondary node. If it is stuck at 'INIT', there is a problem with HA sync. When this happens, by design, the master keeps pushing the configuration to the slave until the config is synced (many other files are synced too, for example the AV DB and signature DB, but the config is most often the problem). The slave keeps reloading the configuration file, so the event logs on the primary node show the secondary node leaving HA and then rejoining. You lose access to the secondary unit while it is reloading.

 

Please follow the steps mentioned below to narrow down the problem.

1.a - Check whether System resource usage is high on member units (primary and secondary).

get system performance
diag sys top

 

1.b - Check whether the hasync daemon on any member unit keeps crashing. You can run the following commands to verify. If the PID of the hasync process keeps changing, there is a problem with it (even though no crash log is generated).

diagnose debug crashlog show
fn ps

(Log in to the unit with the default 'admin' account, run fn ps several times, and see whether the PID of the hasync daemon keeps changing.)

If it does, download the system debug file (GUI->System->Maintenance->Debug->Debug Log) and contact Tech support.
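If you save the output of each fn ps run, the PID check in 1.b can be done with a short script instead of by eye. This is a minimal sketch that assumes each process line starts with the numeric PID and contains the text "hasync" somewhere in the command column; the real fn ps layout may differ, and the function names here are made up for illustration.

```python
def hasync_pids(ps_output, name="hasync"):
    """Collect the PIDs of all lines in one ps snapshot that mention `name`.

    Assumes the PID is the first whitespace-separated field of the line.
    """
    pids = set()
    for line in ps_output.splitlines():
        fields = line.split()
        if fields and fields[0].isdigit() and name in line:
            pids.add(int(fields[0]))
    return pids


def pid_changed(snapshots, name="hasync"):
    """Return True if the PID set for `name` differs between any two
    consecutive snapshots, which suggests the daemon was restarted."""
    pid_sets = [hasync_pids(snapshot, name) for snapshot in snapshots]
    return any(a != b for a, b in zip(pid_sets, pid_sets[1:]))
```

Pass it the saved outputs in the order they were captured, for example pid_changed([first_run, second_run, third_run]). A True result only says the PID changed between runs; it does not say why.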

 

1.c - If you don't see any of the problems mentioned in 1.a and 1.b, then some part of the config may not be getting synced. See whether running the following command on the master helps.

execute ha synchronize all

If the slave device is still not in sync, follow the steps below on both units, then compare the HA config files (ha_config_1.tgz and ha_config_2.tgz) to see which part doesn't get synced, or contact Tech support with the logs.


Execute the following commands on master(NodeA) SSH1:
fn ps
get system status
get system performance
exe ha md5sum
diagnose debug enable
diagnose debug console enable
diagnose debug application confd-hamsg 6
diagnose system ha status
diagnose system ha confd_status
diagnose system ha sync-config get-status
fn ps

Execute the following commands on slave(NodeB) SSH1 :
fn ps
get system status
get system performance
exe ha md5sum
diagnose debug enable
diagnose debug application confd-hamsg 6
diagnose debug console enable
diagnose system ha status
diagnose system ha confd_status
diagnose system ha sync-config get-status
fn ps


Execute the following command on master(NodeA) SSH2:
diag network sniffer <heartbeat_interface> "" 6 0 a


Execute the following command on slave(NodeB) SSH2:
diag network sniffer <heartbeat_interface> "" 6 0 a


Make a change in the configuration on the master, for example, admin idle timeout:

config system global
set admintimeout 360 <<< change the admin timeout to 360 minutes or any other value.
end


Execute the following instructions on master SSH1:

diagnose system ha file-stat
(Wait 1 minute)
diagnose system ha status
diagnose system ha confd_status
diagnose system ha backup-config 1
diagnose system ha backup-config 2
diagnose debug application confd-hamsg 0


Execute the following instructions on slave SSH1:

diagnose system ha file-stat
(Wait 1 minute)
diagnose system ha status
diagnose system ha confd_status
diagnose debug application confd-hamsg 0


Execute the following commands on master and slave SSH1:
diag sys top -> then press button 1
diag sys perf -> then press button K
fn netstat -antl
fn ps

Download the ha_config files from the master unit.

a. Enable the file upload option in the CLI.
config system settings
set enable-file-upload enable
end

b. Download ha_config_1.tgz and ha_config_2.tgz from the GUI, then unzip and compare them:

GUI->System->Maintenance->Backup&Restore->GUI File Download/Upload

Back up the entire configuration from both units:
GUI->System->Maintenance->Backup&Restore->Local Backup->Backup entire configuration

Export Debug log files from both units
GUI->System->Maintenance->Debug->Debug Log
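For the step in 1.c that compares ha_config_1.tgz and ha_config_2.tgz, a script can list which files differ without unpacking anything by hand. This is a hedged sketch: it assumes both files are ordinary gzip-compressed tar archives (as the .tgz extension suggests) and that matching files have the same path inside each archive. The function names are made up for illustration.

```python
import hashlib
import tarfile


def member_digests(tgz_path):
    """Map each regular file inside a .tgz archive to its SHA-256 digest."""
    digests = {}
    with tarfile.open(tgz_path, "r:gz") as archive:
        for member in archive.getmembers():
            if member.isfile():
                data = archive.extractfile(member).read()
                digests[member.name] = hashlib.sha256(data).hexdigest()
    return digests


def compare_archives(path_a, path_b):
    """Report files present in only one archive and files whose content differs."""
    a = member_digests(path_a)
    b = member_digests(path_b)
    return {
        "only_in_a": sorted(set(a) - set(b)),
        "only_in_b": sorted(set(b) - set(a)),
        "different": sorted(n for n in set(a) & set(b) if a[n] != b[n]),
    }
```

Calling compare_archives("ha_config_1.tgz", "ha_config_2.tgz") returns three lists. The files under "different" are the ones worth extracting and diffing line by line to find the part of the config that does not sync.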


OR

2) Some system-related issues on the Secondary unit: Please raise a ticket and attach the debug file GUI->System->Maintenance->Debug->Debug Log.

 


ddsouza_FTNT
Staff

For some reason, I don't see my reply in the post, so I am adding it again.

The behavior you have observed indicates a problem with HA sync. Please check the status of the secondary node. You can check it on the primary (System > High Availability > Settings) while the slave device is still connected to the primary via HA. Usually, the status of the slave shows as 'INIT' in this scenario, which means there is a problem with HA sync. When this happens, by design, the master keeps pushing the configuration to the slave until the config is synced (many other files are synced too, for example the AV DB and signature DB, but the config is most often the problem). The slave keeps reloading the configuration file, so the event logs on the primary node show the secondary node leaving HA and then rejoining. You lose access to the secondary unit while it is reloading.



ddsouza_FTNT
Staff

I understand it is a bit challenging to collect the logs from the slave in those 4 minutes. At the very least, you can collect ha_config_1.tgz and ha_config_2.tgz from the master node, which can be used later to verify which part of the configuration is not in sync.
