Johan_Lysen
New Member
September 1, 2011
Question

Heartbeat device(interface) down

Hi,

I have a cluster that seems to work OK, but I still get these messages:

----------------------------------------------------
Message meets Alert condition
The following critical firewall event was detected: Critical Event.
date=2011-09-01 time=14:34:00 devname=SE-OSD-FGT-001 device_id=FGT60C3G10013303 log_id=0105037901 type=event subtype=ha pri=critical vd="root" msg="Heartbeat device(interface) down" ha_role=master hbdn_reason=neighbor-info-lost devintfname=dmz

Message meets Alert condition
The following critical firewall event was detected: Critical Event.
date=2011-09-01 time=14:34:00 devname=SE-OSD-FGT-001 device_id=FGT60C3G10013303 log_id=0105037901 type=event subtype=ha pri=critical vd="root" msg="Heartbeat device(interface) down" ha_role=master hbdn_reason=neighbor-info-lost devintfname=internal4
----------------------------------------------------

Firmware: FGT60C-4.00-FW-build458-110627

The HA settings on the "primary" look like this:

 config system ha
     set group-id 7
     set group-name "FGT-HA"
     set mode a-p
     set hbdev "dmz" 100 "internal4" 50
     set override disable
     set priority 150
     set monitor "internal1" "internal2" "internal3" "wan2"
 end

Any ideas?
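For anyone who wants to track which heartbeat interface is reported and why, the alert lines above are plain key=value pairs and can be split mechanically. The following is a minimal sketch, not a Fortinet tool: the function name `parse_alert` is made up for illustration, and the sample line is shortened from the alert quoted in the question.

```python
import shlex


def parse_alert(line: str) -> dict[str, str]:
    """Split a key=value log line into a dict (hypothetical helper)."""
    fields = {}
    # shlex.split keeps quoted values such as msg="..." together as one token.
    for token in shlex.split(line):
        key, sep, value = token.partition("=")
        if sep:  # ignore tokens that contain no '='
            fields[key] = value.strip()
    return fields


# Sample shortened from the alert in the question.
sample = (
    'date=2011-09-01 time=14:34:00 devname=SE-OSD-FGT-001 '
    'type=event subtype=ha pri=critical vd="root" '
    'msg="Heartbeat device(interface) down" ha_role=master '
    'hbdn_reason=neighbor-info-lost devintfname=dmz'
)

event = parse_alert(sample)
print(event["devintfname"], event["hbdn_reason"])
```

Collecting `devintfname`, `hbdn_reason` and the timestamp for each alert makes it easier to see whether both heartbeat links always drop at the same moment.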

    9 replies

    ede_pfau
    SuperUser
    September 1, 2011
    Hi,

    From what it looks like, the master has lost connectivity on both HA links simultaneously ('dmz' and 'internal4'). Did you observe that the cluster has failed over?

    As for the reason, I can only guess:
    - both physical connections have failed (i.e. were pulled), which is quite unlikely
    - the master unit failed completely
    - FortiOS error

    You're running 4.3.1, which is daring IMO. As long as you don't find any other indication, I'd bet on a FortiOS failure.

    Some guesses:
    - Can you observe signs that CPU and/or memory usage is exceedingly high?
    - Did a signature update happen shortly before the HA failure?
    - If the master unit is still alive, is the HA info synched?

    If the HA master has been demoted to slave now, you may reboot the unit without affecting the (live) network it is in. Depending on the HA settings, it will fail over to master again after rebooting, or stay standby.

    IMHO you only have a chance of opening a support case if the behaviour is repeatable.
    Johan_Lysen
    New Member
    September 1, 2011
    Hi,

    There is no failover involved, and diag sys top doesn't show high CPU. We get this issue roughly 1-10 times each day. No ticket created yet...
    ede_pfau
    SuperUser
    September 1, 2011
    OK, so the cluster just detects that HB packets were lost, but the threshold is high enough to prevent a failover. You can now:

    - enlarge the interval the cluster members will wait until they detect a HB packet loss:

     config system ha
         set hb-lost-threshold 6
         set hb-interval 4
     end

      This increases the hb-interval from 200 ms to 400 ms. A total of 6 missed packets leads to a failover, so there's a gap of 2.4 seconds until it triggers.
    - disable session-pickup if the unit is too busy. This will decrease the traffic load on the HA link substantially. That depends on the type of traffic you expect; if it's mainly HTTP you can disable session pickup.
    - stop monitoring all other interfaces. The loss of the HA heartbeat will take care of a device failure. If you absolutely must monitor a link, choose just one; and traffic on it should not be too heavy.
    - downgrade to 4.2.x if available for the 60C. This is your weakest option IMHO.

    I assume that the HA link is made by a simple TP cable and not via a switch. Switches can sometimes stumble upon the ethernet type the HA traffic protocol uses.
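    The timing arithmetic behind the hb-interval and hb-lost-threshold suggestion can be checked with a few lines of Python. This is only a sketch: it assumes one hb-interval unit equals 100 ms (which matches the "4 means 400 ms" statement above), and the function name `detection_time_ms` is made up for illustration.

```python
# Assumption: one hb-interval unit is 100 ms, consistent with
# "hb-interval 4" meaning 400 ms in the post above.
HB_INTERVAL_UNIT_MS = 100


def detection_time_ms(hb_interval: int, hb_lost_threshold: int) -> int:
    """Time until the cluster declares the heartbeat lost (hypothetical helper)."""
    return hb_interval * HB_INTERVAL_UNIT_MS * hb_lost_threshold


suggested = detection_time_ms(hb_interval=4, hb_lost_threshold=6)
previous_interval = detection_time_ms(hb_interval=2, hb_lost_threshold=6)

print(suggested)          # 2400 ms, i.e. the 2.4 seconds mentioned above
print(previous_interval)  # 1200 ms with the 200 ms interval and the same threshold
```

    A longer window makes the cluster more tolerant of a busy CPU delaying heartbeat packets, at the cost of a slower failover when a unit really dies.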
    Johan_Lysen
    New Member
    September 2, 2011
    Hi, and thanks for the fast answers.

    I have made the hb-lost-threshold/hb-interval change, and also reduced the number of monitored interfaces to only two, one per switch tier (internal, internet). That way we can detect that the external main internet switch is lost and trigger a failover, and likewise if the internal main network switch is down.

    Yes, we have a crossed TP cable on the DMZ port for HA traffic.

    No, we don't use session pickup, since the FG60C doesn't have enough main CPU resources for that. With session pickup enabled we get 100% CPU usage when hitting the unit with more than ~100 Mbps of traffic. With session pickup disabled, that issue is gone.
    Johan_Lysen
    New Member
    September 9, 2011
    Hi again,

    There is more and more evidence pointing to some issue with logging, and all the other issues are a consequence of that.

    - miglogd runs at 25-50% CPU on average and makes all other tasks "high"; even login to the WebGUI can be "down" for 15 minutes sometimes.
    - Commands like "show log ?" hang the CLI.
    - "ha-device-lost" is probably because there is no CPU left to run hatalk on.

    If I disable all logging and do a fresh restart, everything works pretty nicely for a while (days).

    There is a ticket created with Fortinet support, but... no
    deltasoft
    New Member
    May 4, 2012
    Hi Johan,

    I have exactly the same problem. Any news about feedback from Fortinet support?

    2 x FGT60B, 4.0MR1 patch 10

    Thanks a lot
    Carl_Wallmark
    New Member
    September 9, 2011
    Hi Johan,

    I would stay away from MR3; it's not stable at all. I have seen memory leaks, log issues, etc. I have heard Patch 2 will be out within weeks.
    Johan_Lysen
    New Member
    September 19, 2011
    Why is it so hard to release something stable?
    Carl_Wallmark
    New Member
    September 20, 2011
    We have been asking the same for a long time. A rule of thumb: stay one MR release behind the latest. I'm on 4.2.8, and it's very stable.
    Johan_Lysen
    New Member
    May 4, 2012
    I found out that VLAN-tagged interfaces were not accelerated in hardware. After a big fight with first-line support I got hold of second line. It was then reported as a bug a few months ago, and yesterday I was told the good news: "A fix has been developed for this issue and will be added in the future 4.0MR3P7. It should be published next week."