deltasoft
New Member
June 19, 2012
Question

Unexpected HA failover issues

Hello all,

I have an issue with two FortiGate 60B units configured in HA active-passive mode. The heartbeat interfaces are:

- WAN1, connected through a switch with dedicated VLAN ports (untagged)
- WAN2, connected directly with a crossover cable

Several times a day, at random, the cluster starts an HA failover with these logs:

    Message meets Alert condition
    The following critical firewall event was detected: Critical Event.
    date=2012-04-13 time=22:49:27 devname=company-fw2 device_id=FGT60B3908650580 log_id=0105037901 type=event subtype=ha pri=critical fwver=040010 vd="root" msg="Heartbeat device(interface) down" ha_role=slave hbdn_reason=neighbor info lost devintfname=wan2

    Message meets Alert condition
    The following critical firewall event was detected: Critical Event.
    date=2012-04-13 time=22:49:27 devname=company-fw2 device_id=FGT60B3908650580 log_id=0105037901 type=event subtype=ha pri=critical fwver=040010 vd="root" msg="Heartbeat device(interface) down" ha_role=slave hbdn_reason=neighbor info lost devintfname=wan1

    Message meets Alert condition
    The following critical firewall event was detected: Critical Event.
    date=2012-04-13 time=22:49:28 devname=company-fw1 device_id=FGT60B3908670675 log_id=0105037901 type=event subtype=ha pri=critical fwver=040010 vd="root" msg="Heartbeat device(interface) down" ha_role=master hbdn_reason=neighbor info lost devintfname=wan2

    Message meets Alert condition
    The following critical firewall event was detected: Critical Event.
    date=2012-04-13 time=22:49:28 devname=company-fw1 device_id=FGT60B3908670675 log_id=0105037901 type=event subtype=ha pri=critical fwver=040010 vd="root" msg="Heartbeat device(interface) down" ha_role=master hbdn_reason=neighbor info lost devintfname=wan1

What I have ruled out so far:

- No power outage (the firewalls and switches are connected to a UPS, and the switches are always online)
- No switch problems (no evidence of restarts or problems in their logs)

I've tried enabling only one heartbeat interface at a time, first wan1 and then wan2, with no success.

When the HA failover occurs, clients inside the LAN lose their internet connection and all VPN tunnels are brought down, causing big connectivity troubles.

Initially there was only one firewall connected, working perfectly. When I added the second firewall in HA mode, the problems started immediately. In the past I've configured several other units in HA mode with no problems. I cannot explain the reason for this malfunction.

I opened a support ticket more than a month ago, only to discover that the technical support is very poor (one answer every 4-5 days) and totally useless, because they don't have any idea how to solve the problem.

Thanks in advance for your support, you're my last chance :)
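For anyone comparing many of these alerts, the event lines in the post are plain key=value records, so they can be grouped by unit with a few lines of script. The following is only a rough sketch: the function names `parse_event` and `interfaces_down_by_unit` are made up for illustration, and the parsing rule (a value is either double-quoted or runs until the next ` key=`) is an assumption based on the lines quoted in this thread, not a documented log grammar.

```python
import re

# Assumed format: key=value pairs; a value is either double-quoted or runs
# until the next " key=" (this lets "neighbor info lost" keep its spaces).
_PAIR = re.compile(r'(\w+)=("[^"]*"|.*?)(?=\s+\w+=|\s*$)')


def parse_event(line):
    """Split one event log line into a dict of field name -> value."""
    return {key: value.strip('"').strip() for key, value in _PAIR.findall(line)}


def interfaces_down_by_unit(lines):
    """Map each device name to the set of heartbeat interfaces it reported down."""
    down = {}
    for line in lines:
        event = parse_event(line)
        if event.get("subtype") == "ha" and "devintfname" in event:
            down.setdefault(event["devname"], set()).add(event["devintfname"])
    return down


if __name__ == "__main__":
    sample = [
        'date=2012-04-13 time=22:49:27 devname=company-fw2 type=event subtype=ha '
        'pri=critical msg="Heartbeat device(interface) down" ha_role=slave '
        'hbdn_reason=neighbor info lost devintfname=wan2',
        'date=2012-04-13 time=22:49:27 devname=company-fw2 type=event subtype=ha '
        'pri=critical msg="Heartbeat device(interface) down" ha_role=slave '
        'hbdn_reason=neighbor info lost devintfname=wan1',
    ]
    print(interfaces_down_by_unit(sample))
```

In the alerts quoted in the post, each unit reports both wan1 and wan2 down within the same second, even though the two links take physically different paths (switch and direct cable).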

    6 replies

    Matthijs
    New Member
    June 19, 2012
    Please login to the cli and type the following:
      config system ha
      show full
    It might have something to do with the HA timers. Do you monitor the CPU usage of the units? Maybe a problem with one of the units is causing a high CPU load? Which software version do you use?
    deltasoft (Author)
    New Member
    June 19, 2012
    Hi Matthijs,
    this is the output:
      config system ha
          set group-id 0
          set group-name "IG-HA"
          set mode a-p
          set password ENC xxxxxxxxxxxxxxxxxxxxxxxxxxx
          set hbdev "wan1" 50 "wan2" 100
          set route-ttl 10
          set route-wait 0
          set route-hold 10
          set sync-config enable
          set encryption disable
          set authentication disable
          set hb-interval 4
          set hb-lost-threshold 20
          set helo-holddown 20
          set arps 5
          set arps-interval 8
          set session-pickup disable
          set link-failed-signal disable
          set uninterruptable-upgrade enable
          set vcluster2 disable
          set override disable
          set priority 128
          unset monitor
          unset pingserver-monitor-interface
          set pingserver-failover-threshold 0
          set pingserver-flip-timeout 60
      end
    The software version is 4.0 MR1 Patch 10, sorry I missed that.
    Following tech support's advice, I've already changed the timers this way:
      config system ha
          set hb-lost-threshold 20    (was 6)
          set hb-interval 4           (was 2)
      end
    but with no success. Monitoring did not show high CPU usage. Session pickup, initially enabled, has been disabled. The Fortinet scheduled update has been set to once a day, outside working hours (3:00 am).
    Thanks
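To put the timer change in perspective, here is a small sketch of how long the peer can stay silent before the heartbeat is declared lost. It assumes that `hb-interval` is counted in units of 100 ms and that the heartbeat is declared lost after `hb-lost-threshold` consecutive missed packets; please check both assumptions against the HA documentation for your FortiOS release. The function name is made up for illustration.

```python
# Assumption: hb-interval is expressed in units of 100 ms.
HB_INTERVAL_UNIT_MS = 100


def heartbeat_timeout_seconds(hb_interval, hb_lost_threshold):
    """Approximate silence (in seconds) tolerated before the peer is considered lost."""
    return hb_interval * HB_INTERVAL_UNIT_MS * hb_lost_threshold / 1000.0


if __name__ == "__main__":
    # Values from the post: the earlier settings and the ones suggested by support.
    for label, interval, threshold in [("before", 2, 6), ("after", 4, 20)]:
        print(label, heartbeat_timeout_seconds(interval, threshold), "s")
```

Under these assumptions the change moves the tolerated silence from roughly one second to several seconds, so failovers that still occur point to heartbeats being lost for a longer stretch rather than to a briefly delayed packet.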
    ede_pfau
    SuperUser
    June 19, 2012
    Why do you use interface monitoring at all? You can achieve unit failure detection with HA heartbeats alone. If your WAN device fails you can detect that via Gateway detection and re-route. Which FortiOS version?
    deltasoft (Author)
    New Member
    June 19, 2012
    Hi ede_pfau, so do you suggest disabling monitoring? Do you think this could solve my problem and prevent further HA failovers? The OS is 4.0 MR1 P10. Thanks
    deltasoft (Author)
    New Member
    June 19, 2012
    About the units' usage:
      - small office, ~40 users
      - mainly web and email traffic
      - only Fortinet web filtering enabled, no antivirus/antispam, default IPS
      - 3 IPsec VPN tunnels: 2 LAN-to-LAN and 1 road warrior
    I was thinking of upgrading the firmware to 4.0 MR2 or MR3, latest patch, to see if that solves the problem. Last week I asked tech support about this and got no answer :(
    ede_pfau
    SuperUser
    June 19, 2012
    OK, so you have the FGTs HA-linked on WAN2 (primarily) and WAN1. You do not monitor any link failure apart from these, so there is no need to change the configuration. To me the configuration looks correct. If even increasing the timeout period doesn't help... I would consider upgrading to v4.00 MR2 Patch 12. To be honest, I don't see any striking errors whose correction would solve the problem immediately. So, upgrading as a last resort. Please note that there have been changes from 4.1 to 4.2; please read the Release Notes carefully! Also, it would help if you are onsite while upgrading. If the cluster breaks during the upgrade, there is a risk of losing the slave.
    deltasoft (Author)
    New Member
    June 19, 2012
    Hi ede_pfau,
    "Please note that there have been changes from 4.1 to 4.2; please read the Release Notes carefully! Also, it would help if you are onsite while upgrading. If the cluster breaks during the upgrade, there is a risk of losing the slave."
    Which changes do you think are relevant to pay attention to? I can manage 1-2 hours of internet outage by scheduling it at the end of the working day. I was thinking of upgrading according to these steps:
      1) break the cluster and disconnect the 2nd unit
      2) save the config
      3) reset the 2nd unit to factory defaults
      4) upgrade the 2nd unit to 4.0 MR2
      5) reload the config to the 2nd unit
      6) physically switch the two units
      7) reset the 1st unit to factory defaults
      8) upgrade the 1st unit to 4.0 MR2
      9) connect the 1st unit and rejoin the cluster
    Do you think this is a correct upgrade plan?
    rwpatterson
    New Member
    June 19, 2012
    Why don't you just try to upgrade them as a stack? If it fails, you can break the stack and do it individually. The issues may simply be that this firmware level has problems with HA. Also, and this is important: if you upgrade the unit in the factory default state, you cannot safely restore the older config. Each backup is only valid for the firmware version it was made from. You should load the same older version, then restore the config, and upgrade with the config in place. This will give you the best chance of a working system after it's all done.
    ede_pfau
    SuperUser
    June 19, 2012
    I can only support what Bob (rwpatterson) has posted. Keep it simple. Regarding the changes, all details are in the RN. Judge for yourself if they are important to your setup.
    deltasoft (Author)
    New Member
    June 19, 2012
    OK, thanks all, I'll let you know. What about upgrading to 4.0 MR3? Is it ready for production yet?
    rwpatterson
    New Member
    June 19, 2012
    Personally, I would stick with MR2. Get the kinks ironed out, then move up if you feel you need the newer features.
    deltasoft (Author)
    New Member
    June 23, 2012
    OK, yesterday I upgraded the firmware of the cluster to 4.0 MR2 Patch 12. I'll wait and see whether the problem persists. In the meantime there's a new problem: the units do not synchronize with each other. Here is the console log:
      company-fw2 login: slave's external files are not in sync with master, sequence:0. (type CERT_LOCAL)
      slave's external files are not in sync with master, sequence:0. (type CERT_LOCAL)
      slave's external files are not in sync with master, sequence:1. (type CERT_LOCAL)
      slave's external files are not in sync with master, sequence:2. (type CERT_LOCAL)
      slave's external files are not in sync with master, sequence:3. (type CERT_LOCAL)
      slave's external files are not in sync with master, sequence:4. (type CERT_LOCAL)
      slave succeeded to sync external files with master
      slave's configuration is not in sync with master's, sequence:0
      slave's configuration is not in sync with master's, sequence:1
      slave's configuration is not in sync with master's, sequence:2
      slave's configuration is not in sync with master's, sequence:3
      slave's configuration is not in sync with master's, sequence:4
      slave starts to sync with master
      logout all admin users
      slave failed to sync with master, will try again in a moment
      slave's configuration is not in sync with master's, sequence:0
      slave's configuration is not in sync with master's, sequence:1
      slave's configuration is not in sync with master's, sequence:2
      slave's configuration is not in sync with master's, sequence:3
      slave's configuration is not in sync with master's, sequence:4
      slave starts to sync with master
      logout all admin users
      slave failed to sync with master, will try again in a moment
      slave's configuration is not in sync with master's, sequence:0
      slave's configuration is not in sync with master's, sequence:1
      slave's configuration is not in sync with master's, sequence:2
      slave's configuration is not in sync with master's, sequence:3
      slave's configuration is not in sync with master's, sequence:4
      slave starts to sync with master
      logout all admin users
      slave failed to sync with master, will try again in a moment
    Here the synchronization stops. To resume it I need to reboot the slave unit, but the problem persists:
      company-fw2 login: admin
      Password: *********
      Welcome !
      company-fw2 # execute reboot
      This operation will reboot the system !
      Do you want to continue? (y/n)y
      The system is going down NOW !!
      System is rebooting...
      company-fw2 # Please stand by while rebooting the
      FGT60B (15:29-09.06.2007)
      Ver:04000006
      Serial number:FGT60B9999999999
      RAM activation
      Total RAM: 256MB
      Enabling cache...Done.
      Scanning PCI bus...Done.
      Allocating PCI resources...Done.
      Enabling PCI resources...Done.
      Zeroing IRQ settings...Done.
      Verifying PIRQ tables...Done.
      Boot up, boot device capacity: 64MB.
      Press any key to display configuration menu...
      ......
      Reading boot image 1817002 bytes.
      Initializing firewall...
      System is started.
      company-fw2 login: slave's external files are not in sync with master, sequence:0. (type CERT_LOCAL)
      slave's external files are not in sync with master, sequence:1. (type CERT_LOCAL)
      slave's external files are not in sync with master, sequence:2. (type CERT_LOCAL)
      slave's external files are not in sync with master, sequence:3. (type CERT_LOCAL)
      slave's external files are not in sync with master, sequence:4. (type CERT_LOCAL)
      slave succeeded to sync external files with master
      slave's configuration is not in sync with master's, sequence:0
      slave's configuration is not in sync with master's, sequence:1
      slave's configuration is not in sync with master's, sequence:2
      slave's configuration is not in sync with master's, sequence:3
      slave's configuration is not in sync with master's, sequence:4
      slave starts to sync with master
      logout all admin users
      slave failed to sync with master, will try again in a moment
      slave's configuration is not in sync with master's, sequence:0
      slave's configuration is not in sync with master's, sequence:1
      slave's configuration is not in sync with master's, sequence:2
      slave's configuration is not in sync with master's, sequence:3
      slave's configuration is not in sync with master's, sequence:4
      slave starts to sync with master
      logout all admin users
      slave failed to sync with master, will try again in a moment
      slave's configuration is not in sync with master's, sequence:0
      slave's configuration is not in sync with master's, sequence:1
      slave's configuration is not in sync with master's, sequence:2
      slave's configuration is not in sync with master's, sequence:3
      slave's configuration is not in sync with master's, sequence:4
      slave starts to sync with master
      logout all admin users
      slave failed to sync with master, will try again in a moment
    and the synchronization stops again. I followed this KB article:
      http://kb.fortinet.com/kb/microsites/search.do?cmd=displayKC&docType=kc&externalId=FD31379&sliceId=1&docTypeID=DT_KCARTICLE_1_1&dialogID=34334483&stateId=0 0 34336209
    without success; the command "execute ha synchronize config" does not start any synchronization. I haven't restarted the primary unit yet. I'm afraid that if I restart it I will lose access to the cluster, because I don't know how well the slave unit will work.
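If the console output is captured to a file, a short script can tally how each sync attempt ended instead of reading the repeats by eye. This is only a sketch: the message strings are copied from the console log in this post, while the function name `tally_sync_events` and the result keys are invented for illustration.

```python
# Message texts as they appear in the console log quoted in this thread.
MARKERS = {
    "external_files_synced": "slave succeeded to sync external files with master",
    "config_sync_started": "slave starts to sync with master",
    "config_sync_failed": "slave failed to sync with master",
}


def tally_sync_events(console_text):
    """Count how often each sync message occurs in captured console output."""
    # Collapse whitespace so messages wrapped across lines still match.
    flat = " ".join(console_text.split())
    return {name: flat.count(marker) for name, marker in MARKERS.items()}


if __name__ == "__main__":
    capture = """
    slave succeeded to sync external files with master
    slave starts to sync with master
    logout all admin users
    slave failed to sync with master, will try again in a moment
    """
    print(tally_sync_events(capture))
```

In the log above, every "starts to sync" is followed by a "failed to sync", while the external files (type CERT_LOCAL) do synchronize, which suggests the failure is specific to the configuration sync step.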
    rwpatterson
    New Member
    June 24, 2012
    When you log into the slave, check the firmware version. Make sure it upgraded as well.
    deltasoft (Author)
    New Member
    June 25, 2012
    It all seems OK:
      company-fw1 # get system status
      Version: Fortigate-60B v4.0,build0346,120606 (MR2 Patch 12)
      Virus-DB: 15.00748(2012-06-24 15:28)
      Extended DB: 15.00734(2012-06-22 07:33)
      IPS-DB: 3.00203(2012-06-20 22:19)
      FortiClient application signature package: 1.503(2012-06-22 17:58)
      Serial-Number: FGT60XXXXXXXXXX
      BIOS version: 04000009
      Log hard disk: Not available
      Internal Switch mode: switch
      Hostname: company-fw1
      Operation Mode: NAT
      Current virtual domain: root
      Max number of virtual domains: 10
      Virtual domains status: 1 in NAT mode, 0 in TP mode
      Virtual domain configuration: disable
      FIPS-CC mode: disable
      Current HA mode: a-p, master
      Distribution: International
      Branch point: 346
      Release Version Information: MR2 Patch 12
      System time: Mon Jun 25 11:36:55 2012

      company-fw1 # execute ha manage 1
      company-fw2 $ get system status
      Version: Fortigate-60B v4.0,build0346,120606 (MR2 Patch 12)
      Virus-DB: 15.00748(2012-06-24 15:28)
      Extended DB: 15.00734(2012-06-22 07:33)
      IPS-DB: 3.00203(2012-06-20 22:19)
      FortiClient application signature package: 1.503(2012-06-22 18:05)
      Serial-Number: FGT60BYYYYYYYYYYYY
      BIOS version: 04000006
      Log hard disk: Not available
      Internal Switch mode: switch
      Hostname: company-fw2
      Operation Mode: NAT
      Current virtual domain: root
      Max number of virtual domains: 10
      Virtual domains status: 1 in NAT mode, 0 in TP mode
      Virtual domain configuration: disable
      FIPS-CC mode: disable
      Current HA mode: a-p, backup
      Distribution: International
      Branch point: 346
      Release Version Information: MR2 Patch 12
      System time: Mon Jun 25 11:37:20 2012
    Only the BIOS version is different. Do you think that's a problem?
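Comparing two `get system status` dumps by eye is error-prone, so here is a rough sketch that diffs them field by field. It assumes each output is captured one `Key: value` pair per line, as the CLI prints it; the helper names and the list of fields expected to differ between cluster members are my own choices for illustration, not anything defined by FortiOS.

```python
# Fields that normally differ between two healthy cluster members (an assumption).
EXPECTED_TO_DIFFER = {"Serial-Number", "Hostname", "Current HA mode", "System time"}


def parse_status(text):
    """Turn 'Key: value' lines into a dict; lines without a colon are skipped."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def unexpected_differences(status_a, status_b, ignore=EXPECTED_TO_DIFFER):
    """Return {field: (value_a, value_b)} for fields that differ and are not ignored."""
    a, b = parse_status(status_a), parse_status(status_b)
    return {
        key: (a.get(key), b.get(key))
        for key in sorted(a.keys() | b.keys())
        if key not in ignore and a.get(key) != b.get(key)
    }


if __name__ == "__main__":
    fw1 = "Version: Fortigate-60B v4.0,build0346,120606 (MR2 Patch 12)\nBIOS version: 04000009\nHostname: company-fw1\n"
    fw2 = "Version: Fortigate-60B v4.0,build0346,120606 (MR2 Patch 12)\nBIOS version: 04000006\nHostname: company-fw2\n"
    for field, (left, right) in unexpected_differences(fw1, fw2).items():
        print(f"{field}: {left} != {right}")
```

Run against the two full outputs above, a diff like this would also list the FortiClient application signature package timestamp (17:58 versus 18:05) alongside the BIOS version.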