Technical Tip: Combining Remote Link Monitoring with FGCP cluster High Availability

mmaubert · ‎07-24-2019

Description

This article describes how to use and configure Remote Link Monitoring in combination with an FGCP HA cluster.

Link Monitoring can be combined with an FGCP cluster in order to detect a local link failure and thus ensure the high availability of a cluster.

Remote Link Monitoring can be used to detect a remote failure, either on a remote link or remote equipment, and potentially trigger a cluster failover to avoid a traffic interruption.

Scope

FortiGate.

Solution

Basic steps to implement a Remote Link Monitor:

Select one or more network servers or services located downstream of the cluster.
Configure the cluster to periodically monitor access to those network resources.
Trigger a cluster failover in case the primary unit loses access to those resources.

There is no real limitation on how Remote Link Monitoring can be applied. It can be implemented either in conjunction with Port Monitoring or in standalone mode. It can cover one or several remote devices accessible from one or several ports of the FGCP HA cluster.

Configuring and adding Remote Link Monitoring in an HA cluster configuration:

The following information is to be specified:

1. The IP address of the remote device(s) that need to be monitored.
2. The FortiGate unit interface to be used for the remote link monitoring.
3. The protocol to be used for the remote link monitoring.
4. The 'nominal' PENALTY to be applied each time a health check failure occurs.
5. The time interval between two health check attempts.
6. The number of tolerable health check failures before a failover is triggered.

Example of a Remote Link Monitoring configuration as defined above:

show system link-monitor
config system link-monitor
edit ha-link-monitor
set server 10.10.10.10                 <------------- 1.
set srcintf port1                      <------------- 2.
set protocol ping                      <------------- 3 (Ping is the default option setting).
set ha-priority 1                      <------------- 4   (1 is the default value).
set interval 5                         <------------- 5   (5 is the default value).
set failtime 2                         <------------- 6   (5 is the default value).
end

Once a Remote Link Monitor (also referred to as Remote IP Monitoring or PING server Monitoring in the documentation) has been defined, it can then be integrated into the cluster HA configuration and the following information specified:

1. The FortiGate unit interface(s) that are being used for the remote link monitoring.
2. The failover threshold value (value against which the 'global' PENALTY value is compared).
3. The type of failover mechanism (automated failback or not upon expiration of the FLIP timer).
4. The HA remote link monitoring FLIP timeout value.

Example of an HA cluster configuration combined with PING server monitoring:

show system ha
config system ha
…
set pingserver-monitor-interface port1        <------------- 1.
set pingserver-failover-threshold 0           <------------- 2   (0 is the default value).
set pingserver-secondary-force-reset disable      <------------- 3   (option enabled by default).
set pingserver-flip-timeout 60                <------------- 4   (by default set to 60 minutes).
…
end

Note: Setting the 'pingserver-failover-threshold' variable to '0' causes an HA failover to be triggered as soon as the remote link failure is detected.

Remote Link Monitoring considerations:

Note: Link monitoring is a mechanism to activate the FGCP HA election process. The decision to trigger a failover or not is ultimately taken by the HA process itself, based on HA parameter values such as whether the 'override' parameter is enabled, the HA priority value set on each cluster unit, and so on.
The scenario detailed below is based on the assumption that the HA 'override' parameter is enabled and the cluster 'preferred' primary is set with a higher 'priority' value than the secondary. This type of setting is typically used when there is a need to have one of the cluster units acting, as far as possible, as a 'preferred' primary unit.

Using Remote Link Monitor in conjunction with the FGCP cluster High Availability:

Each time a remote link monitoring failure is detected by the HA cluster primary unit, the 'global' PENALTY, which is set to '0' by default, is incremented by the 'nominal' PENALTY value (the 'ha-priority' parameter value) and compared to the failover threshold value (the 'pingserver-failover-threshold' variable value).

When the threshold value is reached, the 'global' PENALTY value of the primary is compared with that of the secondary, and if it is higher, the FLIP timer is started and a failover occurs. The new primary starts monitoring the remote link on its own and will handle any remote link monitoring failure as described above, i.e., its 'global' PENALTY will be incremented by its 'nominal' PENALTY value until the failover threshold value is reached. The action taken when the FLIP timer elapses then depends on the setting of the 'pingserver-secondary-force-reset' variable.
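
The penalty accounting described above can be sketched in Python. This is a simplified model for illustration only, not FortiOS code: the `Unit` class and both function names are hypothetical, and the failover test simply follows the wording of this article (threshold reached, then primary penalty higher than secondary penalty). The real decision is taken by the HA election process.

```python
from dataclasses import dataclass


@dataclass
class Unit:
    """Hypothetical model of one cluster unit's link-monitor penalty state."""
    name: str
    global_penalty: int = 0  # the 'global' PENALTY starts at 0


def record_failure(unit: Unit, ha_priority: int) -> None:
    """Add the 'nominal' PENALTY (ha-priority) after a link-monitor failure."""
    unit.global_penalty += ha_priority


def should_fail_over(primary: Unit, secondary: Unit, threshold: int) -> bool:
    """True when the threshold is reached and the primary's penalty is higher."""
    threshold_reached = primary.global_penalty >= threshold
    return threshold_reached and primary.global_penalty > secondary.global_penalty


fgt1 = Unit("FGT1")  # current primary
fgt2 = Unit("FGT2")  # current secondary

# One link-monitor failure with ha-priority 1 and threshold 0.
record_failure(fgt1, ha_priority=1)
print(should_fail_over(fgt1, fgt2, threshold=0))
```

With a threshold of 0, a single failure is enough for the comparison to succeed in this model, which matches the note on 'pingserver-failover-threshold' earlier in this article. A higher threshold requires several failures to accumulate first.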

'pingserver-secondary-force-reset' variable is set to 'enable' (default setting).

When the FLIP timer elapses, the 'preferred' primary 'global' PENALTY is reset. Regardless of the remote link monitoring status on the new primary, the cluster automatically returns to normal operation, i.e., a failover occurs since the HA 'override' parameter is enabled and the HA 'priority' of the 'preferred' primary is higher than that of the new primary. The FLIP timer is started, and the 'preferred' primary unit starts remote link monitoring again. If the remote link is restored, the cluster continues to operate normally. If, however, the remote link is still down, remote link failover causes the cluster to fail over again at the time the FLIP timer expires.
This sequence, known as FLIP-FLOP failover, will repeat each time the FLIP timer expires, up until the failed remote link is restored.

'pingserver-secondary-force-reset' variable is set to 'disable'.

With this setting, the 'preferred' primary 'global' PENALTY is not reset when the FLIP timer elapses. This way, there will be no FLIP-FLOP failover if the new primary does not detect any remote link failure, the drawback being that the 'preferred' primary will never get a chance to become primary again, even if the remote link is restored on its side.

Only a manual failover (likely after resolving the ping server failure) or a remote link failure on the new primary side can trigger a failover. Indeed, if the new primary also experiences a remote link failure, its 'global' PENALTY will be increased and become equal to that of the 'preferred' primary, thus causing the HA election process to start. In this case, the 'preferred' primary will take the cluster ownership back since the HA 'override' parameter is enabled and the HA 'priority' of the 'preferred' primary is higher than that of the new primary.
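
The difference between the two 'pingserver-secondary-force-reset' settings can be illustrated with a small Python sketch. It is a deliberately reduced model under the assumptions of this scenario (HA 'override' enabled, 'preferred' primary with the higher HA priority). The function name and its parameters are hypothetical, and the actual outcome is always decided by the HA election process.

```python
def elect_primary(preferred_penalty: int,
                  new_primary_penalty: int,
                  reset_preferred: bool) -> str:
    """Return 'preferred' or 'new' depending on which unit wins the election.

    reset_preferred=True models 'pingserver-secondary-force-reset enable'
    at FLIP timer expiry; False models 'disable'.
    """
    if reset_preferred:
        # 'enable': the 'preferred' primary's 'global' PENALTY is reset,
        # so it wins regardless of the link state seen by the new primary.
        preferred_penalty = 0
    # With equal penalties the 'preferred' primary wins, because 'override'
    # is enabled and its HA priority is higher (assumption of this scenario).
    if preferred_penalty <= new_primary_penalty:
        return "preferred"
    return "new"


# 'enable': failback happens when the FLIP timer elapses.
print(elect_primary(preferred_penalty=1, new_primary_penalty=0, reset_preferred=True))
# 'disable': the new primary keeps the role while its own link is healthy.
print(elect_primary(preferred_penalty=1, new_primary_penalty=0, reset_preferred=False))
# 'disable': the new primary also fails, penalties are equal, failback occurs.
print(elect_primary(preferred_penalty=1, new_primary_penalty=1, reset_preferred=False))
```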

Troubleshooting the Remote Link Monitor process:

Verifying and controlling the Link Monitor can be done using the following command set:

diagnose debug application link-monitor -1

diagnose debug console timestamp enable

diagnose debug enable

Note: In an FGCP HA cluster, only the primary unit can perform remote link monitoring.

By design, a secondary unit cannot perform any monitoring since it has no active routing table. This can be verified from the following command excerpt, which was recorded on an FGCP HA cluster configured with link monitor and HA settings defined previously.
In the example below, FGT1 is configured as 'preferred' primary. It is primary at the beginning of the test and becomes secondary after it loses connectivity with the remote ping server.

Below is a link monitor test performed on FGT1 (primary unit):

1) 08:37:14: link monitor PING test towards the remote server (10.219.5.237) is done. It is successful
2) 08:37:19: 5 seconds later (cf. 'interval' variable setting) another PING test is done. It is successful
3) 08:37:24: 5 seconds later another PING test is done. It is successful
4) -> a loss of connectivity between FGT1 and the remote ping server is simulated
5) 08:37:29: 5 seconds later another PING test is done but fails. It is done a second time (cf. 'failtime' variable setting) and also fails.
6) 08:37:31: link monitor is flagged as non-operational (cf. 'ha-link-monitor is dead' message)
7) 08:37:33: routing table is deactivated on FGT1 - failover occurs (FGT2 becomes primary)
8) 08:37:37: the PING test cycle is re-initiated but no packets are actually sent since the routing table is inactive
9) 08:37:42: same as step 8
10) 08:37:47: same as step 8

FGT1 # diagnose debug application link-monitor -1
FGT1 # diag debug console timestamp enable
FGT1 # diag debug enable

2019-07-03 08:37:14 lnkmtd::ping_send_msg(256): --> ping 10.219.5.237 seq_no=24344, icmp id=0, send 40 bytes
2019-07-03 08:37:14 lnkmtd:: ha-link-monitor send check request, try 1
2019-07-03 08:37:14 lnkmtd::ping_match(71): try matching ping response 10.219.5.237
2019-07-03 08:37:14 lnkmtd::ping_do_addr_up(57): ha-link-monitor->10.219.5.237(10.219.5.237), rcvd
2019-07-03 08:37:14 monitor_peer_recv-1790: lnkmtd: ha-link-monitor send time 1562135834s 205177us, revd time 1562135834s 206031us
2019-07-03 08:37:14 lnkmtd: ha-link-monitor all servers are probed after 1 times
2019-07-03 08:37:14 policy route related to the monitor(ha-link-monitor) may be added
2019-07-03 08:37:14 lnkmt_ha_mstate_build-182: monitor=ha-link-monitor, state=0,send sz=78
2019-07-03 08:37:14 rcvd cmd = 0

2019-07-03 08:37:19 lnkmtd::ping_send_msg(256): --> ping 10.219.5.237 seq_no=24345, icmp id=0, send 40 bytes
2019-07-03 08:37:19 lnkmtd:: ha-link-monitor send check request, try 1
2019-07-03 08:37:19 lnkmtd::ping_match(71): try matching ping response 10.219.5.237
2019-07-03 08:37:19 lnkmtd::ping_do_addr_up(57): ha-link-monitor->10.219.5.237(10.219.5.237), rcvd
2019-07-03 08:37:19 monitor_peer_recv-1790: lnkmtd: ha-link-monitor send time 1562135839s 205390us, revd time 1562135839s 206305us
2019-07-03 08:37:19 lnkmtd: ha-link-monitor all servers are probed after 1 times
2019-07-03 08:37:19 policy route related to the monitor(ha-link-monitor) may be added
2019-07-03 08:37:19 lnkmt_ha_mstate_build-182: monitor=ha-link-monitor, state=0,send sz=78
2019-07-03 08:37:19 rcvd cmd = 0

2019-07-03 08:37:24 lnkmtd::ping_send_msg(256): --> ping 10.219.5.237 seq_no=24346, icmp id=0, send 40 bytes
2019-07-03 08:37:24 lnkmtd:: ha-link-monitor send check request, try 1
2019-07-03 08:37:24 lnkmtd::ping_match(71): try matching ping response 10.219.5.237
2019-07-03 08:37:24 lnkmtd::ping_do_addr_up(57): ha-link-monitor->10.219.5.237(10.219.5.237), rcvd
2019-07-03 08:37:24 monitor_peer_recv-1790: lnkmtd: ha-link-monitor send time 1562135844s 205655us, revd time 1562135844s 206595us
2019-07-03 08:37:24 lnkmtd: ha-link-monitor all servers are probed after 1 times
2019-07-03 08:37:24 policy route related to the monitor(ha-link-monitor) may be added
2019-07-03 08:37:24 lnkmt_ha_mstate_build-182: monitor=ha-link-monitor, state=0,send sz=78
2019-07-03 08:37:24 rcvd cmd = 0

2019-07-03 08:37:29 lnkmtd::ping_send_msg(256): --> ping 10.219.5.237 seq_no=24347, icmp id=0, send 40 bytes
2019-07-03 08:37:29 lnkmtd:: ha-link-monitor send check request, try 1
2019-07-03 08:37:30 lnkmtd::ping_send_msg(256): --> ping 10.219.5.237 seq_no=24348, icmp id=0, send 40 bytes
2019-07-03 08:37:30 lnkmtd:: ha-link-monitor send check request, try 2
2019-07-03 08:37:30 lnkmtd: ha-link-monitor have tried 2 times, and will restart after 3 seconds
2019-07-03 08:37:31 lnkmtd: ha-link-monitor is dead.
2019-07-03 08:37:31 policy route related to the monitor(ha-link-monitor) may be removed
2019-07-03 08:37:31 lnkmt_ha_mstate_build-182: monitor=ha-link-monitor, state=1,send sz=78
2019-07-03 08:37:31 rcvd cmd = 0

2019-07-03 08:37:33 lnkmt_proute_refresh-582

2019-07-03 08:37:37 rcvd cmd = 0
2019-07-03 08:37:42 rcvd cmd = 0
2019-07-03 08:37:47 rcvd cmd = 0
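
When the debug output is long, the key events can be pulled out with a short script. The Python sketch below is only an illustration: the function name and event labels are hypothetical, the matching relies solely on message fragments visible in the output above ('ping_send_msg', 'ping_do_addr_up', 'is dead'), and other FortiOS versions may print different messages.

```python
import re

# Timestamp followed by the rest of the debug message.
LINE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) (.*)$")

# (fragment to look for, label to report) - labels are illustrative only.
EVENTS = [
    ("ping_send_msg", "probe sent"),
    ("ping_do_addr_up", "reply received"),
    ("is dead", "monitor dead"),
]


def summarize(log_text):
    """Return a list of (timestamp, label) for the recognized debug lines."""
    events = []
    for raw in log_text.splitlines():
        match = LINE.match(raw.strip())
        if not match:
            continue
        timestamp, message = match.groups()
        for fragment, label in EVENTS:
            if fragment in message:
                events.append((timestamp, label))
                break
    return events


SAMPLE = """\
2019-07-03 08:37:24 lnkmtd::ping_send_msg(256): --> ping 10.219.5.237 seq_no=24346, icmp id=0, send 40 bytes
2019-07-03 08:37:24 lnkmtd::ping_do_addr_up(57): ha-link-monitor->10.219.5.237(10.219.5.237), rcvd
2019-07-03 08:37:29 lnkmtd::ping_send_msg(256): --> ping 10.219.5.237 seq_no=24347, icmp id=0, send 40 bytes
2019-07-03 08:37:30 lnkmtd::ping_send_msg(256): --> ping 10.219.5.237 seq_no=24348, icmp id=0, send 40 bytes
2019-07-03 08:37:31 lnkmtd: ha-link-monitor is dead.
"""

for timestamp, label in summarize(SAMPLE):
    print(timestamp, label)
```

Two consecutive 'probe sent' entries with no 'reply received' in between, followed by 'monitor dead', correspond to steps 5 and 6 of the FGT1 test above.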

Below is a link monitor test performed on FGT2 (secondary unit):

1) 08:37:14: PING test cycle triggers but no packets are actually sent since the routing table is inactive
2) 08:37:19: same as step 1
3) 08:37:24: same as step 1
4) 08:37:29: same as step 1
5) 08:37:33: failover occurs (FGT2 becomes primary) - routing table is activated and all interfaces are brought UP
6) 08:37:37: link monitor PING test towards the remote server (10.219.5.237) is done. It is successful
7) 08:37:42: 5 seconds later (cf. 'interval' variable setting) another PING test is done. It is successful
8) 08:37:47: 5 seconds later another PING test is done. It is successful

FGT2 # diagnose debug application link-monitor -1
FGT2 # diag debug console timestamp enable
FGT2 # diag debug enable

2019-07-03 08:37:14 rcvd cmd = 0
2019-07-03 08:37:19 rcvd cmd = 0
2019-07-03 08:37:24 rcvd cmd = 0
2019-07-03 08:37:29 rcvd cmd = 0

2019-07-03 08:37:33 lnkmt_proute_refresh-582

2019-07-03 08:37:33 bring up 'mgmt2'
2019-07-03 08:37:33 bring up 'mgmt2' since all associated intfs are okay
2019-07-03 08:37:33 bring up 'port1'
2019-07-03 08:37:33 bring up 'port1' since all associated intfs are okay
…
2019-07-03 08:37:33 bring up 'wan1'
2019-07-03 08:37:33 bring up 'wan1' since all associated intfs are okay
2019-07-03 08:37:33 bring up 'wan2'
2019-07-03 08:37:33 bring up 'wan2' since all associated intfs are okay

2019-07-03 08:37:37 lnkmtd::ping_send_msg(256): --> ping 10.219.5.237 seq_no=2, icmp id=0, send 40 bytes
2019-07-03 08:37:37 lnkmtd:: ha-link-monitor send check request, try 1
2019-07-03 08:37:37 lnkmtd::ping_match(71): try matching ping response 10.219.5.237
2019-07-03 08:37:37 lnkmtd::ping_do_addr_up(57): ha-link-monitor->10.219.5.237(10.219.5.237), rcvd
2019-07-03 08:37:37 monitor_peer_recv-1790: lnkmtd: ha-link-monitor send time 1562135857s 305897us, revd time 1562135857s 306586us
2019-07-03 08:37:37 lnkmtd: ha-link-monitor all servers are probed after 1 times
2019-07-03 08:37:37 policy route related to the monitor(ha-link-monitor) may be added
2019-07-03 08:37:37 lnkmt_ha_mstate_build-182: monitor=ha-link-monitor, state=0,send sz=78
2019-07-03 08:37:37 rcvd cmd = 0
2019-07-03 08:37:42 lnkmtd::ping_send_msg(256): --> ping 10.219.5.237 seq_no=3, icmp id=0, send 40 bytes
2019-07-03 08:37:42 lnkmtd:: ha-link-monitor send check request, try 1
2019-07-03 08:37:42 lnkmtd::ping_match(71): try matching ping response 10.219.5.237
2019-07-03 08:37:42 lnkmtd::ping_do_addr_up(57): ha-link-monitor->10.219.5.237(10.219.5.237), rcvd
2019-07-03 08:37:42 monitor_peer_recv-1790: lnkmtd: ha-link-monitor send time 1562135862s 305933us, revd time 1562135862s 306853us
2019-07-03 08:37:42 lnkmtd: ha-link-monitor all servers are probed after 1 times
2019-07-03 08:37:42 policy route related to the monitor(ha-link-monitor) may be added
2019-07-03 08:37:42 lnkmt_ha_mstate_build-182: monitor=ha-link-monitor, state=0,send sz=78
2019-07-03 08:37:42 rcvd cmd = 0

2019-07-03 08:37:47 lnkmtd::ping_send_msg(256): --> ping 10.219.5.237 seq_no=4, icmp id=0, send 40 bytes
2019-07-03 08:37:47 lnkmtd:: ha-link-monitor send check request, try 1
2019-07-03 08:37:47 lnkmtd::ping_match(71): try matching ping response 10.219.5.237
2019-07-03 08:37:47 lnkmtd::ping_do_addr_up(57): ha-link-monitor->10.219.5.237(10.219.5.237), rcvd
2019-07-03 08:37:47 monitor_peer_recv-1790: lnkmtd: ha-link-monitor send time 1562135867s 305968us, revd time 1562135867s 306875us
2019-07-03 08:37:47 lnkmtd: ha-link-monitor all servers are probed after 1 times
2019-07-03 08:37:47 policy route related to the monitor(ha-link-monitor) may be added
2019-07-03 08:37:47 lnkmt_ha_mstate_build-182: monitor=ha-link-monitor, state=0,send sz=78

Verifying the accessibility of remote devices used by the Link Monitoring process:

Remote device accessibility cannot be assessed from the secondary unit using a PING test. A PING test executed from a secondary unit may show a remote device as accessible from this unit when it potentially is not. This is because the ICMP echo requests are not actually issued by the secondary unit itself but are handed over to the primary unit via the HA link. The primary unit then forwards the requests to the remote device and relays the ICMP echo replies it receives back to the secondary unit via the HA link.

This can be verified in the example below, wherein the PING test executed from FGT1 (secondary unit) is effectively processed by FGT2 (primary unit).

Below is a 'successful' PING test performed from the secondary unit:

FGT1 # execute ping 10.219.5.237

PING 10.219.5.237 (10.219.5.237): 56 data bytes
64 bytes from 10.219.5.237: icmp_seq=0 ttl=125 time=1.2 ms
64 bytes from 10.219.5.237: icmp_seq=1 ttl=125 time=1.0 ms

Below is a packet capture running on the primary unit, showing that ICMP echo requests are passed to the primary unit via the HA heartbeat interface and then sent to the remote device via port3. ICMP echo replies are received from port3 and forwarded back to the secondary unit via the HA heartbeat interface.

FGT2 # diagnose sniffer packet any 'icmp' 4 0 a

interfaces=[any]
filters=[icmp]
2019-07-03 07:17:07.133523 port_ha in 169.254.0.2 -> 10.219.5.237: icmp: echo request
2019-07-03 07:17:07.133555 havdlink0 out 169.254.0.65 -> 10.219.5.237: icmp: echo request
2019-07-03 07:17:07.133555 havdlink1 in 169.254.0.65 -> 10.219.5.237: icmp: echo request
2019-07-03 07:17:07.133575 port3 out 10.217.2.30 -> 10.219.5.237: icmp: echo request
2019-07-03 07:17:07.134260 port3 in 10.219.5.237 -> 10.217.2.30: icmp: echo reply
2019-07-03 07:17:07.134270 havdlink1 out 10.219.5.237 -> 169.254.0.65: icmp: echo reply
2019-07-03 07:17:07.134270 havdlink0 in 10.219.5.237 -> 169.254.0.65: icmp: echo reply
2019-07-03 07:17:07.134278 port_ha out 10.219.5.237 -> 169.254.0.2: icmp: echo reply
2019-07-03 07:17:07.134281 port5 out 10.219.5.237 -> 169.254.0.2: icmp: echo reply

2019-07-03 07:17:08.133626 port_ha in 169.254.0.2 -> 10.219.5.237: icmp: echo request
2019-07-03 07:17:08.133630 havdlink0 out 169.254.0.65 -> 10.219.5.237: icmp: echo request
2019-07-03 07:17:08.133630 havdlink1 in 169.254.0.65 -> 10.219.5.237: icmp: echo request
2019-07-03 07:17:08.133640 port3 out 10.217.2.30 -> 10.219.5.237: icmp: echo request
2019-07-03 07:17:08.134363 port3 in 10.219.5.237 -> 10.217.2.30: icmp: echo reply
2019-07-03 07:17:08.134370 havdlink1 out 10.219.5.237 -> 169.254.0.65: icmp: echo reply
2019-07-03 07:17:08.134370 havdlink0 in 10.219.5.237 -> 169.254.0.65: icmp: echo reply
2019-07-03 07:17:08.134374 port_ha out 10.219.5.237 -> 169.254.0.2: icmp: echo reply
2019-07-03 07:17:08.134375 port5 out 10.219.5.237 -> 169.254.0.2: icmp: echo reply
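
The interface path taken by the echo requests and replies can be read out of the capture with a short script. The Python sketch below is an illustration only: the function name is hypothetical, and the regular expression assumes exactly the ICMP line layout shown in the capture above (timestamp, interface, direction, source, destination, ICMP type).

```python
import re

# One ICMP line of the sniffer output shown above.
SNIFF = re.compile(
    r"^(?P<ts>\S+ \S+) (?P<iface>\S+) (?P<dir>in|out) "
    r"(?P<src>\S+) -> (?P<dst>\S+): icmp: echo (?P<kind>request|reply)$"
)


def interface_path(capture, kind):
    """Return the ordered 'interface direction' hops for 'request' or 'reply'."""
    hops = []
    for raw in capture.splitlines():
        match = SNIFF.match(raw.strip())
        if match and match["kind"] == kind:
            hops.append(f'{match["iface"]} {match["dir"]}')
    return hops


SAMPLE = """\
2019-07-03 07:17:07.133523 port_ha in 169.254.0.2 -> 10.219.5.237: icmp: echo request
2019-07-03 07:17:07.133555 havdlink0 out 169.254.0.65 -> 10.219.5.237: icmp: echo request
2019-07-03 07:17:07.133555 havdlink1 in 169.254.0.65 -> 10.219.5.237: icmp: echo request
2019-07-03 07:17:07.133575 port3 out 10.217.2.30 -> 10.219.5.237: icmp: echo request
2019-07-03 07:17:07.134260 port3 in 10.219.5.237 -> 10.217.2.30: icmp: echo reply
"""

print(" -> ".join(interface_path(SAMPLE, "request")))
print(" -> ".join(interface_path(SAMPLE, "reply")))
```

A request path that starts with an 'in' hop on the heartbeat interface, rather than being generated locally, is the sign that the PING was relayed on behalf of the secondary unit.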

seshuganesh · ‎06-08-2022

Very good explanations.. Thank you
