Solution |
For example, a link-monitor has been configured on HA as follows:

In this situation, if Router-1 goes down, the link-monitor failure is detected in 4 seconds.
- Time = 'interval' * 'failtime' + 'probe-timeout'
- Time = 1 * 3 + 1 = 4 seconds
Here is the output of link-monitor's debug log:
2023-12-12 23:34:06 lnkmtd::ping_send_msg(409): ---> ping 10.10.10.254 seq_no=13447, icmp id=1, send 20 bytes
2023-12-12 23:34:06 lnkmtd::monitor_proto_peer_send_request(605): ---> 1(10.10.10.254:ping) send probe packet, fail count(0)
2023-12-12 23:34:06 lnkmtd::ping_do_addr_up(116): ---> 1->10.10.10.254(10.10.10.254), rcvd
2023-12-12 23:34:06 lnkmtd::monitor_peer_recv(1992): ---> 1 send time 1702391646s 118905us, revd time 1702391646s 119003us
2023-12-12 23:34:07 lnkmtd::ping_send_msg(409): ---> ping 10.10.10.254 seq_no=13448, icmp id=1, send 20 bytes
2023-12-12 23:34:07 lnkmtd::monitor_proto_peer_send_request(605): ---> 1(10.10.10.254:ping) send probe packet, fail count(0)
2023-12-12 23:34:08 lnkmtd::ping_send_msg(409): ---> ping 10.10.10.254 seq_no=13449, icmp id=1, send 20 bytes
2023-12-12 23:34:08 lnkmtd::monitor_proto_peer_send_request(605): ---> 1(10.10.10.254:ping) send probe packet, fail count(1)
2023-12-12 23:34:09 lnkmtd::ping_send_msg(409): ---> ping 10.10.10.254 seq_no=13450, icmp id=1, send 20 bytes
2023-12-12 23:34:09 lnkmtd::monitor_proto_peer_send_request(605): ---> 1(10.10.10.254:ping) send probe packet, fail count(2)
2023-12-12 23:34:10 lnkmtd::monitor_ppeer_fail(1682): ---> 1(10.10.10.254 ping) is dead.
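The 4-second figure can be cross-checked against the debug output above. The following is a minimal sketch, not a Fortinet tool: the helper names `split_entries` and `seconds_from_last_reply_to_dead` are hypothetical, and it assumes each log entry starts with a `YYYY-MM-DD HH:MM:SS` timestamp, that a `ping_do_addr_up` entry marks a received reply, and that the failure entry contains the text `is dead`. The sample is an abridged copy of the log above, and timestamps only have one-second resolution.

```python
import re
from datetime import datetime

# Assumed entry prefix: "YYYY-MM-DD HH:MM:SS", as seen in the debug output.
TIMESTAMP = r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}"


def split_entries(text):
    """Split debug output into entries, each beginning with a timestamp."""
    parts = re.split(rf"(?={TIMESTAMP} )", text)
    return [p.strip() for p in parts if p.strip()]


def seconds_from_last_reply_to_dead(text):
    """Return seconds between the last received reply and the 'is dead' entry."""
    last_reply = None
    for entry in split_entries(text):
        if not re.match(TIMESTAMP, entry):
            continue  # skip text that is not a timestamped entry (e.g. <SNIP>)
        stamp = datetime.strptime(entry[:19], "%Y-%m-%d %H:%M:%S")
        if "ping_do_addr_up" in entry:
            last_reply = stamp
        elif "is dead" in entry and last_reply is not None:
            return (stamp - last_reply).total_seconds()
    return None


# Abridged copy of the debug log shown above.
sample = (
    "2023-12-12 23:34:06 lnkmtd::ping_do_addr_up(116): ---> 1->10.10.10.254(10.10.10.254), rcvd "
    "2023-12-12 23:34:07 lnkmtd::ping_send_msg(409): ---> ping 10.10.10.254 seq_no=13448, icmp id=1, send 20 bytes "
    "2023-12-12 23:34:09 lnkmtd::monitor_proto_peer_send_request(605): ---> 1(10.10.10.254:ping) send probe packet, fail count(2) "
    "2023-12-12 23:34:10 lnkmtd::monitor_ppeer_fail(1682): ---> 1(10.10.10.254 ping) is dead."
)

print(seconds_from_last_reply_to_dead(sample))  # 4.0
```

The splitter works on both one-entry-per-line output and output pasted as a single line, because it keys on the timestamp rather than on line breaks.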
If L2_switch-1 goes down, an HA failover occurs and the link-monitor failure is detected in 14 seconds.
- Time = 'interval' * 'failtime' + 'probe-timeout' + 'lnkmtd cold start'
- Time = 1 * 3 + 1 + 10 = 14 seconds
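Both calculations above follow the same formula, with the cold start term present only after an HA failover. As a small sketch (the function name `detection_time` and its argument names are illustrative, not FortiOS identifiers; all values are in seconds):

```python
def detection_time(interval, failtime, probe_timeout, cold_start=0):
    """Estimate link-monitor detection time in seconds.

    Implements: interval * failtime + probe-timeout (+ lnkmtd cold start).
    cold_start is 0 without an HA failover and 10 after one, per the article.
    """
    return interval * failtime + probe_timeout + cold_start


# Router-1 goes down, no failover.
print(detection_time(interval=1, failtime=3, probe_timeout=1))  # 4

# L2_switch-1 goes down, HA failover adds the 10-second cold start.
print(detection_time(interval=1, failtime=3, probe_timeout=1, cold_start=10))  # 14
```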
Here is the output of link-monitor's debug log:
2023-12-12 23:36:11 ha_sync_handle_reset()-471: num_peers=1, local_ip=169.254.0.1
2023-12-12 23:36:11 lnkmtd::ping_send_msg(409): ---> ping 10.10.10.254 seq_no=2215, icmp id=709, send 20 bytes
2023-12-12 23:36:11 lnkmtd::monitor_proto_peer_send_request(605): ---> 1(10.10.10.254:ping) send probe packet, fail count(0)
2023-12-12 23:36:12 lnkmtd::ping_send_msg(409): ---> ping 10.10.10.254 seq_no=2216, icmp id=709, send 20 bytes
2023-12-12 23:36:12 lnkmtd::monitor_proto_peer_send_request(605): ---> 1(10.10.10.254:ping) send probe packet, fail count(0)
<SNIP>
2023-12-12 23:36:20 lnkmtd::ping_send_msg(409): ---> ping 10.10.10.254 seq_no=2224, icmp id=709, send 20 bytes
2023-12-12 23:36:20 lnkmtd::monitor_proto_peer_send_request(605): ---> 1(10.10.10.254:ping) send probe packet, fail count(0)
2023-12-12 23:36:21 lnkmtd::ping_send_msg(409): ---> ping 10.10.10.254 seq_no=2225, icmp id=709, send 20 bytes
2023-12-12 23:36:21 lnkmtd::monitor_proto_peer_send_request(605): ---> 1(10.10.10.254:ping) send probe packet, fail count(0)
2023-12-12 23:36:22 lnkmtd::ping_send_msg(409): ---> ping 10.10.10.254 seq_no=2226, icmp id=709, send 20 bytes
2023-12-12 23:36:22 lnkmtd::monitor_proto_peer_send_request(605): ---> 1(10.10.10.254:ping) send probe packet, fail count(1)
2023-12-12 23:36:23 lnkmtd::ping_send_msg(409): ---> ping 10.10.10.254 seq_no=2227, icmp id=709, send 20 bytes
2023-12-12 23:36:23 lnkmtd::monitor_proto_peer_send_request(605): ---> 1(10.10.10.254:ping) send probe packet, fail count(2)
2023-12-12 23:36:24 lnkmtd::monitor_ppeer_fail(1682): ---> 1(10.10.10.254 ping) is dead.
This cold start mechanism was introduced in v7.0.11, v7.2.5, and v7.4.0 because, after a failover, it takes some time to move virtual MAC addresses, Elastic IPs, or Public Static IPs (in cloud environments) to the new primary. If link-monitor evaluated probes immediately, the packet loss on those probes could cause a failover back to the former primary.
Note:
The previous example covers an HA failover in which the link-monitor ends up with status=dead, but the same applies to status=alive. After an HA failover, it takes 10 seconds plus the timers configured under config system link-monitor for the link-monitor status to update to status=alive.
Starting from v7.0.14, v7.2.8, and v7.4.2, the seq_no does not increase during the cold start period, so probes that fail in that window are not counted in the link-monitor packet loss statistics.
2025-06-09 11:53:25 HA event
2025-06-09 11:53:25 ha_sync_handle_reset()-559: num_peers=1, local_ip=20.1.2.10
2025-06-09 11:53:25 lnkmtd::ping_send_msg(435): ---> ping 89.180.243.203 seq_no=3086, icmp id=7, send 20 bytes
2025-06-09 11:53:25 lnkmtd::monitor_proto_peer_send_request(698): ---> L_M_Port1(89.180.243.203:ping) send probe packet, fail count(0)
2025-06-09 11:53:25 lnkmtd::ping_send_msg(435): ---> ping 89.180.243.203 seq_no=3086, icmp id=7, send 20 bytes
2025-06-09 11:53:35 lnkmtd::monitor_proto_peer_send_request(698): ---> L_M_Port1(89.180.243.203:ping) send probe packet, fail count(0)
2025-06-09 11:53:35 lnkmtd::ping_do_addr_up(136): ---> L_M_Port1->89.180.243.203(89.180.243.203), rcvd
2025-06-09 11:53:35 lnkmtd::ping_send_msg(435): ---> ping 89.180.243.203 seq_no=3086, icmp id=7, send 20 bytes
2025-06-09 11:53:35 lnkmtd::monitor_proto_peer_send_request(698): ---> L_M_Port1(89.180.243.203:ping) send probe packet, fail count(0)
2025-06-09 11:53:35 lnkmtd::ping_do_addr_up(136): ---> L_M_Port1->89.180.243.203(89.180.243.203), rcvd
...
2025-06-09 11:53:35 lnkmtd::ping_send_msg(435): ---> ping 89.180.243.203 seq_no=3087, icmp id=7, send 20 bytes <----- 10 seconds have passed since failover, cold start timer finishes, probes started to count.
2025-06-09 11:53:35 lnkmtd::monitor_proto_peer_send_request(698): ---> L_M_Port1(89.180.243.203:ping) send probe packet, fail count(0)
2025-06-09 11:53:35 lnkmtd::ping_do_addr_up(136): ---> L_M_Port1->89.180.243.203(89.180.243.203), rcvd
2025-06-09 11:53:35 lnkmtd::monitor_peer_recv(2219): ---> L_M_Port1 send time 1749466415s 821681us, revd time 1749466415s 867711us
2025-06-09 11:53:35 lnkmtd::monitor_proute_cmdb_set(1147): ---> policy routes or internet service routes related to the monitor(L_M_Port1) may be added
2025-06-09 11:53:35 lnkmtd::lnkmt_addr_mode_do_downgateway4(369): ---> added route vd(root), oif=port1(3) gateway(0.0.0.0) for subnet(0.0.0.0/0)
2025-06-09 11:53:35 lnkmtd::ping_send_msg(435): ---> ping 89.180.243.203 seq_no=3088, icmp id=7, send 20 bytes
2025-06-09 11:53:35 lnkmtd::monitor_proto_peer_send_request(698): ---> L_M_Port1(89.180.243.203:ping) send probe packet, fail count(0)
2025-06-09 11:53:35 lnkmtd::ping_do_addr_up(136): ---> L_M_Port1->89.180.243.203(89.180.243.203), rcvd
2025-06-09 11:53:35 lnkmtd::monitor_peer_recv(2219): ---> L_M_Port1 send time 1749466415s 842915us, revd time 1749466415s 889214us
2025-06-09 11:53:35 lnkmtd::ping_send_msg(435): ---> ping 89.180.243.203 seq_no=3089, icmp id=7, send 20 bytes
2025-06-09 11:53:35 lnkmtd::monitor_proto_peer_send_request(698): ---> L_M_Port1(89.180.243.203:ping) send probe packet, fail count(0)
2025-06-09 11:53:35 lnkmtd::ping_do_addr_up(136): ---> L_M_Port1->89.180.243.203(89.180.243.203), rcvd
Mon Jun 9 11:53:35 WEST 2025
AWS-HA-Active # diag sys link-monitor status
Link Monitor: L_M_Port1, Status: dead, Server num(1), cfg_version=0 HA state: local(dead), shared(dead)

Mon Jun 9 11:53:36 WEST 2025
AWS-HA-Active # diag sys link-monitor status
Link Monitor: L_M_Port1, Status: alive
The status is updated at 11:53:35. In this example, Time = 0.02 * 2 + 0.1 + 10 = 10.14 seconds.
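The statistics behavior described for v7.0.14, v7.2.8, and v7.4.2 can be pictured with a toy model. This is only an illustration under stated assumptions, not how lnkmtd is implemented: the functions `counted_probes` and `packet_loss_percent` are hypothetical, probes are represented as (seconds since failover, replied) pairs, and a probe sent at exactly 10 seconds is assumed to be the first one counted.

```python
COLD_START_SECONDS = 10  # cold start duration stated in the article


def counted_probes(probes, cold_start=COLD_START_SECONDS):
    """Return (seq_no, replied) pairs for probes that enter the statistics.

    probes: iterable of (seconds_since_failover, replied) tuples.
    Assumption: seq_no only advances once the cold start has elapsed.
    """
    seq_no = 0
    counted = []
    for elapsed, replied in probes:
        if elapsed < cold_start:
            continue  # probe is sent, but neither numbered nor counted
        seq_no += 1
        counted.append((seq_no, replied))
    return counted


def packet_loss_percent(counted):
    """Packet loss over the counted probes only."""
    if not counted:
        return 0.0
    lost = sum(1 for _, replied in counted if not replied)
    return 100.0 * lost / len(counted)


# Three probes fail during the cold start, then replies resume.
probes = [(0, False), (5, False), (9, False), (10, True), (11, True)]
counted = counted_probes(probes)
print(counted)                       # [(1, True), (2, True)]
print(packet_loss_percent(counted))  # 0.0
```

In this model, the three probes lost while addresses are still moving to the new primary leave the packet loss figure at 0, which is the effect the seq_no freeze is meant to achieve.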
|