KordiaRG
New Contributor

Management-ip not accessible on slave node in cluster

Hi all,

 

I'm running a pair of 60Es on 5.6.3 as an HA cluster.  I've set up a VLAN interface for management in the root VDOM, given it an IP, and also given each member a management-ip in the same subnet.  So for example, the VLAN interface is VL100-MGMT, its IP is 10.0.100.10/24, and the two nodes have management-ips of 10.0.100.11/24 and 10.0.100.12/24 respectively.
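
For reference, the config is roughly along these lines on node A (node B just has management-ip 10.0.100.12/24 instead).  The parent port and allowaccess line here are only illustrative, the exact management-ip syntax may vary slightly by build, and the management-ip has to be set on each member individually since it isn't synchronised:

config system interface
    edit "VL100-MGMT"
        set vdom "root"
        set interface "internal"
        set vlanid 100
        set ip 10.0.100.10 255.255.255.0
        set allowaccess ping https ssh
        set management-ip 10.0.100.11/24
    next
end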

 

I can access the cluster's interface IP (10.0.100.10) fine, and node A's management-ip (10.0.100.11) fine.  However, I cannot access node B's management-ip.  A diag sniffer shows no traffic for .12 reaching node B.  The cluster appears otherwise healthy, and the diag sys ha status output looks good.
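
(For reference, the sniffer was something along the lines of the following; the filter values are just the example addresses above, and verbosity 4 prints the interface name with each packet.)

    diagnose sniffer packet VL100-MGMT 'host 10.0.100.12' 4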

 

Looking at the ARP table on the gateway, I can see entries for all three of these addresses, but they all point to the same virtual MAC:

 

10.0.100.10 = 00:09:0f:09:00:03

10.0.100.11 = 00:09:0f:09:00:03

10.0.100.12 = 00:09:0f:09:00:03
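
(That 00:09:0f:09:xx:xx address is the HA virtual MAC.  If it helps anyone, the current versus permanent MAC of the underlying port can be compared on the FortiGate with something like the command below, where the port name is only an example.)

    diagnose hardware deviceinfo nic internal1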

 

The odd thing is, I have an almost identical config on another cluster of 60Es (same version), and that one works fine.  There, the gateway's ARP table shows node B's management-ip against node B's own hardware address, which seems sensible...

 

10.0.100.10 = 00:09:0f:09:00:03

10.0.100.11 = 00:09:0f:09:00:03

10.0.100.12 = 90:6c:ac:0a:0b:0c

 

Anyone else seen this?  A bug?

I was about to log a ticket, but annoyingly this is the 14th site in a national rollout, which means it's around 9 months since purchase and the support contract, which auto-started at purchase, expired a week ago :(

 

Thanks,

Richard

14 REPLIES
KordiaRG
New Contributor

emnoc wrote:

On FW1B, log in and do the following:

  get router info routing connect

  diag ip arp list

 

Do you see anything?

 

Nothing in the routing table:

 

FW01B (root) $ get router info routing-table connected

 

FW01B (root) $ get router info routing-table all

 

FW01B (root) $ diagnose ip arp list
index=33 ifname=VL100-MGMT 10.0.100.10 00:09:0f:09:00:02 state=00000004 use=3879 confirm=9879 update=3879 ref=0
index=14 ifname=internal7 169.254.0.2 90:6c:ac:90:73:d1 state=00000080 use=15917302 confirm=15917302 update=15917302 ref=0
index=13 ifname=internal6 169.254.0.2 90:6c:ac:90:73:d0 state=00000080 use=15921963 confirm=15921786 update=15921786 ref=0

 

Here's from node A for reference:

 

FW01A (root) # get router info routing-table connected

C 10.0.100.0/24 is directly connected, VL100-MGMT
                is directly connected, VL100-MGMT
C 192.168.1.0/24 is directly connected, internal


FW01A (root) # get router info routing-table all
Codes: K - kernel, C - connected, S - static, R - RIP, B - BGP
O - OSPF, IA - OSPF inter area
N1 - OSPF NSSA external type 1, N2 - OSPF NSSA external type 2
E1 - OSPF external type 1, E2 - OSPF external type 2
i - IS-IS, L1 - IS-IS level-1, L2 - IS-IS level-2, ia - IS-IS inter area
* - candidate default

S* 0.0.0.0/0 [10/0] via 10.0.100.1, VL100-MGMT
C 10.0.100.0/24 is directly connected, VL100-MGMT
                is directly connected, VL100-MGMT
C 192.168.1.0/24 is directly connected, internal


FW01A (root) # diagnose ip arp list
index=33 ifname=VL100-MGMT 10.0.100.12 state=00000000 use=3835 confirm=9835 update=3835 ref=0
index=14 ifname=internal7 169.254.0.1 90:6c:ac:90:84:93 state=00000080 use=15956308 confirm=15956308 update=15956308 ref=0
index=33 ifname=VL100-MGMT 10.0.100.1 00:09:0f:09:00:02 state=00000002 use=15 confirm=9 update=3835 ref=23
index=13 ifname=internal6 169.254.0.1 90:6c:ac:90:84:92 state=00000080 use=15960968 confirm=15960927 update=15960927 ref=0

 

Compared with the other cluster I mentioned where this is working, the output looks similar (except that the ARP entry for internal7 is missing).  internal6 and internal7 are my HA sync interfaces.

 

Rich

KordiaRG

Oddly, my monitoring tool has sent me three or four alerts over the past few days where the B node starts responding to pings on its management-ip for about a minute and then stops again.

 

Rich

Toshi_Esumi

I dug up the old TAC ticket from when we worked with TAC on the same slave mgmt interface issue. At the time, we sniffed packets on the slave's interface (reached from the master via exec ha manage) while pinging it; we saw the ARP requests coming in, but the unit never replied:

2017-09-28 19:19:14.954190 mgmt1 -- arp who-has [MGMT1_IP] tell [PING_SRC_IP]

2017-09-28 19:19:21.138929 mgmt1 -- arp who-has [MGMT1_IP] tell [PING_SRC_IP]
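
(For anyone wanting to reproduce that capture: we jumped onto the slave from the master and sniffed there, roughly as below. The subordinate index comes from "execute ha manage ?", and newer builds also prompt for an admin name.)

    execute ha manage 1
    diagnose sniffer packet mgmt1 'arp' 4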

Then we checked the routing table. Ours is a VDOM environment, so we had to run a special command in the management VDOM against the hidden HA management VDOM "vsys_hamgmt":

atl-fg2 (management) # diagnose ip router command show-vrf vsys_hamgmt show ip route

atl-fg2 (management) #

And it was empty.

 

What TAC did to fix the problem, at least temporarily, was to disable ha-mgmt-status in the HA configuration, unset the dedicated management settings for the mgmt1 interface, and then reconfigure those settings again, which is below:

config sys ha
    set ha-mgmt-status enable
    set ha-mgmt-interface "mgmt1"
    set ha-mgmt-interface-gateway x.x.x.x
end
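
(The disable step beforehand was essentially just the inverse, something like the below, before re-entering the block above; whether the interface also needs an explicit unset may depend on the build.)

config sys ha
    set ha-mgmt-status disable
end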

 

Even if this fixes it, the problem would probably come back if some HA event happens in the future. It was a MAC address change problem, which was fixed in 5.4.6, so we haven't experienced it since we upgraded the cluster above 5.4.6, as I wrote before.

KordiaRG

Thanks Toshi.  I may try out some of that.  We're also using VDOMs.

 

We don't use a dedicated management interface in the HA config, but you raise a good point about reapplying the config.  Maybe I should remove the management-ip and re-enter it to see if that fixes it.
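
Something like the below, I guess, and then re-entering it; since management-ip isn't synchronised it would have to be done on node B itself (console or exec ha manage):

config system interface
    edit "VL100-MGMT"
        unset management-ip
    next
end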

 

Otherwise, I'll likely schedule a window and reboot the cluster, and possibly do an upgrade whilst I'm at it.

 

Cheers,

Rich

KordiaRG

ede_pfau wrote:

You might try to set the originating IP address before you ping:

exec ping-option source a.b.c.d

 

Otherwise, it might pick some IP address from a 'nearby' FGT port which just won't fit.

Oddly, the "source" keyword is not available from the standby node.

 

FW01B (root) $ execute ping-options

adaptive-ping Adaptive ping <enable|disable>.
data-size Integer value to specify datagram size in bytes.
df-bit Set DF bit in IP header <yes | no>.
interval Integer value to specify seconds between two pings.
pattern Hex format of pattern, e.g. 00ffaabb.
repeat-count Integer value to specify how many times to repeat PING.
reset Reset settings.
timeout Integer value to specify timeout in seconds.
tos IP type-of-service option.
ttl Integer value to specify time-to-live.
validate-reply Validate reply data <yes | no>.
view-settings View the current settings for PING option.
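
For comparison, on the master (where the keyword does exist) the usual sequence would be something like:

    execute ping-options source 10.0.100.11
    execute ping 10.0.100.12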
