Interesting way that FortiGate began to drop all traffic on a VLAN interface / Bug?

veechee · ‎08-27-2015

I just overcame an interesting issue last night with a remote office. Unit is a FGT-60C running 4.3.18 with no significant changes made for a long time.

The relevant setup details:

[ul]

Domain VLAN and Guest VLAN trunked to one switch port on FGT.

3 WAN lines (1 x Ethernet with own subnet/static IP, 2 x PPPoE DSL with dynamically assigned static IPs)

WAN lines are in an Interface group

On the Guest VLAN, there are several Policy Routes that direct traffic from specific IP addresses out to a particular interface and gateway. (This is so video conferencing uses a particular line, and so that mobile phone femtocells each get a dedicated DSL line to operate on.)[/ul]

What was the problem?

[ul]

All traffic on the Guest VLAN ceased to work. Clients could connect to the Wi-Fi (besides the video conferencing and mobile phone femtocells, all clients on the Guest VLAN are wireless) and would get assigned an IP address via DHCP. Clients could even resolve DNS (DNS points to FGT interface address and FGT then resolves using system DNS servers).[/ul]

What did I initially think was the issue?

[ul]

The problem began around the time of a major firmware update to the wireless AP in the office, so my initial assumption was that something had gone wrong with the AP. However, the same firmware on another AP in another office was working perfectly.

I exhaustively troubleshot the AP (keep in mind this was done remotely, so I was reliant on non-technical users for feedback) but could find no issue. I started to think it was hardware until I realized the video conferencing and femtocells were also down. (The femtocells are great for troubleshooting because I can easily see them making a connection on the FGT, and they have a single light that is green, yellow or red depending on status, so it is also easy for users to report back to me.)[/ul]

What was the actual issue and fix?

[ul]

The ISP that provides the 2 x PPPoE DSL lines changed the default gateway without any announcement. (I have monitors in place to make sure the interfaces are up from the outside, but the static IP addresses themselves did not change.)

After combing through the settings on the FGT, I found that the gateways configured in the two policy routes that used the PPPoE DSL lines no longer matched the default gateways shown on the interfaces.

I updated the default gateway on the policy routes to match what was showing in interface settings, and traffic immediately started to flow normally again.

It may be the specific gateway change contributed to this, so I will post them (xxx is the same on all):

Old settings:
DSL1 - IP: xxx.28.118.230 / Gateway: xxx.28.118.254
DSL2 - IP: xxx.28.82.254 / Gateway: xxx.28.82.254

New gateway settings:
DSL1 - IP: xxx.28.118.230 / Gateway: xxx.28.124.253
DSL2 - IP: xxx.28.82.254 / Gateway: xxx.28.124.253

So as you can see above, the gateway became the same for both DSL lines, and the ISP enlarged the size of the subnet from a /24 to perhaps a /18.[/ul]
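As a rough check on the "/18" guess, the smallest subnet that would hold both an assigned IP and the new gateway can be worked out from the posted addresses. This is only a sketch: the first octet is masked as xxx in the post, so 10 is used as a stand-in (an assumption; it does not change the result because that octet is the same in every address), and the function name is made up for illustration. PPPoE links are point-to-point, so this says how the addresses relate to each other, not what mask the ISP actually uses.

```python
import ipaddress


def smallest_common_prefix(addr_a: str, addr_b: str) -> int:
    """Return the longest prefix length whose subnet contains both IPv4 addresses."""
    a = int(ipaddress.IPv4Address(addr_a))
    b = int(ipaddress.IPv4Address(addr_b))
    # Every bit that differs between the two addresses must sit in the host part.
    return 32 - (a ^ b).bit_length()


# "10" is a stand-in for the masked first octet (assumption, not from the post).
NEW_GATEWAY = "10.28.124.253"

print(smallest_common_prefix("10.28.118.230", NEW_GATEWAY))  # 20
print(smallest_common_prefix("10.28.82.254", NEW_GATEWAY))   # 18
```

DSL1 and the new gateway would fit in a /20, but DSL2 and the new gateway only share a /18, which is consistent with the "/18" estimate if both lines sit in one subnet.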

Why was this worth posting?

[ul]

It does not seem like 'normal' behaviour to me that all traffic traversing the interface dies when a policy route for a single IP address has a bad gateway address. Is this a bug?

Where I got really lucky is that I do not use policy routes to the WAN on the Domain VLAN. If that had died on me, I would have had no way to even look at this problem remotely, and I don't know how I would ever have concluded that this was the issue without seeing it for myself. One of the reasons I maintain multiple Internet connections is so that I always have a way in. There were already other reasons I wanted to undo the WAN zone and manage each connection with separate policies/rules, but this seals the deal for me. (I am assuming that if they were separate, I wouldn't have lost traffic across WAN1, which is not PPPoE DSL and had no change that should have stopped it passing traffic.) Next time I am at this remote office, I am going to work a weekend and separate the connections out.[/ul]

emnoc · ‎08-27-2015

[ul]
The ISP that provides the 2 x PPPoE DSL lines changed the default gateway without any announcement. (I have monitors in place to make sure the interfaces are up from the outside, but the static IP addresses themselves did not change.)
After combing through the settings on the FGT, I found the default gateways didn't match in the two policy routes that used the PPPoE DSL lines.
I updated the default gateway on the policy routes to match what was showing in interface settings, and traffic immediately started to flow normally again.[/ul]

You had changes in your ISP, so how would you expect this to be a bug?

A diag debug flow output might have shed some light, as would a diag ip arp list for the next-hop gateway's address.

If I'm understanding your current setup correctly, you now have 2x DSL from the same provider, both with the same next-hop gateway address?

PCNSE

NSE

StrongSwan

veechee · ‎08-31-2015

emnoc,

When the ISP settings change, they populate automatically onto the interfaces.

So while the settings for the ISP were correct, and the Domain VLAN passed all traffic normally, the Guest VLAN passed no traffic whatsoever, just because the policy routes, which only applied to two specific IPs/devices, had an incorrect gateway. I can't believe that is expected behaviour, hence my question of whether it is a bug.
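The mismatch described here (interfaces pick up the new gateway automatically, policy routes keep the old one) is the sort of thing a small check could flag. The sketch below is hypothetical throughout: the route names, interface names, dictionary layout and the 10.x / 192.0.2.x addresses are placeholders, and how the two inputs would be pulled from the FortiGate is left out entirely.

```python
def find_stale_policy_gateways(policy_routes, interface_gateways):
    """Return (name, configured, learned) for routes whose gateway differs
    from the gateway currently learned on their output interface."""
    stale = []
    for route in policy_routes:
        learned = interface_gateways.get(route["output_device"])
        # Skip interfaces with no known gateway; only report real mismatches.
        if learned is not None and route["gateway"] != learned:
            stale.append((route["name"], route["gateway"], learned))
    return stale


# Hypothetical snapshot: policy routes still hold the old DSL gateways.
policy_routes = [
    {"name": "femtocell-1", "output_device": "dsl1", "gateway": "10.28.118.254"},
    {"name": "femtocell-2", "output_device": "dsl2", "gateway": "10.28.82.254"},
    {"name": "video-conf", "output_device": "wan1", "gateway": "192.0.2.1"},
]
# Hypothetical gateways as currently shown on the interfaces.
interface_gateways = {
    "dsl1": "10.28.124.253",
    "dsl2": "10.28.124.253",
    "wan1": "192.0.2.1",
}

for name, configured, learned in find_stale_policy_gateways(policy_routes, interface_gateways):
    print(f"{name}: policy route uses {configured}, interface shows {learned}")
```

With the placeholder data above, the two DSL-bound routes are reported and the Ethernet one is not.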

emnoc · ‎08-31-2015

When the ISP settings change, they populate automatically onto the interfaces.

Qs for clarity:

Q1: So you have DHCP on the ISP-facing interface?

Q2: Both interfaces are assigned in the same local broadcast domain, with the same netmask and layer-3 next-hop?

Q3: On the PBR with the bad gateway, was the gateway outside the layer-3 netmask?

Q4: In your PBR cfg, did you use both gateway and output device, or just define one?

Q5: I know it's late now, but from a case perspective, could you re-create the post-ISP-change PBR route for, let's say, one destination and run a diag debug flow, and also look at the layer-2 ARP entry (diag ip arp list)?

I think the bug might be if both interfaces were DHCP/dynamically assigned and within the same subnet (see Q1). If that happened, it would not be a good thing.


veechee · ‎09-04-2015

emnoc wrote:
Q1: So you have DHCP on the ISP-facing interface?

Yes, but via PPPoE, so it is PPPoE that is selected on the interface, not the DHCP option. No settings besides the PPPoE username and password are hard-coded by me on the FGT.

emnoc wrote:
Q2: Both interfaces are assigned in the same local broadcast domain, with the same netmask and layer-3 next-hop?

They always had the same next-hop, but prior to the unannounced change they had different default gateways. Now they have the same default gateway also.

emnoc wrote:
Q3: On the PBR with the bad gateway, was the gateway outside the layer-3 netmask?

With PPPoE I can't see what the netmask of the default gateway is, but judging by the separation between the IPs I am assigned and the gateway, it seems to me that it is fairly large. In the routing table, the default gateway is a /32, as are my IPs. Here is what it shows right now:

Connected xxx.28.118.230/32 0 0 0.0.0.0 ppp1
Connected xxx.28.82.108/32 0 0 0.0.0.0 ppp2
Connected xxx.28.124.253/32 0 0 0.0.0.0 ppp1
Connected xxx.28.124.253/32 0 0 0.0.0.0 ppp2
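To make the Q3 point concrete: with /32 connected routes like these, the gateway can never fall "inside" the interface's own subnet, so an in-subnet test cannot tell a good gateway from a bad one. A minimal sketch, again using 10 as a stand-in for the masked first octet (an assumption) and a made-up helper name:

```python
import ipaddress


def gateway_on_link(interface_cidr: str, gateway: str) -> bool:
    """True if the gateway address lies inside the interface's connected subnet."""
    network = ipaddress.ip_interface(interface_cidr).network
    return ipaddress.ip_address(gateway) in network


# As shown in the routing table: a /32 on ppp1, so the gateway is never on-link.
print(gateway_on_link("10.28.118.230/32", "10.28.124.253"))  # False

# Hypothetical /18 mask, for comparison only (the real mask is not visible).
print(gateway_on_link("10.28.118.230/18", "10.28.124.253"))  # True
```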

emnoc wrote:
Q4: In your PBR cfg, did you use both gateway and output device, or just define one?

I think what you're asking is if I defined a destination address/mask or not? I only defined the incoming interface, source address (i.e., specific device I wanted to use the PBR), outgoing interface and gateway address.

emnoc wrote:
Q5: I know it's late now, but from a case perspective, could you re-create the post-ISP-change PBR route for, let's say, one destination and run a diag debug flow, and also look at the layer-2 ARP entry (diag ip arp list)?

Sorry, but this office is across an ocean, so it would be irresponsible to try to re-create such a serious issue. If I had lost the Domain VLAN as well because of this, I probably would have been getting on a plane to sort it out, because everything would have died.

emnoc wrote:
I think the bug might be if both interfaces were DHCP/dynamically assigned and within the same subnet (see Q1). If that happened, it would not be a good thing.

I agree with this. I am sure it is a very specific set of circumstances that led to all the traffic dying as it did.

emnoc · ‎09-05-2015

They always had the same next-hop, but prior to the unannounced change they had different default gateways. Now they have the same default gateway also.

FWIW, next-hop == gateway.

You can open a ticket with TAC, provide the before and after cfg along with the related before and after network changes, and see what they say.

