Support Forum
The Forums are a place to find answers on a range of Fortinet products from peers and product experts.
veechee
New Contributor

Interesting way that FortiGate began to drop all traffic on a VLAN interface / Bug?

I just overcame an interesting issue last night with a remote office.  The unit is an FGT-60C running 4.3.18, with no significant changes made for a long time.

 

The relevant setup details:

  • Domain VLAN and Guest VLAN trunked to one switch port on the FGT.
  • 3 WAN lines (1 x Ethernet with its own subnet/static IP, 2 x PPPoE DSL with dynamically assigned static IPs)
  • WAN lines are in an Interface group
  • On the Guest VLAN, there are several Policy Routes that direct traffic from specific IP addresses out to a particular interface and gateway.  (This is so video conferencing uses a particular line, and so that mobile phone femtocells each get a dedicated DSL line to operate on.)

     

    What was the problem?

  • All traffic on the Guest VLAN ceased to work.  Clients could connect to the Wi-Fi (besides the video conferencing and mobile phone femtocells, all clients on the Guest VLAN are wireless) and would get assigned an IP address via DHCP.  Clients could even resolve DNS (DNS points to the FGT interface address, and the FGT then resolves using the system DNS servers).

     

    What did I initially think was the issue?

  • The problem began around the time of a major firmware update to the wireless AP in the office, so my initial assumption was that something had gone wrong with the AP.  However, the same firmware on another AP in another office was working perfectly.
  • I exhaustively troubleshot the AP (keep in mind, this was done remotely, so I was reliant on non-technical users for feedback), but could find no issue.  I started to think it was hardware until I realized the video conferencing and femtocells were also down.  (The femtocells are great for troubleshooting because I can easily see them making a connection on the FGT, and they have a single light that is green, yellow or red depending on the status, so it is also easy for users to report back to me.)

     

    What was the actual issue and fix?

  • The ISP that provides the 2 x PPPoE DSL lines changed the default gateway without any announcement.  (I have monitors in place to make sure the interfaces are up from the outside, but the static IP addresses themselves did not change.)
  • After combing through the settings on the FGT, I found that the gateways in the two policy routes that used the PPPoE DSL lines didn't match the default gateways shown on the interfaces.
  • I updated the gateway on the policy routes to match what was showing in the interface settings, and traffic immediately started to flow normally again.
  • It may be that the specific gateway change contributed to this, so I will post the settings (xxx is the same on all):

    Old settings:
    DSL1 - IP: xxx.28.118.230 / Gateway: xxx.28.118.254
    DSL2 - IP: xxx.28.82.254 / Gateway: xxx.28.82.254

    New gateway settings:
    DSL1 - IP: xxx.28.118.230 / Gateway: xxx.28.124.253
    DSL2 - IP: xxx.28.82.254 / Gateway: xxx.28.124.253

    So as you can see above, the gateway became the same for both DSL lines, and the ISP enlarged the subnet from a /24 to perhaps a /18.
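The "perhaps a /18" estimate can be checked with a little arithmetic. The sketch below uses Python's standard ipaddress module to find the smallest single subnet that would contain a given set of addresses. The first octet is redacted as xxx in the post, so 10 is used here purely as a hypothetical stand-in (it does not change the prefix length). Treating the addresses as one flat subnet is also an assumption, since the routing table later in the thread shows only /32 entries.

```python
import ipaddress


def smallest_common_subnet(addresses):
    """Return the smallest IPv4 network that contains every address given."""
    values = [int(ipaddress.IPv4Address(a)) for a in addresses]
    # Bits that differ between the lowest and highest address cannot be
    # part of a shared prefix; everything above the top differing bit can.
    differing = min(values) ^ max(values)
    prefix_len = 32 - differing.bit_length()
    host_bits = 32 - prefix_len
    network_int = (min(values) >> host_bits) << host_bits
    return ipaddress.ip_network((network_int, prefix_len))


# "10" is a hypothetical stand-in for the redacted first octet (xxx).
dsl1_ip = "10.28.118.230"
dsl2_ip = "10.28.82.254"
old_gw_dsl1 = "10.28.118.254"
new_gateway = "10.28.124.253"

old_dsl1 = smallest_common_subnet([dsl1_ip, old_gw_dsl1])
new_dsl1 = smallest_common_subnet([dsl1_ip, new_gateway])
new_both = smallest_common_subnet([dsl1_ip, dsl2_ip, new_gateway])
print(old_dsl1, new_dsl1, new_both)
```

With these values the old DSL1 address and gateway fit well inside a /24, the new DSL1 pair needs at least a /20, and covering both DSL addresses plus the shared gateway needs a /18, which is consistent with the estimate in the post.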

    Why was this worth posting?

  • It does not seem like 'normal' behaviour to me that, if a policy route for a single IP address has a bad gateway address, all traffic traversing the interface dies.  Is this a bug?
  • Where I got really lucky here was that I do not use policy routes to the WAN on the Domain VLAN.  If that had died on me, I would have had no way to even look at this problem remotely, and I don't know how I would ever have concluded that this was the issue if I hadn't seen it for myself!  One of the reasons I maintain multiple Internet connections is so I always have a way in!  There were already some other reasons I wanted to undo the WAN zone and manage each connection with separate policies/rules, but this now seals the deal for me.  (I am assuming that if they were separate, I wouldn't have lost traffic across WAN1, which is not PPPoE DSL and had no change that should have stopped it passing traffic.)  Next time I am at this remote office, I am going to work a weekend and separate the connections out!

     

    emnoc
    Esteemed Contributor III

    veechee wrote:
  • The ISP that provides the 2 x PPPoE DSL lines changed the default gateway without any announcement.  (I have monitors in place to make sure the interfaces are up from the outside, but the static IP addresses themselves did not change.)
  • After combing through the settings on the FGT, I found that the gateways in the two policy routes that used the PPPoE DSL lines didn't match the default gateways shown on the interfaces.
  • I updated the gateway on the policy routes to match what was showing in the interface settings, and traffic immediately started to flow normally again.

    You had changes at your ISP, so why would you expect this to be a bug?

     

    A diag debug flow output might have shed some light, as would a diag ip arp list for the next-hop gateway's address.

    If I'm understanding your current setup correctly, you now have 2 x DSL from the same provider, both with the same next-hop gateway address?

     

     

    PCNSE NSE StrongSwan
    veechee
    New Contributor

    emnoc,

     

    When the ISP settings change, they populate automatically onto the interfaces.

    So while the settings for the ISP were correct, and the Domain VLAN passed all traffic normally, the Guest VLAN passed no traffic whatsoever, just because the policy routes, which only applied to two specific IPs/devices, had an incorrect gateway.  I can't believe that is expected behaviour, hence my question whether it is a bug.
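The mismatch described here, where the interface picks up the new gateway automatically but a policy route keeps the old one, is the kind of drift a small periodic check could flag. The following is a minimal Python sketch, not FortiOS code: the PolicyRoute fields, the source addresses, the wan1 gateway and the idea of feeding it data exported from the unit are all assumptions made for illustration, and 10 again stands in for the redacted first octet.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PolicyRoute:
    """Hypothetical model of one policy route (not a FortiOS structure)."""
    source: str         # device the route applies to
    out_interface: str  # outgoing interface named in the route
    gateway: str        # gateway entered by hand in the route


def find_stale_gateways(policy_routes, interface_gateways):
    """Return (route, current_gateway) pairs for routes whose gateway
    differs from the one their outgoing interface currently reports."""
    stale = []
    for route in policy_routes:
        current = interface_gateways.get(route.out_interface)
        # Skip interfaces we have no data for; flag only real mismatches.
        if current is not None and current != route.gateway:
            stale.append((route, current))
    return stale


# Illustrative data only: "10" replaces the redacted first octet, and the
# source addresses and the wan1 gateway are made up.
interface_gateways = {
    "ppp1": "10.28.124.253",
    "ppp2": "10.28.124.253",
    "wan1": "192.0.2.1",
}
policy_routes = [
    PolicyRoute("192.168.20.10", "ppp1", "10.28.118.254"),
    PolicyRoute("192.168.20.11", "ppp2", "10.28.82.254"),
    PolicyRoute("192.168.20.12", "wan1", "192.0.2.1"),
]

for route, current in find_stale_gateways(policy_routes, interface_gateways):
    print(f"{route.source} via {route.out_interface}: "
          f"policy route has {route.gateway}, interface reports {current}")
```

In this made-up data set the two PPPoE policy routes are reported and the wan1 route is not, which mirrors the situation in the thread. How the two inputs would be collected from a real unit is left open.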

    emnoc
    Esteemed Contributor III

     

    veechee wrote:
    When the ISP settings change, they populate automatically onto the interfaces.

     

     

    Questions for clarity:

    Q1: So you have DHCP on the ISP-facing interface?

    Q2: Are both interfaces assigned in the same local broadcast domain, with the same netmask and layer-3 next-hop?

    Q3: On the PBR with the bad gateway, was the gateway outside the layer-3 netmask?

    Q4: In your PBR cfg, did you define both the gateway and the output device, or just one of them?

    Q5: I know it's late now, but from a case perspective could you re-create the PBR route as it was after the ISP change for, let's say, one destination, run a diag debug flow, and also look at the layer-2 ARP entry with diag ip arp list?

    I think the bug might be if both interfaces were DHCP dynamically assigned and within the same subnet (see Q1).  If this happened, that would not be a good thing.

     

     

    veechee
    New Contributor

    emnoc wrote:
    Q1: So you have DHCP on the ISP-facing interface?

    Yes, but via PPPoE, so it is PPPoE that is selected on the interface, not the DHCP option.  No settings besides the PPPoE username and password are hard-coded by me on the FGT.

    emnoc wrote:
    Q2: Are both interfaces assigned in the same local broadcast domain, with the same netmask and layer-3 next-hop?

    They always had the same next-hop, but prior to the unannounced change they had different default gateways.  Now they have the same default gateway also.

    emnoc wrote:
    Q3: On the PBR with the bad gateway, was the gateway outside the layer-3 netmask?

    With PPPoE I don't see what the netmask of the default gateway is, but judging by the separation between the IPs I am assigned and the gateway, it seems to me it is fairly large.  In the routing table, the default gateway is a /32, as are my IPs.  Here is what it shows right now:

    Connected xxx.28.118.230/32 0 0 0.0.0.0 ppp1
    Connected xxx.28.82.108/32 0 0 0.0.0.0 ppp2
    Connected xxx.28.124.253/32 0 0 0.0.0.0 ppp1
    Connected xxx.28.124.253/32 0 0 0.0.0.0 ppp2
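The table above lists the same gateway /32 as a connected route on both ppp1 and ppp2, which is the overlap emnoc asks about in Q2. A small Python sketch of how that could be spotted from pasted routing-table lines follows. It assumes the six whitespace-separated columns seen in this post, does not interpret the middle columns, and is not tied to any FortiOS API.

```python
from collections import defaultdict


def shared_connected_prefixes(route_lines):
    """Map each connected prefix that appears on more than one interface
    to the list of interfaces carrying it."""
    by_prefix = defaultdict(list)
    for line in route_lines:
        fields = line.split()
        # Expected layout: type, prefix, three uninterpreted columns, interface.
        if len(fields) < 6 or fields[0] != "Connected":
            continue
        prefix, interface = fields[1], fields[-1]
        by_prefix[prefix].append(interface)
    return {p: ifs for p, ifs in by_prefix.items() if len(ifs) > 1}


# Lines copied from the post; the redacted "xxx" octet is kept as-is
# because the prefixes are only compared as strings.
routes = [
    "Connected xxx.28.118.230/32 0 0 0.0.0.0 ppp1",
    "Connected xxx.28.82.108/32 0 0 0.0.0.0 ppp2",
    "Connected xxx.28.124.253/32 0 0 0.0.0.0 ppp1",
    "Connected xxx.28.124.253/32 0 0 0.0.0.0 ppp2",
]

shared = shared_connected_prefixes(routes)
print(shared)
```

Only the gateway prefix is reported as shared, while the two assigned addresses each stay on a single interface.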

                                   

    emnoc wrote:
    Q4: In your PBR cfg, did you define both the gateway and the output device, or just one of them?

    I think what you're asking is whether I defined a destination address/mask or not?  I only defined the incoming interface, source address (i.e., the specific device I wanted to use the PBR), outgoing interface and gateway address.

    emnoc wrote:
    Q5: I know it's late now, but from a case perspective could you re-create the PBR route as it was after the ISP change for, let's say, one destination, run a diag debug flow, and also look at the layer-2 ARP entry with diag ip arp list?

    Sorry, but this office is across an ocean, so it would be irresponsible to try to re-create such a serious issue.  If I had lost the Domain VLAN as well because of this, I probably would have been getting on a plane to sort this out, because everything would have died.

    emnoc wrote:
    I think the bug might be if both interfaces were DHCP dynamically assigned and within the same subnet (see Q1).  If this happened, that would not be a good thing.

    I agree with this.  I am sure it is a very specific set of circumstances that led to all the traffic dying as it did.

    emnoc
    Esteemed Contributor III

    veechee wrote:
    They always had the same next-hop, but prior to the unannounced change they had different default gateways.  Now they have the same default gateway also.

    FWIW, a next-hop == gateway.

    You can open a ticket with TAC, provide the before and after cfg along with the related before and after network changes, and see what they say.
