Large transfer fails because sessions keep getting dropped

Brian_M · ‎01-29-2023

We have a large transfer that we are sending through our FortiGate but it always fails because the sessions are getting dropped on the firewall.

We are using MSP360 to transfer to Backblaze (B2). The backups are broken into different jobs with the largest being ~12TB, and the smallest being ~400GB (total for all is ~40TB). The jobs are all set to manual, and never run together.

No matter which one we start, it always starts just fine but eventually we will begin seeing a flood of TCP retransmits, and shortly after, the job fails. The firewall shows an implicit deny because of "no session matched". This is very strange, because on the sniffer we can see the transfer was actively going. Even in the backup logs, we can see it was in the middle of sending the multi-part upload, then suddenly it logs the error "Operation too slow. Less than 0 bytes transferred the last 60 seconds".

The biggest problem with troubleshooting this is that it never fails at the same place. Sometimes a job will run for 5 minutes, and sometimes it'll run for 2 days. This also makes captures extremely difficult because even if we slice to header, it's still 16K packets per second. Nevertheless, we were lucky enough to get a capture during one of the times when the job failed quickly, and it shows data being transferred and then suddenly stopping. What's really interesting is that no matter how many connections it had open, *all* of them fail at the exact same time. This is corroborated by the firewall logs, which then contain thousands of the "no session matched" errors (presumably from the server retrying before it gives up).

We know it's the firewall because if we bypass it, then the jobs work fine (all of them). We do have a ticket open with Fortinet but, TBH, I think they're running out of ideas because it's gotten to the point where they're not even reading the notes/updates anymore and we're losing hope in a good resolution.

The firewall is a 61F, which we know has an inspection limit of 700Mbps, but we have the backup capped at 500Mbps, and we've created a policy rule that exempts these jobs from inspection altogether (Fortinet verified the policy was created correctly and is correctly matching the traffic). There is hardly any other traffic on this network and the CPU never even reaches 10%. Also, we have the TCP timeout on this policy set to 2 hours.

Here is some additional information:

FortiGate version = 7.2.3
Internet bandwidth = 1Gbps symmetric
Destination IPs = Multiple (but we have all of them in an address group)
Destination port = 443
SSL policy = no-inspection (the factory policy)
Max number of concurrent sessions (observed) = 68
Maximum number of concurrent connections (configured) = 64 (but we've tried 8,12,16,24,32)
Frequency of session rotation (observed) = 1 new session every 5-6 minutes
Minimum chunk size (configured) = 60MB (but we've tried 10,20,64,100,120,250)

Any ideas on what to try next would be appreciated! If you need more info, just let me know.

Thank you!!

Brian_M · ‎02-07-2023

I wanted to post a followup to this because disabling the TCP-SYN requirement for the policy seems to have done the trick. We've had more than 100TB pass through the firewall on this policy and there have not been any more errors.

Since that policy only matches traffic for our backups (and we don't have inspection on it anyway), we're going to leave the setting in place. For anyone else who's experiencing dropped sessions on long-lived connections: if adjusting the session timers doesn't work for you, you can try enabling TCP sessions without SYN.

config system settings
    set tcp-session-without-syn enable
end

config firewall policy
    edit {id}
        set tcp-session-without-syn all
end

Credit to @Stelios_FTNT because there's also a more detailed forum post related to this setting.

https://community.fortinet.com/t5/FortiGate/Technical-Note-Enable-creation-of-TCP-session-on-the-fir...


AEK · ‎01-30-2023

Hello Brian

If you have SD-WAN interface then can you try without SD-WAN?

AEK

gfleming · ‎01-30-2023

If Backblaze is setting up new parallel sessions (sounds like it given your frequency of session rotation count) to service the backups, then is it possible it is going back to old sessions that it was previously using but they are now timed out and no longer in the FGT session table? Do your logs show this?

You should be able to confirm by seeing which Backblaze IPs are causing the "no session matched" logs. You should also be able to go back to your historical traffic logs and see if there was a previous session opened to the same IP for this backup session.
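
One way to check this from the CLI, as a rough sketch only: look at whether the FortiGate still holds a session for a given Backblaze IP, then run a short flow trace while the backup is retrying. The address 203.0.113.10 is a placeholder (not an IP from this thread), and the trace count of 100 is an arbitrary choice.

```
# Placeholder IP: substitute one of the Backblaze destination addresses
diagnose sys session filter clear
diagnose sys session filter dst 203.0.113.10
diagnose sys session filter dport 443
diagnose sys session list

# Short flow trace limited to the same placeholder IP and port
diagnose debug reset
diagnose debug flow filter addr 203.0.113.10
diagnose debug flow filter port 443
diagnose debug flow show function-name enable
diagnose debug flow trace start 100
diagnose debug enable

# When finished, stop the trace and turn debugging off
diagnose debug flow trace stop
diagnose debug disable
```

If the session list comes back empty for an IP that the backup client still considers connected, and the trace reports "no session matched" for that same IP, that would support the timed-out-session theory.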

If so, from Backblaze's perspective this IP is still serviceable for backup communications and it skips the TCP setup, assuming your client still has the session open (no one sent a FIN or RST packet... yet). But from the FortiGate's perspective it's not a current connection, as it has been timed out, so you get the errors and RSTs from the FortiGate because the FortiGate wants a TCP handshake now.

I would assume this is what's happening, and if you can corroborate it in your logs per the above, then the best thing to do is create a new policy for Backblaze using ISDB or manually specified Backblaze IP ranges, and set the TTL (set session-ttl) for sessions using this policy to something like 72 hours or possibly longer.
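
As a minimal sketch of the TTL part, assuming the Backblaze policy already exists: `<policy-id>` is a placeholder for that policy's ID, and 259200 is simply 72 hours expressed in seconds.

```
config firewall policy
    # <policy-id> is a placeholder for the ID of the Backblaze policy
    edit <policy-id>
        # 72 hours x 3600 seconds = 259200
        set session-ttl 259200
    next
end
```

A per-policy TTL only affects sessions that match this policy, so the rest of the firewall keeps its default timers.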

Cheers,
Graham

Brian_M · ‎01-30-2023

Thank you for the suggestions!

I don't have SD-WAN enabled, but I did find the articles talking about SD-WAN in conjunction with the "no session matched" errors.

I think Graham is on the right track and that the backup software is attempting to re-use old connections, and indeed, I can even see this in the backup logs: "Re-using existing connection! (#736) with host s3.us-west-000.backblazeb2.com"

This did lead us to the session timeouts and we tried high numbers like 86400 (which didn't work) and even tested never (set session-ttl never), but that didn't work either. Unsurprisingly, this also left a lot of orphaned sessions on the firewall.

We decided to come at this from the opposite direction and disabled the TCP-SYN requirement on the policy (set tcp-session-without-syn all). *So far* that seems to be working, and we have a backup that's been running for nearly 24 hours without error.

Thank you for the help, and I will keep you informed if something changes!
