We have a large transfer that we are sending through our FortiGate but it always fails because the sessions are getting dropped on the firewall.
We are using MSP360 to transfer to Backblaze (B2). The backups are broken into different jobs with the largest being ~12TB, and the smallest being ~400GB (total for all is ~40TB). The jobs are all set to manual, and never run together.
No matter which one we start, it always starts just fine but eventually we will begin seeing a flood of TCP retransmits, and shortly after, the job fails. The firewall shows an implicit deny because of "no session matched". This is very strange, because on the sniffer we can see the transfer was actively going. Even in the backup logs, we can see it was in the middle of sending the multi-part upload, then suddenly it logs the error "Operation too slow. Less than 0 bytes transferred the last 60 seconds".
The biggest problem with troubleshooting this is that it never fails at the same place. Sometimes a job will run for 5 minutes, and sometimes it'll run for 2 days. This also makes captures extremely difficult because even if we slice to the headers, it's still 16K packets per second. Nevertheless, we were lucky enough to get a capture during one of the times when the job failed quickly, and it shows data being transferred and then suddenly stopping. What's really interesting is that no matter how many connections it had open, *all* of them fail at the exact same time. This is corroborated by the firewall logs, which then contain thousands of the "no session matched" errors (presumably from the server retrying before it gives up).
We know it's the firewall because if we bypass it, all of the jobs work fine. We do have a ticket open with Fortinet but, TBH, I think they're running out of ideas, because it's gotten to the point where they're not even reading the notes/updates anymore and we're losing hope of a good resolution.
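In case it helps anyone reproduce the "everything dies at once" observation, here is a rough Python sketch of how the deny entries could be bucketed per second from an exported log file. It assumes the log is plain text in key=value form with `date`, `time`, and `msg` fields; those field names, the helper names, and the sample lines (which use documentation IP ranges) are made up for illustration and are not our real logs.

```python
import re
from collections import Counter

# Assumed layout: space-separated key=value pairs, values optionally quoted.
PAIR_RE = re.compile(r'(\w+)=("[^"]*"|\S+)')

def parse_line(line):
    """Turn one key=value log line into a dict, stripping surrounding quotes."""
    return {key: value.strip('"') for key, value in PAIR_RE.findall(line)}

def drops_per_second(lines, needle="no session matched"):
    """Count lines whose msg field contains `needle`, keyed by 'date time'."""
    counts = Counter()
    for line in lines:
        fields = parse_line(line)
        if needle in fields.get("msg", "").lower():
            stamp = f"{fields.get('date', '?')} {fields.get('time', '?')}"
            counts[stamp] += 1
    return counts

# Synthetic sample lines, for illustration only.
sample = [
    'date=2023-03-01 time=10:15:41 srcip=10.0.0.5 dstip=203.0.113.10 dstport=443 msg="accept"',
    'date=2023-03-01 time=10:15:42 srcip=10.0.0.5 dstip=203.0.113.10 dstport=443 msg="no session matched"',
    'date=2023-03-01 time=10:15:42 srcip=10.0.0.5 dstip=203.0.113.11 dstport=443 msg="no session matched"',
    'date=2023-03-01 time=10:15:42 srcip=10.0.0.5 dstip=203.0.113.12 dstport=443 msg="no session matched"',
    'date=2023-03-01 time=10:15:43 srcip=10.0.0.5 dstip=203.0.113.10 dstport=443 msg="no session matched"',
]

if __name__ == "__main__":
    # Print each second that saw denies, in chronological order.
    for stamp, count in sorted(drops_per_second(sample).items()):
        print(stamp, count)
```

If the first second with denies lines up with the moment the capture goes quiet for every connection, that would point at the session table being cleared in one go rather than individual sessions aging out.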
The firewall is a 61F, which we know has an inspection limit of 700Mbps, but we have the backup capped at 500Mbps, and we've created a policy rule that exempts these jobs from inspection altogether (Fortinet verified the policy was created correctly and is correctly matching the traffic). There is hardly any other traffic on this network and the CPU never even reaches 10%. Also, we have the TCP timeout on this policy set to 2 hours.
Here is some additional information:
- FortiGate version = 7.2.3
- Internet bandwidth = 1Gbps symmetric
- Destination IPs = Multiple (but we have all of them in an address group)
- Destination port = 443
- SSL policy = no-inspection (the factory policy)
- Max number of concurrent sessions (observed) = 68
- Maximum number of concurrent connections (configured) = 64 (but we've tried 8,12,16,24,32)
- Frequency of session rotation (observed) = 1 new session every 5-6 minutes
- Minimum chunk size (configured) = 60MB (but we've tried 10,20,64,100,120,250)
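
As a sanity check on the 2-hour TCP timeout, here is a back-of-the-envelope Python sketch of how long a single part would take to upload with the settings above. It assumes the 500Mbps cap is split evenly across all connections and that 1MB = 10^6 bytes; both are simplifying assumptions, and `part_upload_seconds` is just a helper name for this example.

```python
def part_upload_seconds(part_mb, cap_mbps, connections):
    """Seconds to send one part if the cap is shared evenly by all connections."""
    per_conn_mbps = cap_mbps / connections  # assumed even split
    return part_mb * 8 / per_conn_mbps      # megabytes -> megabits, then divide

SESSION_TTL_S = 2 * 60 * 60  # the 2-hour TCP timeout set on the policy
CAP_MBPS = 500               # the configured backup cap

if __name__ == "__main__":
    # A few of the connection counts and chunk sizes we have tried.
    for connections in (8, 16, 32, 64):
        for part_mb in (10, 60, 250):
            secs = part_upload_seconds(part_mb, CAP_MBPS, connections)
            print(f"{connections:>2} conns, {part_mb:>3}MB part: "
                  f"{secs:7.1f}s (TTL {SESSION_TTL_S}s)")
```

Under those assumptions, even the slowest combination (250MB parts over 64 connections) finishes a part in a few minutes, far below the 2-hour timeout, which is part of why a plain idle timeout does not seem to explain the drops.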
Any ideas on what to try next would be appreciated! If you need more info, just let me know.