Re: linux 5.17.1 disregarding ACK values resulting in stalled TCP connections
From: Neal Cardwell
Date: Fri Apr 01 2022 - 11:44:47 EST
On Thu, Mar 31, 2022 at 7:06 PM Jaco Kroon <jaco@xxxxxxxxx> wrote:
>
> Hi Neal,
>
> This sniff was grabbed ON THE CLIENT HOST. There is no middlebox or
> anything between the sniffer and the client. Only the firewall on the
> host itself, where we've already establish the traffic is NOT DISCARDED
> (at least not in filter/INPUT).
Yes, understood. Please excuse my general use of the term
"firewalls/middleboxes" even where in some contexts it's clear the
"middleboxes" aspect of that term could not apply. :-)
> Setup on our end:
>
> 2 x routers, usually each with a direct peering with Google (which is
> being ignored at the moment so instead traffic is incoming via IPT over DD).
>
> Connected via switch to
>
> 2 x firewalls, of which ONE is active (they have different networks
> behind them, and could be active / standby for different networks behind
> them - avoiding active-active because conntrackd is causing more trouble
> than it's worth), Linux hosts, using netfilter, has been operating for
> years, no recent kernel upgrades.
>
> 4 x hosts in mail cluster, one of which you're looking at here.
>
> On 2022/03/31 17:41, Neal Cardwell wrote:
> > On Wed, Mar 30, 2022 at 9:04 AM Jaco Kroon <jaco@xxxxxxxxx> wrote:
> > ...
> >> When you state sane/normal, do you mean there is fault with the other
> >> frames that could not be explained by packet loss in one or both of the
> >> directions?
> > Yes.
> >
> > (1) If you look at the attached trace time/sequence plots (from
> > tcptrace and xplot.org) there are several behaviors that do not look
> > like normal congestive packet loss:
> OK. I'm not 100% sure how these plots of yours work, but let's see if I
> can follow your logic here - they mostly make sense. A legend would
> probably help. As I understand the white dots are original transmits,
> green is what has been ACKED. R is retransmits ... what's the S?
"S" is "SACKed", or selectively acknowledged. The SACK blocks below
the green ACK lines are DSACK blocks, for "Duplicate SACKs",
indicating the receiver has already received that sequence range.
> What's the yellow line (I'm guessing receive window as advertised by the
> server)?
Yes, the yellow line is the right edge of the receive window of the server.
> > (a) Literally *all* original transmissions (white segments in the
> > plot) of packets after client sequence 66263 appear lost (are not
> > ACKed). Congestion generally does not behave like that. But broken
> > firewalls/middleboxes do.
> > (See netdev-2022-03-29-tcp-disregarded-acks-zoomed-out.png )
>
> Agreed. So could it be that something in the transit path towards
> Google is actually dropping all of that?
It could be. Or it could be a firewall/middlebox.
> As stated - I highly doubt this is on our network unless newer kernel
> (on mail cluster) is doing stuff which is causing older netfilter to
> drop perhaps? But this doesn't explain why newer kernel retransmits
> data for which it received an ACK.
Yes, I agree that the biggest problem to focus on is the TCP code in
the kernel retransmitting data for which the NIC is receiving ACKs.
> >
> > (b) When the client is retransmitting packets, only packets at
> > exactly snd_una are ACKed. The packets beyond that point are always
> > un-ACKed. Again sounds like a broken firewall/middlebox.
> > (See netdev-2022-03-29-tcp-disregarded-acks-zoomed-in.png )
> No middlebox between packet sniffer and client ... client here is linux
> 5.17.1. Brings me back to the only thing that could be dropping the
> traffic is netfilter on the host, or the kernel doesn't like something
> about the ACK, or kernel is doing something else wrong as a result of
> TFO. I'm not sure which option I like less. Unfortunately I also use
> netfilter for redirecting traffic into haproxy here so can't exactly
> just switch off netfilter.
Given the most problematic aspect of the trace, where the client-side
TCP connection is repeatedly retransmitting packets for which ACKs are
arriving at the NIC (and captured by tcpdump), it seems some software
in your kernel is dropping packets between the network device and the
TCP layer. Given that you mention "the only thing that could be
dropping the traffic is netfilter on the host", it seems like the
netfilter rules or software are buggy.
A guess would be that the netfilter code is getting into a bad state
due to the TFO behavior where there is a data packet arriving from the
server immediately after the SYN/ACK and just before the client sends
its first ACK:
00:00:00.000000 IP6 2c0f:f720:0:3:d6ae:52ff:feb8:f27b.48590 >
2a00:1450:4013:c16::1a.25: S 3451342529:3451342529(0) win 62580 <mss
8940,sackOK,TS val 331187616 ecr 0,nop,wscale 7,Unknown Option
3472da7bfe84[|tcp]>
00:00:00.164295 IP6 2a00:1450:4013:c16::1a.25 >
2c0f:f720:0:3:d6ae:52ff:feb8:f27b.48590: S. 2699962254:2699962254(0)
ack 3451342530 win 65535 <mss 1440,sackOK,TS val 1206542770 ecr
331187616,nop,wscale 8>
# this one is perhaps confusing netfilter?:
00:00:00.001641 IP6 2a00:1450:4013:c16::1a.25 >
2c0f:f720:0:3:d6ae:52ff:feb8:f27b.48590: P. 1:89(88) ack 1 win 256
<nop,nop,TS val 1206542772 ecr 331187616>
00:00:00.000035 IP6 2c0f:f720:0:3:d6ae:52ff:feb8:f27b.48590 >
2a00:1450:4013:c16::1a.25: . 1:1(0) ack 89 win 489 <nop,nop,TS val
331187782 ecr 1206542772>
00:00:00.000042 IP6 2c0f:f720:0:3:d6ae:52ff:feb8:f27b.48590 >
2a00:1450:4013:c16::1a.25: P. 1:24(23) ack 89 win 489 <nop,nop,TS val
331187782 ecr 1206542772>
Re "so can't exactly just switch off netfilter", are there any other
counters or logs you can somehow check for netfilter drops?
> >
> > (c) After the client receives the server's "ack 73403", the client
> > ignores/drops all other incoming packets that show up in the trace.
>
> Agreed. However, if I read your graph correctly, it gets an ACK for
> frame X at ~3.8s into the connection, then for X+2 at 4s, but it keeps
> retransmitting X+2, not X+1?
At t=4s, as I discussed below there are two ACKs that arrive
back-to-back, where the client TCP apparently processes the first but
not the second. That's why it keeps retransmitting the packet beyond
the first ACk but not beyond the second ACK.
>
> >
> > As Eric notes, this doesn't look like a PAWS issue. And it
> > doesn't look like a checksum or sequence/ACK validation issue. The
> > client starts ignoring ACKs between two ACKs that have correct
> > checksums, valid ACK numbers, and valid (identical) sequence numbers
> > and TS val and ecr values (here showing absolute sequence/ACK
> > numbers):
> I'm not familiar with PAWS here. Assuming that the green line is ACKs,
> then at around 4s we get an ACK that basically ACKs two frames in one
> (which is fine from my understanding of TCP), and then the second of
> these frames keeps getting retransmitted going forward, so it's almost
> like the kernel ACKs the *first* of these two frames but not the second.
Again, there are two ACKs, where the client TCP apparently processes
the first but not the second, as discussed here:
> >
> > (i) The client processes this ACK and uses it to advance snd_una:
> > 17:46:49.889911 IP6 (flowlabel 0x97427, hlim 61, next-header TCP
> > (6) payload length: 32) 2a00:1450:4013:c16::1a.25 >
> > 2c0f:f720:0:3:d6ae:52ff:feb8:f27b.48590: . cksum 0x7005 (correct)
> > 2699968514:2699968514(0) ack 3451415932 win 830 <nop,nop,TS val
> > 1206546583 ecr 331191428>
>
> >
> > (ii) The client ignores this ACK and all later ACKs:
> > 17:46:49.889912 IP6 (flowlabel 0x97427, hlim 61, next-header TCP
> > (6) payload length: 32) 2a00:1450:4013:c16::1a.25 >
> > 2c0f:f720:0:3:d6ae:52ff:feb8:f27b.48590: . cksum 0x6a66 (correct)
> > 2699968514:2699968514(0) ack 3451417360 win 841 <nop,nop,TS val
> > 1206546583 ecr 331191428>
> >
Here are those same two ACKs again, shown with absolute time and
relative sequence numbers, to make them easier to parse:
(i) The client processes this ACK and uses it to advance snd_una:
17:46:49.889911 IP6 2a00:1450:4013:c16::1a.25 >
2c0f:f720:0:3:d6ae:52ff:feb8:f27b.48590: . 6260:6260(0) ack 73403 win
830 <nop,nop,TS val 1206546583 ecr 331191428>
(ii) The client ignores this ACK and all later ACKs:
17:46:49.889912 IP6 2a00:1450:4013:c16::1a.25 >
2c0f:f720:0:3:d6ae:52ff:feb8:f27b.48590: . 6260:6260(0) ack 74831 win
841 <nop,nop,TS val 1206546583 ecr 331191428>
neal