Re: linux 5.17.1 disregarding ACK values resulting in stalled TCP connections

From: Neal Cardwell
Date: Sat Apr 02 2022 - 12:29:35 EST


)

On Sat, Apr 2, 2022 at 4:42 AM Jaco Kroon <jaco@xxxxxxxxx> wrote:
>
> Hi Neal,
>
> On 2022/04/01 17:39, Neal Cardwell wrote:
> > On Tue, Mar 29, 2022 at 9:03 PM Jaco <jaco@xxxxxxxxx> wrote:
> > ...
> >> Connection setup:
> >>
> >> 00:56:17.055481 IP6 2c0f:f720:0:3:d6ae:52ff:feb8:f27b.59110 > 2a00:1450:400c:c07::1b.25: Flags [S], seq 956633779, win 62580, options [mss 8940,nop,nop,TS val 3687705482 ecr 0,nop,wscale 7,tfo cookie f025dd84b6122510,nop,nop], length 0
> >>
> >> 00:56:17.217747 IP6 2a00:1450:400c:c07::1b.25 > 2c0f:f720:0:3:d6ae:52ff:feb8:f27b.59110: Flags [S.], seq 726465675, ack 956633780, win 65535, options [mss 1440,nop,nop,TS val 3477429218 ecr 3687705482,nop,wscale 8], length 0
> >>
> >> 00:56:17.218628 IP6 2a00:1450:400c:c07::1b.25 > 2c0f:f720:0:3:d6ae:52ff:feb8:f27b.59110: Flags [P.], seq 726465676:726465760, ack 956633780, win 256, options [nop,nop,TS val 3477429220 ecr 3687705482], length 84: SMTP: 220 mx.google.com ESMTP e16-20020a05600c4e5000b0038c77be9b2dsi226281wmq.72 - gsmtp
> >>
> >> 00:56:17.218663 IP6 2c0f:f720:0:3:d6ae:52ff:feb8:f27b.59110 > 2a00:1450:400c:c07::1b.25: Flags [.], ack 726465760, win 489, options [nop,nop,TS val 3687705645 ecr 3477429220], length 0
> >>
> >> This is pretty normal, we advertise an MSS of 8940 and the return is 1440, thus
> >> we shouldn't send segments larger than that, and they "can't". I need to
> >> determine if this is some form of offloading or they really are sending >1500
> >> byte frames (which I know won't pass our firewalls without fragmentation so
> >> probably some form of NIC offloading - which if it was active on older 5.8
> >> kernels did not cause problems):
> > Jaco, was there some previous kernel version on these client machines
> > where this problem did not show up? Perhaps the v5.8 version you
> > mention here? Can you please share the exact version number?
> 5.8.14

Thanks for the client kernel version! (5.8.14)

> > If so, a hypothesis would be:
> >
> > (1) There is a bug in netfilter's handling of TFO connections where
> > the server sends a data packet after a TFO SYNACK, before the client
> > ACKs anything (as we see in this trace).
> >
> > This bug is perhaps similar in character to the bug fixed by Yuchung's
> > 2013 commit that Eric mentioned:
> >
> > 356d7d88e088687b6578ca64601b0a2c9d145296
> > netfilter: nf_conntrack: fix tcp_in_window for Fast Open
> >
> > (2) With kernel v5.8, TFO blackhole detection detected that in your
> > workload there were TFO connections that died due to apparent
> > blackholing (like what's shown in the trace), and dynamically disabled
> > TFO on your machines. This allowed mail traffic to flow, because the
> > netfilter bug was no longer tickled. This worked around the netfilter
> > bug.
> >
> > (3) You upgraded your client-side machine from v5.8 to v5.17, which
> > has the following commit from v5.14, which disables TFO blackhole
> > logic by default:
> > 213ad73d0607 tcp: disable TFO blackhole logic by default
> >
> > (4) Due to (3), the blackhole detection logic was no longer operative,
> > and when the netfilter bug blackholed the connection, TFO stayed
> > enabled. This caused mail traffic to Google to stall.
> >
> > This hypothesis would explain why:
> > o disabling TFO fixes this problem
> > o you are seeing this with a newer kernel (and apparently not with a
> > kernel before v5.14?)
> Agreed.
> >
> > With this hypothesis, we need several pieces to trigger this:
> >
> > (a) client side software that tries TFO to a server that supports TFO
> > (like the exim mail transfer agent you are using, connecting to
> > Google)
> >
> > (b) a client-side Linux kernel running buggy netfilter code (you are
> > running netfilter)
> >
> > (c) a client-side Linux kernel with TFO support but no blackhole
> > detection logic active (e.g. v5.14 or later, like your v5.17.1)
> >
> > That's probably a rare combination, so would explain why we have not
> > had this report before.
> >
> > Jaco, to provide some evidence for this hypothesis, can you please
> > re-enable fastopen but also enable the TFO blackhole detection that
> > was disabled in v5.14 (213ad73d0607), with something like:
> >
> > sysctl -w net.ipv4.tcp_fastopen=1
> > sysctl -w tcp_fastopen_blackhole_timeout=3600
>
> Done.

Thanks for running that experiment and reporting your data! That was
super-informative. So it seems like we have a working high-level
theory about what's going on and where, and we just need to pinpoint
the buggy lines in the netfilter conntrack code running on the mail
client machines.

> Including sysctl net.netfilter.nf_conntrack_log_invalid=6- which
> generates lots of logs, something specific I should be looking for? I
> suspect these relate:
>
> [Sat Apr 2 10:31:53 2022] nf_ct_proto_6: SEQ is over the upper bound
> (over the window of the receiver) IN= OUT=bond0
> SRC=2c0f:f720:0000:0003:d6ae:52ff:feb8:f27b
> DST=2a00:1450:400c:0c08:0000:0000:0000:001a LEN=2928 TC=0 HOPLIMIT=64
> FLOWLBL=867133 PROTO=TCP SPT=48920 DPT=25 SEQ=2689938314 ACK=4200412020
> WINDOW=447 RES=0x00 ACK PSH URGP=0 OPT (0101080A2F36C1C120EDFB91) UID=8
> GID=12
> [Sat Apr 2 10:31:53 2022] nf_ct_proto_6: SEQ is over the upper bound
> (over the window of the receiver) IN= OUT=bond0
> SRC=2c0f:f720:0000:0003:d6ae:52ff:feb8:f27b
> DST=2a00:1450:400c:0c08:0000:0000:0000:001a LEN=2928 TC=0 HOPLIMIT=64
> FLOWLBL=867133 PROTO=TCP SPT=48920 DPT=25 SEQ=2689941170 ACK=4200412020
> WINDOW=447 RES=0x00 ACK PSH URGP=0 OPT (0101080A2F36C1C120EDFB91) UID=8
> GID=12
>
> (There are many more of those, and the remote side is Google in this case)

FWIW those log entries indicate netfilter on the mail client machine
dropping consecutive outbound skbs with 2*MSS of payload. So that
explains the large consecutive losses of client data packets to the
e-mail server. That seems to confirm my earlier hunch that those drops
of consecutive client data packets "do not look like normal congestive
packet loss".

neal