Re: linux 5.17.1 disregarding ACK values resulting in stalled TCP connections

From: Eric Dumazet
Date: Sat Apr 02 2022 - 09:24:10 EST


On Sat, Apr 2, 2022 at 1:42 AM Jaco Kroon <jaco@xxxxxxxxx> wrote:
>
> Hi Neal,
>
> On 2022/04/01 17:39, Neal Cardwell wrote:
> > On Tue, Mar 29, 2022 at 9:03 PM Jaco <jaco@xxxxxxxxx> wrote:
> > ...
> >> Connection setup:
> >>
> >> 00:56:17.055481 IP6 2c0f:f720:0:3:d6ae:52ff:feb8:f27b.59110 > 2a00:1450:400c:c07::1b.25: Flags [S], seq 956633779, win 62580, options [mss 8940,nop,nop,TS val 3687705482 ecr 0,nop,wscale 7,tfo cookie f025dd84b6122510,nop,nop], length 0
> >>
> >> 00:56:17.217747 IP6 2a00:1450:400c:c07::1b.25 > 2c0f:f720:0:3:d6ae:52ff:feb8:f27b.59110: Flags [S.], seq 726465675, ack 956633780, win 65535, options [mss 1440,nop,nop,TS val 3477429218 ecr 3687705482,nop,wscale 8], length 0
> >>
> >> 00:56:17.218628 IP6 2a00:1450:400c:c07::1b.25 > 2c0f:f720:0:3:d6ae:52ff:feb8:f27b.59110: Flags [P.], seq 726465676:726465760, ack 956633780, win 256, options [nop,nop,TS val 3477429220 ecr 3687705482], length 84: SMTP: 220 mx.google.com ESMTP e16-20020a05600c4e5000b0038c77be9b2dsi226281wmq.72 - gsmtp
> >>
> >> 00:56:17.218663 IP6 2c0f:f720:0:3:d6ae:52ff:feb8:f27b.59110 > 2a00:1450:400c:c07::1b.25: Flags [.], ack 726465760, win 489, options [nop,nop,TS val 3687705645 ecr 3477429220], length 0
> >>
> >> This is pretty normal, we advertise an MSS of 8940 and the return is 1440, thus
> >> we shouldn't send segments larger than that, and they "can't". I need to
> >> determine if this is some form of offloading or they really are sending >1500
> >> byte frames (which I know won't pass our firewalls without fragmentation so
> >> probably some form of NIC offloading - which if it was active on older 5.8
> >> kernels did not cause problems):
> > Jaco, was there some previous kernel version on these client machines
> > where this problem did not show up? Perhaps the v5.8 version you
> > mention here? Can you please share the exact version number?
> 5.8.14
> >
> > If so, a hypothesis would be:
> >
> > (1) There is a bug in netfilter's handling of TFO connections where
> > the server sends a data packet after a TFO SYNACK, before the client
> > ACKs anything (as we see in this trace).
> >
> > This bug is perhaps similar in character to the bug fixed by Yuchung's
> > 2013 commit that Eric mentioned:
> >
> > 356d7d88e088687b6578ca64601b0a2c9d145296
> > netfilter: nf_conntrack: fix tcp_in_window for Fast Open
> >
> > (2) With kernel v5.8, TFO blackhole detection detected that in your
> > workload there were TFO connections that died due to apparent
> > blackholing (like what's shown in the trace), and dynamically disabled
> > TFO on your machines. This allowed mail traffic to flow, because the
> > netfilter bug was no longer tickled. This worked around the netfilter
> > bug.
> >
> > (3) You upgraded your client-side machine from v5.8 to v5.17, which
> > has the following commit from v5.14, which disables TFO blackhole
> > logic by default:
> > 213ad73d0607 tcp: disable TFO blackhole logic by default
> >
> > (4) Due to (3), the blackhole detection logic was no longer operative,
> > and when the netfilter bug blackholed the connection, TFO stayed
> > enabled. This caused mail traffic to Google to stall.
> >
> > This hypothesis would explain why:
> > o disabling TFO fixes this problem
> > o you are seeing this with a newer kernel (and apparently not with a
> > kernel before v5.14?)
> Agreed.
> >
> > With this hypothesis, we need several pieces to trigger this:
> >
> > (a) client side software that tries TFO to a server that supports TFO
> > (like the exim mail transfer agent you are using, connecting to
> > Google)
> >
> > (b) a client-side Linux kernel running buggy netfilter code (you are
> > running netfilter)
> >
> > (c) a client-side Linux kernel with TFO support but no blackhole
> > detection logic active (e.g. v5.14 or later, like your v5.17.1)
> >
> > That's probably a rare combination, so would explain why we have not
> > had this report before.
> >
> > Jaco, to provide some evidence for this hypothesis, can you please
> > re-enable fastopen but also enable the TFO blackhole detection that
> > was disabled in v5.14 (213ad73d0607), with something like:
> >
> > sysctl -w net.ipv4.tcp_fastopen=1
> > sysctl -w tcp_fastopen_blackhole_timeout=3600
>
> Done.
>
> Including sysctl net.netfilter.nf_conntrack_log_invalid=6- which
> generates lots of logs, something specific I should be looking for? I
> suspect these relate:
>
> [Sat Apr 2 10:31:53 2022] nf_ct_proto_6: SEQ is over the upper bound
> (over the window of the receiver) IN= OUT=bond0
> SRC=2c0f:f720:0000:0003:d6ae:52ff:feb8:f27b
> DST=2a00:1450:400c:0c08:0000:0000:0000:001a LEN=2928 TC=0 HOPLIMIT=64
> FLOWLBL=867133 PROTO=TCP SPT=48920 DPT=25 SEQ=2689938314 ACK=4200412020
> WINDOW=447 RES=0x00 ACK PSH URGP=0 OPT (0101080A2F36C1C120EDFB91) UID=8
> GID=12
> [Sat Apr 2 10:31:53 2022] nf_ct_proto_6: SEQ is over the upper bound
> (over the window of the receiver) IN= OUT=bond0
> SRC=2c0f:f720:0000:0003:d6ae:52ff:feb8:f27b
> DST=2a00:1450:400c:0c08:0000:0000:0000:001a LEN=2928 TC=0 HOPLIMIT=64
> FLOWLBL=867133 PROTO=TCP SPT=48920 DPT=25 SEQ=2689941170 ACK=4200412020
> WINDOW=447 RES=0x00 ACK PSH URGP=0 OPT (0101080A2F36C1C120EDFB91) UID=8
> GID=12
>
> (There are many more of those, and the remote side is Google in this case)
>

Great. This confirms our suspicions.

Please try the following patch that landed in 5.18-rc

f2dd495a8d589371289981d5ed33e6873df94ecc netfilter: nf_conntrack_tcp:
preserve liberal flag in tcp options

CC netfilter folks.

Condition triggering the bug :
before(seq, sender->td_maxend + 1),

I took a look at the code, and it is not clear if td_maxend is
properly setup (or if td_scale is cleared at some point while it
should not)

Alternatively, if conntracking does not know if the connection is
using wscale (or what is the scale), the "before(seq,
sender->td_maxend + 1),"
should not be evaluated/used.

Also, I do not see where td_maxend is extended in tcp_init_sender()

Probably wrong patch, just to point to the code I do not understand yet.

diff --git a/net/netfilter/nf_conntrack_proto_tcp.c
b/net/netfilter/nf_conntrack_proto_tcp.c
index 8ec55cd72572e0cca076631e2cc1c11f0c2b86f6..950082785d61b7a2768559c7500d3aee3aaea7c2
100644
--- a/net/netfilter/nf_conntrack_proto_tcp.c
+++ b/net/netfilter/nf_conntrack_proto_tcp.c
@@ -456,9 +456,10 @@ static void tcp_init_sender(struct ip_ct_tcp_state *sender,
/* SYN-ACK in reply to a SYN
* or SYN from reply direction in simultaneous open.
*/
- sender->td_end =
- sender->td_maxend = end;
- sender->td_maxwin = (win == 0 ? 1 : win);
+ sender->td_end = end;
+ sender->td_maxwin = max(win, 1U);
+ /* WIN in SYN & SYNACK is not scaled */
+ sender->td_maxend = end + sender->td_maxwin;

tcp_options(skb, dataoff, tcph, sender);
/* RFC 1323: