Re: linux 5.17.1 disregarding ACK values resulting in stalled TCP connections

From: Neal Cardwell
Date: Sat Apr 02 2022 - 14:31:21 EST


On Sat, Apr 2, 2022 at 12:32 PM Eric Dumazet <edumazet@xxxxxxxxxx> wrote:
>
> On Sat, Apr 2, 2022 at 9:29 AM Neal Cardwell <ncardwell@xxxxxxxxxx> wrote:
> >
> > FWIW those log entries indicate netfilter on the mail client machine
> > dropping consecutive outbound skbs with 2*MSS of payload. So that
> > explains the large consecutive losses of client data packets to the
> > e-mail server. That seems to confirm my earlier hunch that those drops
> > of consecutive client data packets "do not look like normal congestive
> > packet loss".
>
>
> This also explains why we have all these tiny 2-MSS packets in the pcap.
>
> Under normal conditions, autocorking should kick in, allowing TCP to
> build bigger TSO packets.

I have not looked at the conntrack code before today, but AFAICT this
is the buggy section of nf_conntrack_proto_tcp.c:

} else if (((state->state == TCP_CONNTRACK_SYN_SENT
&& dir == IP_CT_DIR_ORIGINAL)
|| (state->state == TCP_CONNTRACK_SYN_RECV
&& dir == IP_CT_DIR_REPLY))
&& after(end, sender->td_end)) {
/*
* RFC 793: "if a TCP is reinitialized ... then it need
* not wait at all; it must only be sure to use sequence
* numbers larger than those recently used."
*/
sender->td_end =
sender->td_maxend = end;
sender->td_maxwin = (win == 0 ? 1 : win);

tcp_options(skb, dataoff, tcph, sender);

Note that the tcp_options() function implicitly assumes it is being
called on a SYN, because it sets state->td_scale to 0 and only sets
state->td_scale to something non-zero if it sees a wscale option. So
if we ever call that on an skb that's not a SYN, we will forget that
the connection is using the wscale option.

But at this point in the code it is calling tcp_options() without
first checking that this is a SYN.

For this TFO scenario like the one in the trace, where the server
sends its first data packet after the SYNACK packet and before the
client's first ACK, presumably the conntrack state machine is
(correctly) SYN_RECV, and then (incorrectly) executes this code,
including the call to tcp_options(), on this first data packet, which
has no SYN bit, and no wscale option. Thus tcp_options() zeroes out
the server's sending state td_scale and does not set it to a non-zero
value. So now conntrack thinks the server is not using the wscale
option. So when conntrack interprets future receive windows from the
server, it does not scale them (with: win <<= sender->td_scale;), so
in this scenario the estimated right edge of the server's receive
window (td_maxend) is never advanced past the roughly 64KB value
offered in the SYN. Thus when the client sends data packets beyond
64KBytes, conntrack declares them invalid and drops them, due to
failing the condition Eric noted above:

before(seq, sender->td_maxend + 1),

This explains my previous observation that the client's original data
packet transmissions are always dropped after the first 64KBytes.

Someone more familiar with conntrack may have a good idea about how to
best fix this?

neal