Re: [PATCH] Revert "tcp: simplify window probe aborting on USER_TIMEOUT"

From: Neal Cardwell
Date: Mon Jan 11 2021 - 09:59:46 EST


On Fri, Jan 8, 2021 at 11:38 PM Enke Chen <enkechen2020@xxxxxxxxx> wrote:
>
> From: Enke Chen <enchen@xxxxxxxxxxxxxxxxxxxx>
>
> This reverts commit 9721e709fa68ef9b860c322b474cfbd1f8285b0f.
>
> With commit 9721e709fa68 ("tcp: simplify window probe aborting
> on USER_TIMEOUT"), the TCP session does not terminate with
> TCP_USER_TIMEOUT when data remain untransmitted due to a zero window.
>
> The number of unanswered zero-window probes (icsk_probes_out) is
> reset to zero with incoming ACKs irrespective of the window size,
> as described in tcp_probe_timer():
>
> RFC 1122 4.2.2.17 requires the sender to stay open indefinitely
> as long as the receiver continues to respond probes. We support
> this by default and reset icsk_probes_out with incoming ACKs.
>
> This counter, however, is the wrong one to be used in calculating the
> duration that the window remains closed and data remain untransmitted.
> Thanks to Jonathan Maxwell <jmaxwell37@xxxxxxxxx> for diagnosing the
> actual issue.
>
> Cc: stable@xxxxxxxxxxxxxxx
> Fixes: 9721e709fa68 ("tcp: simplify window probe aborting on USER_TIMEOUT")
> Reported-by: William McCall <william.mccall@xxxxxxxxx>
> Signed-off-by: Enke Chen <enchen@xxxxxxxxxxxxxxxxxxxx>
> ---
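
To illustrate the counter problem described in the quoted commit
message, here is a hypothetical, simplified C model. It is not kernel
code, and simulate_zero_window() and its parameters are made up for
illustration. It assumes a fixed probe interval with no backoff and a
peer that keeps its receive window at zero. When the peer ACKs every
probe, the counter-based estimate never grows, so the user timeout
never fires. A check based on the true time since the window closed
does fire.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model: the peer keeps its receive window at zero, probes
 * go out at a fixed interval (backoff ignored), and the "counter-based"
 * check estimates elapsed time as probes_out * interval. This is only in
 * the spirit of counter-based timeout modeling, NOT the kernel's code.
 * Returns the true time (ms since the window closed) at which the
 * connection is aborted, or -1 if it is still open at max_ms. */
static long simulate_zero_window(bool use_counter, bool peer_acks_probes,
                                 unsigned int user_timeout_ms,
                                 unsigned int probe_interval_ms,
                                 unsigned int max_ms)
{
    unsigned int probes_out = 0;   /* unanswered zero-window probes */

    assert(probe_interval_ms > 0); /* otherwise the loop never advances */

    for (unsigned int now = 0; now <= max_ms; now += probe_interval_ms) {
        unsigned int elapsed = use_counter
            ? probes_out * probe_interval_ms  /* estimate from the counter */
            : now;                            /* true time window closed */
        if (elapsed >= user_timeout_ms)
            return (long)now;                 /* USER_TIMEOUT: abort */

        probes_out++;                         /* send a zero-window probe */
        if (peer_acks_probes)
            probes_out = 0;                   /* ACK (window still 0) resets it */
    }
    return -1;
}
```

For example, with a 3000ms user timeout and 500ms probes,
simulate_zero_window(true, true, 3000, 500, 60000) returns -1 (never
aborted). simulate_zero_window(false, true, 3000, 500, 60000) returns
3000. If the peer stops ACKing, the counter-based check also fires at
3000.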

I ran this revert commit through our packetdrill TCP tests, and it
causes failures in a zero-window probe (ZWP) / USER_TIMEOUT test due
to interactions with this Jan 2019 patch:

7f12422c4873e9b274bc151ea59cb0cdf9415cf1
tcp: always timestamp on every skb transmission

The issue seems to be that since 7f12422c4873, skb->skb_mstamp_ns is
set on every transmit attempt, so even skbs that were never
successfully transmitted have a non-zero skb_mstamp_ns. If ZWPs
repeatedly fail to be sent due to severe local qdisc congestion, the
probe timer is rearmed every TCP_RESOURCE_PROBE_INTERVAL (500ms), and
each failed attempt refreshes the timestamp. In the check this revert
restores, start_ts is therefore always only about 500ms in the past.
As a result, under severe local qdisc congestion a USER_TIMEOUT above
500ms is effectively a no-op, and the socket can live far past the
USER_TIMEOUT.
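
The restamping problem can be sketched with a similar hypothetical C
model of the sequence described above. Again, this is not kernel code.
simulate_local_congestion() and RESOURCE_PROBE_INTERVAL_MS are
illustrative names, and the latter only mirrors the 500ms
TCP_RESOURCE_PROBE_INTERVAL. Every probe attempt fails locally. The
check stamps start_ts on the first timer firing and aborts once more
than the user timeout has elapsed since start_ts.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/* Illustrative stand-in for TCP_RESOURCE_PROBE_INTERVAL (500ms). */
#define RESOURCE_PROBE_INTERVAL_MS 500u

/* Hypothetical model: the probe timer fires every 500ms because every
 * probe attempt fails due to local qdisc congestion. start_ts plays the
 * role of the head skb's transmit timestamp. Returns the time (ms) at
 * which the socket is aborted, or -1 if it is still alive at max_ms. */
static long simulate_local_congestion(bool restamp_on_every_attempt,
                                      unsigned int user_timeout_ms,
                                      unsigned int max_ms)
{
    bool stamped = false;
    unsigned int start_ts = 0;

    /* Keep "now" from wrapping around, which would loop forever. */
    assert(max_ms <= UINT_MAX - RESOURCE_PROBE_INTERVAL_MS);

    for (unsigned int now = 0; now <= max_ms;
         now += RESOURCE_PROBE_INTERVAL_MS) {
        /* Probe timer: stamp on first firing, later compare elapsed time. */
        if (!stamped) {
            start_ts = now;
            stamped = true;
        } else if (now - start_ts > user_timeout_ms) {
            return (long)now;  /* USER_TIMEOUT exceeded: abort */
        }

        /* Transmit attempt; it fails, but may still refresh the stamp. */
        if (restamp_on_every_attempt)
            start_ts = now;
    }
    return -1;
}
```

With restamping, simulate_local_congestion(true, 2000, 60000) returns
-1. Only timeouts below 500ms ever fire: a 400ms timeout, for example,
returns 500. Without restamping, the 2000ms timeout fires at 2500.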

It seems we need a slightly different approach than the revert in this commit.

neal