Re: [PATCH RFC net-next 1/4] tcp: implement RFC 7323 window retraction receiver requirements
From: Stefano Brivio
Date: Mon Feb 23 2026 - 17:26:54 EST
Hi Simon,
It all makes sense to me at a quick look. I have just a few nits and
one more substantial worry, below:
On Fri, 20 Feb 2026 00:55:14 +0100
Simon Baatz via B4 Relay <devnull+gmbnomis.gmail.com@xxxxxxxxxx> wrote:
> From: Simon Baatz <gmbnomis@xxxxxxxxx>
>
> By default, the Linux TCP implementation does not shrink the
> advertised window (RFC 7323 calls this "window retraction") with the
> following exceptions:
>
> - When an incoming segment cannot be added due to the receive buffer
> running out of memory. Since commit 8c670bdfa58e ("tcp: correct
> handling of extreme memory squeeze") a zero window will be
> advertised in this case. It turns out that reaching the required
> "memory pressure" is very easy when window scaling is in use. In the
> simplest case, sending a sufficient number of segments smaller than
> the scale factor to a receiver that does not read data is enough.
>
> Since commit 1d2fbaad7cd8 ("tcp: stronger sk_rcvbuf checks") this
> happens much earlier than before, leading to regressions (the test
> suite of the Valkey project does not pass because of a TCP
> connection that is no longer bi-directional).
Ouch. By the way, that same commit helped us unveil an issue (at least
in the sense of RFC 9293, 3.8.6) we fixed in passt:
https://passt.top/passt/commit/?id=8d2f8c4d0fb58d6b2011e614bc7d7ff9dab406b3
> - Commit b650d953cd39 ("tcp: enforce receive buffer memory limits by
> allowing the tcp window to shrink") addressed the "eating memory"
> problem by introducing a sysctl knob that allows shrinking the
> window before running out of memory.
>
> However, RFC 7323 does not only state that shrinking the window is
> necessary in some cases, it also formulates requirements for TCP
> implementations when doing so (Section 2.4).
>
> This commit addresses the receiver-side requirements: After retracting
> the window, the peer may have a snd_nxt that lies within a previously
> advertised window but is now beyond the retracted window. This means
> that all incoming segments (including pure ACKs) will be rejected
> until the application happens to read enough data to let the peer's
> snd_nxt be in window again (which may be never).
>
> To comply with RFC 7323, the receiver MUST honor any segment that
> would have been in window for any ACK sent by the receiver and, when
> window scaling is in effect, SHOULD track the maximum window sequence
> number it has advertised. This patch tracks that maximum window
> sequence number throughout the connection and uses it in
> tcp_sequence() when deciding whether a segment is acceptable.
> Acceptability of data is not changed.
>
> Fixes: 8c670bdfa58e ("tcp: correct handling of extreme memory squeeze")
> Fixes: b650d953cd39 ("tcp: enforce receive buffer memory limits by allowing the tcp window to shrink")
> Signed-off-by: Simon Baatz <gmbnomis@xxxxxxxxx>
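To check that I'm reading the description correctly, here is how I
understand the receiver-side logic, as a minimal userspace sketch.
struct rx_model, seq_after(), advertise() and seq_acceptable() are names
made up for illustration; only the rcv_* field names mirror the patch.
The lower bound check against rcv_wup and the empty-receive-queue
special case in tcp_sequence() are left out:

```c
#include <stdbool.h>
#include <stdint.h>

/* Wrap-safe comparison, same idea as the kernel's after(). */
static bool seq_after(uint32_t a, uint32_t b)
{
	return (int32_t)(b - a) < 0;
}

/* Hypothetical userspace model of the receive-side state involved. */
struct rx_model {
	uint32_t rcv_nxt;	/* next sequence number expected */
	uint32_t rcv_wup;	/* rcv_nxt at the last window update sent */
	uint32_t rcv_wnd;	/* currently advertised window */
	uint32_t rcv_mwnd_seq;	/* highest right edge ever advertised */
};

/* Advertise a new window: the tracked right edge only moves forward. */
static void advertise(struct rx_model *m, uint32_t new_wnd)
{
	uint32_t wre;

	m->rcv_wnd = new_wnd;
	m->rcv_wup = m->rcv_nxt;

	wre = m->rcv_wup + m->rcv_wnd;
	if (seq_after(wre, m->rcv_mwnd_seq))
		m->rcv_mwnd_seq = wre;
}

/* Upper bound check against the maximum, not the current, right edge. */
static bool seq_acceptable(const struct rx_model *m, uint32_t seq)
{
	int32_t win = (int32_t)(m->rcv_mwnd_seq - m->rcv_nxt);
	uint32_t edge = m->rcv_nxt + (win < 0 ? 0 : (uint32_t)win);

	return !seq_after(seq, edge);
}
```

In this model, advertising 5000 bytes at rcv_nxt 1000 and later a zero
window at rcv_nxt 4000 leaves rcv_mwnd_seq at 6000, so a segment starting
at 5500 still passes the check while one starting at 6001 does not.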
> ---
> Documentation/networking/net_cachelines/tcp_sock.rst | 1 +
> include/linux/tcp.h | 1 +
> include/net/tcp.h | 14 ++++++++++++++
> net/ipv4/tcp_fastopen.c | 1 +
> net/ipv4/tcp_input.c | 6 ++++--
> net/ipv4/tcp_minisocks.c | 1 +
> net/ipv4/tcp_output.c | 12 ++++++++++++
> .../selftests/net/packetdrill/tcp_rcv_big_endseq.pkt | 2 +-
> 8 files changed, 35 insertions(+), 3 deletions(-)
>
> diff --git a/Documentation/networking/net_cachelines/tcp_sock.rst b/Documentation/networking/net_cachelines/tcp_sock.rst
> index 563daea10d6c5c074f004cb1b8574f5392157abb..fecf61166a54ee2f64bcef5312c81dcc4aa9a124 100644
> --- a/Documentation/networking/net_cachelines/tcp_sock.rst
> +++ b/Documentation/networking/net_cachelines/tcp_sock.rst
> @@ -121,6 +121,7 @@ u64 delivered_mstamp read_write
> u32 rate_delivered read_mostly tcp_rate_gen
> u32 rate_interval_us read_mostly rate_delivered,rate_app_limited
> u32 rcv_wnd read_write read_mostly tcp_select_window,tcp_receive_window,tcp_fast_path_check
> +u32 rcv_mwnd_seq read_write tcp_select_window
> u32 write_seq read_write tcp_rate_check_app_limited,tcp_write_queue_empty,tcp_skb_entail,forced_push,tcp_mark_push
> u32 notsent_lowat read_mostly tcp_stream_memory_free
> u32 pushed_seq read_write tcp_mark_push,forced_push
> diff --git a/include/linux/tcp.h b/include/linux/tcp.h
> index f72eef31fa23cc584f2f0cefacdc35cae43aa52d..5a943b12d4c050a980b4cf81635b9fa2f0036283 100644
> --- a/include/linux/tcp.h
> +++ b/include/linux/tcp.h
> @@ -271,6 +271,7 @@ struct tcp_sock {
> u32 lsndtime; /* timestamp of last sent data packet (for restart window) */
> u32 mdev_us; /* medium deviation */
> u32 rtt_seq; /* sequence number to update rttvar */
> + u32 rcv_mwnd_seq; /* Maximum window sequence number (RFC 7323, section 2.4) */
Nit: tab between ; and /* for consistency (I would personally prefer
the comment style as you see on 'highest_sack' but I don't think it's
enforced anymore).
Second nit: mentioning RFC 7323, section 2.4 could be a bit misleading
here because the relevant paragraph there covers a very specific case of
window retraction, caused by quantisation error from window scaling,
which is not the most common case here. I couldn't quickly find a better
reference though.
More importantly: do we need to restore this on a connection that's
being dumped and recreated using TCP_REPAIR, or will things still work
(even though sub-optimally) if we lose this value?
Other window values that *need* to be dumped and restored are currently
available via the TCP_REPAIR_WINDOW socket option, and they are listed
in do_tcp_getsockopt(), net/ipv4/tcp.c:
opt.snd_wl1 = tp->snd_wl1;
opt.snd_wnd = tp->snd_wnd;
opt.max_window = tp->max_window;
opt.rcv_wnd = tp->rcv_wnd;
opt.rcv_wup = tp->rcv_wup;
CRIU uses it to checkpoint and restore established connections, and
passt uses it to migrate them to a different host:
https://criu.org/TCP_connection
https://passt.top/passt/tree/tcp.c?id=02af38d4177550c086bae54246fc3aaa33ddc018#n3063
If it's strictly needed to preserve functionality, we would need to add
it to struct tcp_repair_window, notify CRIU maintainers (or send them a
patch), and add this in passt as well (I can take care of it).
Strictly speaking, in that case, this could be considered a breaking
change for userspace, but I don't see how to avoid it, so I'd just make
sure it doesn't impact users, as TCP_REPAIR has just a couple of
(known!) projects relying on it.
An alternative would be to have a special, initial value representing
the fact that this value was lost, but it looks really annoying to not
be able to use a u32 for it.
Disregard all this if the correct value is not strictly needed for
functionality, of course. I haven't tested things (not yet, at least).
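For the sake of discussion, a minimal sketch of the two restore paths.
Everything here is hypothetical: struct repair_window_v2, its
rcv_mwnd_seq and has_mwnd_seq members, and restore_mwnd_seq() are not
existing kernel or uAPI names. The fallback to rcv_wup + rcv_wnd is an
assumption on my side, mirroring how this patch initialises the field at
connection setup:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical extension of what TCP_REPAIR_WINDOW carries today. */
struct repair_window_v2 {
	uint32_t snd_wl1;
	uint32_t snd_wnd;
	uint32_t max_window;
	uint32_t rcv_wnd;
	uint32_t rcv_wup;
	uint32_t rcv_mwnd_seq;	/* new: only meaningful if has_mwnd_seq */
	bool has_mwnd_seq;	/* explicit flag: any u32 is a valid sequence */
};

/* Pick the value to restore, falling back if the dump did not carry it. */
static uint32_t restore_mwnd_seq(const struct repair_window_v2 *w)
{
	if (w->has_mwnd_seq)
		return w->rcv_mwnd_seq;

	/* Value lost: use the current right edge of the window. */
	return w->rcv_wup + w->rcv_wnd;
}
```

With the fallback, the acceptance check right after restore would use
the same right edge as the current code does, so segments between the
current and the previously advertised edge would be rejected again until
the window reopens.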
> u64 tcp_wstamp_ns; /* departure time for next sent data packet */
> u64 accecn_opt_tstamp; /* Last AccECN option sent timestamp */
> struct list_head tsorted_sent_queue; /* time-sorted sent but un-SACKed skbs */
> diff --git a/include/net/tcp.h b/include/net/tcp.h
> index 40e72b9cb85f08714d3f458c0bd1402a5fb1eb4e..e1944d504823d5f8754d85bfbbf3c9630d2190ac 100644
> --- a/include/net/tcp.h
> +++ b/include/net/tcp.h
> @@ -912,6 +912,20 @@ static inline u32 tcp_receive_window(const struct tcp_sock *tp)
> return (u32) win;
> }
>
> +/* Compute the maximum receive window we ever advertised.
> + * Rcv_nxt can be after the window if our peer push more data
s/push/pushes/
s/Rcv_nxt/rcv_nxt/ (useful for grepping)
> + * than the offered window.
> + */
> +static inline u32 tcp_max_receive_window(const struct tcp_sock *tp)
> +{
> + s32 win = tp->rcv_mwnd_seq - tp->rcv_nxt;
> +
> + if (win < 0)
> + win = 0;
I must be missing something but... if the sequence is about to wrap,
we'll return 0 here. Is that intended?
Doing the subtraction unsigned would have looked more natural to me,
but I didn't really think it through.
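For reference, a small userspace model of the two variants.
max_receive_window() and max_receive_window_unsigned() are stand-ins I
made up, taking the two fields as plain arguments. As far as I can tell,
the signed version survives a wrap because the u32 subtraction happens
modulo 2^32 before the result is interpreted as signed, while a plain
unsigned result would turn the "rcv_nxt beyond the window" case into a
huge window:

```c
#include <stdint.h>

/* Signed variant, as in the patch: negative distances clamp to zero. */
static uint32_t max_receive_window(uint32_t rcv_mwnd_seq, uint32_t rcv_nxt)
{
	/* The cast relies on two's complement conversion, as kernel code does. */
	int32_t win = (int32_t)(rcv_mwnd_seq - rcv_nxt);

	return win < 0 ? 0 : (uint32_t)win;
}

/* Unsigned variant: no clamping, the difference wraps around instead. */
static uint32_t max_receive_window_unsigned(uint32_t rcv_mwnd_seq,
					    uint32_t rcv_nxt)
{
	return rcv_mwnd_seq - rcv_nxt;
}
```

With rcv_nxt = 0xffffff00 and rcv_mwnd_seq = 0x00000100, both variants
give 0x200. With rcv_nxt = 2000 and rcv_mwnd_seq = 1000, the signed
variant gives 0 and the unsigned one gives 0xfffffc18.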
> + return (u32) win;
Kernel coding style doesn't usually include a space between cast and
identifier.
> +}
> +
> +
> /* Choose a new window, without checks for shrinking, and without
> * scaling applied to the result. The caller does these things
> * if necessary. This is a "raw" window selection.
> diff --git a/net/ipv4/tcp_fastopen.c b/net/ipv4/tcp_fastopen.c
> index b30090cff3cf7d925dc46694860abd3ca5516d70..f034ef6e3e7b54bf73c77fd2bf1d3090c75dbfc6 100644
> --- a/net/ipv4/tcp_fastopen.c
> +++ b/net/ipv4/tcp_fastopen.c
> @@ -377,6 +377,7 @@ static struct sock *tcp_fastopen_create_child(struct sock *sk,
>
> tcp_rsk(req)->rcv_nxt = tp->rcv_nxt;
> tp->rcv_wup = tp->rcv_nxt;
> + tp->rcv_mwnd_seq = tp->rcv_wup + tp->rcv_wnd;
> /* tcp_conn_request() is sending the SYNACK,
> * and queues the child into listener accept queue.
> */
> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> index e7b41abb82aad33d8cab4fcfa989cc4771149b41..af9dd51256b01fd31d9e390d69dcb1d1700daf1b 100644
> --- a/net/ipv4/tcp_input.c
> +++ b/net/ipv4/tcp_input.c
> @@ -4865,8 +4865,8 @@ static enum skb_drop_reason tcp_sequence(const struct sock *sk,
> if (before(end_seq, tp->rcv_wup))
> return SKB_DROP_REASON_TCP_OLD_SEQUENCE;
>
> - if (after(end_seq, tp->rcv_nxt + tcp_receive_window(tp))) {
> - if (after(seq, tp->rcv_nxt + tcp_receive_window(tp)))
> + if (after(end_seq, tp->rcv_nxt + tcp_max_receive_window(tp))) {
> + if (after(seq, tp->rcv_nxt + tcp_max_receive_window(tp)))
> return SKB_DROP_REASON_TCP_INVALID_SEQUENCE;
>
> /* Only accept this packet if receive queue is empty. */
> @@ -6959,6 +6959,7 @@ static int tcp_rcv_synsent_state_process(struct sock *sk, struct sk_buff *skb,
> */
> WRITE_ONCE(tp->rcv_nxt, TCP_SKB_CB(skb)->seq + 1);
> tp->rcv_wup = TCP_SKB_CB(skb)->seq + 1;
> + tp->rcv_mwnd_seq = tp->rcv_wup + tp->rcv_wnd;
>
> /* RFC1323: The window in SYN & SYN/ACK segments is
> * never scaled.
> @@ -7071,6 +7072,7 @@ static int tcp_rcv_synsent_state_process(struct sock *sk, struct sk_buff *skb,
> WRITE_ONCE(tp->rcv_nxt, TCP_SKB_CB(skb)->seq + 1);
> WRITE_ONCE(tp->copied_seq, tp->rcv_nxt);
> tp->rcv_wup = TCP_SKB_CB(skb)->seq + 1;
> + tp->rcv_mwnd_seq = tp->rcv_wup + tp->rcv_wnd;
>
> /* RFC1323: The window in SYN & SYN/ACK segments is
> * never scaled.
> diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
> index ec128865f5c029c971eb00cb9ee058b742efafd1..df95d8b6dce5c746e5e34545aa75a96080cc752d 100644
> --- a/net/ipv4/tcp_minisocks.c
> +++ b/net/ipv4/tcp_minisocks.c
> @@ -604,6 +604,7 @@ struct sock *tcp_create_openreq_child(const struct sock *sk,
> newtp->window_clamp = req->rsk_window_clamp;
> newtp->rcv_ssthresh = req->rsk_rcv_wnd;
> newtp->rcv_wnd = req->rsk_rcv_wnd;
> + newtp->rcv_mwnd_seq = newtp->rcv_wup + req->rsk_rcv_wnd;
> newtp->rx_opt.wscale_ok = ireq->wscale_ok;
> if (newtp->rx_opt.wscale_ok) {
> newtp->rx_opt.snd_wscale = ireq->snd_wscale;
> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> index 326b58ff1118d02fc396753d56f210f9d3007c7f..50774443f6ae0ca83f360c7fc3239184a1523e1b 100644
> --- a/net/ipv4/tcp_output.c
> +++ b/net/ipv4/tcp_output.c
> @@ -274,6 +274,15 @@ void tcp_select_initial_window(const struct sock *sk, int __space, __u32 mss,
> }
> EXPORT_IPV6_MOD(tcp_select_initial_window);
>
> +/* Check if we need to update the maximum window sequence number */
> +static inline void tcp_update_max_wnd_seq(struct tcp_sock *tp)
> +{
> + u32 wre = tp->rcv_wup + tp->rcv_wnd;
> +
> + if (after(wre, tp->rcv_mwnd_seq))
> + tp->rcv_mwnd_seq = wre;
> +}
> +
> /* Chose a new window to advertise, update state in tcp_sock for the
> * socket, and return result with RFC1323 scaling applied. The return
> * value can be stuffed directly into th->window for an outgoing
> @@ -293,6 +302,7 @@ static u16 tcp_select_window(struct sock *sk)
> tp->pred_flags = 0;
> tp->rcv_wnd = 0;
> tp->rcv_wup = tp->rcv_nxt;
> + tcp_update_max_wnd_seq(tp);
> return 0;
> }
>
> @@ -316,6 +326,7 @@ static u16 tcp_select_window(struct sock *sk)
>
> tp->rcv_wnd = new_win;
> tp->rcv_wup = tp->rcv_nxt;
> + tcp_update_max_wnd_seq(tp);
>
> /* Make sure we do not exceed the maximum possible
> * scaled window.
> @@ -4169,6 +4180,7 @@ static void tcp_connect_init(struct sock *sk)
> else
> tp->rcv_tstamp = tcp_jiffies32;
> tp->rcv_wup = tp->rcv_nxt;
> + tp->rcv_mwnd_seq = tp->rcv_nxt + tp->rcv_wnd;
> WRITE_ONCE(tp->copied_seq, tp->rcv_nxt);
>
> inet_csk(sk)->icsk_rto = tcp_timeout_init(sk);
> diff --git a/tools/testing/selftests/net/packetdrill/tcp_rcv_big_endseq.pkt b/tools/testing/selftests/net/packetdrill/tcp_rcv_big_endseq.pkt
> index 3848b419e68c3fc895ad736d06373fc32f3691c1..1a86ee5093696deb316c532ca8f7de2bbf5cd8ea 100644
> --- a/tools/testing/selftests/net/packetdrill/tcp_rcv_big_endseq.pkt
> +++ b/tools/testing/selftests/net/packetdrill/tcp_rcv_big_endseq.pkt
> @@ -36,7 +36,7 @@
>
> +0 read(4, ..., 100000) = 4000
>
> -// If queue is empty, accept a packet even if its end_seq is above wup + rcv_wnd
> +// If queue is empty, accept a packet even if its end_seq is above rcv_mwnd_seq
> +0 < P. 4001:54001(50000) ack 1 win 257
> +0 > . 1:1(0) ack 54001 win 0
>
>
--
Stefano