Re: [PATCH RFC net-next 1/4] tcp: implement RFC 7323 window retraction receiver requirements

From: Stefano Brivio

Date: Wed Feb 25 2026 - 16:33:59 EST


On Tue, 24 Feb 2026 19:07:45 +0100
Simon Baatz <gmbnomis@xxxxxxxxx> wrote:

> Hi Stefano,
>
> On Mon, Feb 23, 2026 at 11:26:40PM +0100, Stefano Brivio wrote:
> > Hi Simon,
> >
> > It all makes sense to me at a quick look, I have just some nits and one
> > more substantial worry, below:
> >
> > On Fri, 20 Feb 2026 00:55:14 +0100
> > Simon Baatz via B4 Relay <devnull+gmbnomis.gmail.com@xxxxxxxxxx> wrote:
> >
> > > From: Simon Baatz <gmbnomis@xxxxxxxxx>
> > >
> > > By default, the Linux TCP implementation does not shrink the
> > > advertised window (RFC 7323 calls this "window retraction") with the
> > > following exceptions:
> > >
> > > - When an incoming segment cannot be added due to the receive buffer
> > > running out of memory. Since commit 8c670bdfa58e ("tcp: correct
> > > handling of extreme memory squeeze") a zero window will be
> > > advertised in this case. It turns out that reaching the required
> > > "memory pressure" is very easy when window scaling is in use. In the
> > > simplest case, sending a sufficient number of segments smaller than
> > > the scale factor to a receiver that does not read data is enough.
> > >
> > > Since commit 1d2fbaad7cd8 ("tcp: stronger sk_rcvbuf checks") this
> > > happens much earlier than before, leading to regressions (the test
> > > suite of the Valkey project does not pass because of a TCP
> > > connection that is no longer bi-directional).
> >
> > Ouch. By the way, that same commit helped us unveil an issue (at least
> > in the sense of RFC 9293, 3.8.6) we fixed in passt:
> >
> > https://passt.top/passt/commit/?id=8d2f8c4d0fb58d6b2011e614bc7d7ff9dab406b3
>
> This looks concerning: It seems as if just filling the advertised
> window triggered the out of memory condition(?).

Right, even if it's not so much a general "out of memory" condition:
it's just that the socket might simply refuse to queue more data at
that point (we run out of window space, rather than memory).

Together with commit e2142825c120 ("net: tcp: send zero-window ACK when
no memory"), we will even get zero-window updates in that case. Jon
raised the issue here:

https://lore.kernel.org/r/20240406182107.261472-3-jmaloy@xxxxxxxxxx/

but it was not really fixed. Anyway:

> Am I right in
> assuming that this happened with the original 1d2fbaad7cd8, not the
> relaxed version of tcp_can_ingest() from f017c1f768b?

...you're right. I wasn't even aware of f017c1f768b, thanks for
pointing that out. That seems to make things saner, and I don't expect
further issues at this point.

Speaking of which, passt also struggled talking to applications written
entirely in the 21st century. That's socat (started in 2001, I believe)
as used in Podman tests, and its only SO_RCVBUF-related fault is that
it uses the default 208 KiB value (from rmem_default) as a starting
value by... not doing anything.

Applications can set SO_RCVBUF and SO_SNDBUF to bigger values
(depending on rmem_max and wmem_max), but if they do, automatic tuning
of TCP buffer sizes (which allows exceeding rmem_max and wmem_max!) is
disabled. We used to do that in passt itself, and I eventually dropped
it here:

https://passt.top/passt/commit/?id=71249ef3f9bcf1dbb2d6c13cdbc41ba88c794f06

because we might really need automatic tuning and the resulting big
buffers for high latency, high throughput connections.

> > > - Commit b650d953cd39 ("tcp: enforce receive buffer memory limits by
> > > allowing the tcp window to shrink") addressed the "eating memory"
> > > problem by introducing a sysctl knob that allows shrinking the
> > > window before running out of memory.
> > >
> > > However, RFC 7323 does not only state that shrinking the window is
> > > necessary in some cases, it also formulates requirements for TCP
> > > implementations when doing so (Section 2.4).
> > >
> > > This commit addresses the receiver-side requirements: After retracting
> > > the window, the peer may have a snd_nxt that lies within a previously
> > > advertised window but is now beyond the retracted window. This means
> > > that all incoming segments (including pure ACKs) will be rejected
> > > until the application happens to read enough data to let the peer's
> > > snd_nxt be in window again (which may be never).
> > >
> > > To comply with RFC 7323, the receiver MUST honor any segment that
> > > would have been in window for any ACK sent by the receiver and, when
> > > window scaling is in effect, SHOULD track the maximum window sequence
> > > number it has advertised. This patch tracks that maximum window
> > > sequence number throughout the connection and uses it in
> > > tcp_sequence() when deciding whether a segment is acceptable.
> > > Acceptability of data is not changed.
> > >
> > > Fixes: 8c670bdfa58e ("tcp: correct handling of extreme memory squeeze")
> > > Fixes: b650d953cd39 ("tcp: enforce receive buffer memory limits by allowing the tcp window to shrink")
> > > Signed-off-by: Simon Baatz <gmbnomis@xxxxxxxxx>
> > > ---
> > > Documentation/networking/net_cachelines/tcp_sock.rst | 1 +
> > > include/linux/tcp.h | 1 +
> > > include/net/tcp.h | 14 ++++++++++++++
> > > net/ipv4/tcp_fastopen.c | 1 +
> > > net/ipv4/tcp_input.c | 6 ++++--
> > > net/ipv4/tcp_minisocks.c | 1 +
> > > net/ipv4/tcp_output.c | 12 ++++++++++++
> > > .../selftests/net/packetdrill/tcp_rcv_big_endseq.pkt | 2 +-
> > > 8 files changed, 35 insertions(+), 3 deletions(-)
> > >
> > > diff --git a/Documentation/networking/net_cachelines/tcp_sock.rst b/Documentation/networking/net_cachelines/tcp_sock.rst
> > > index 563daea10d6c5c074f004cb1b8574f5392157abb..fecf61166a54ee2f64bcef5312c81dcc4aa9a124 100644
> > > --- a/Documentation/networking/net_cachelines/tcp_sock.rst
> > > +++ b/Documentation/networking/net_cachelines/tcp_sock.rst
> > > @@ -121,6 +121,7 @@ u64 delivered_mstamp read_write
> > > u32 rate_delivered read_mostly tcp_rate_gen
> > > u32 rate_interval_us read_mostly rate_delivered,rate_app_limited
> > > u32 rcv_wnd read_write read_mostly tcp_select_window,tcp_receive_window,tcp_fast_path_check
> > > +u32 rcv_mwnd_seq read_write tcp_select_window
> > > u32 write_seq read_write tcp_rate_check_app_limited,tcp_write_queue_empty,tcp_skb_entail,forced_push,tcp_mark_push
> > > u32 notsent_lowat read_mostly tcp_stream_memory_free
> > > u32 pushed_seq read_write tcp_mark_push,forced_push
> > > diff --git a/include/linux/tcp.h b/include/linux/tcp.h
> > > index f72eef31fa23cc584f2f0cefacdc35cae43aa52d..5a943b12d4c050a980b4cf81635b9fa2f0036283 100644
> > > --- a/include/linux/tcp.h
> > > +++ b/include/linux/tcp.h
> > > @@ -271,6 +271,7 @@ struct tcp_sock {
> > > u32 lsndtime; /* timestamp of last sent data packet (for restart window) */
> > > u32 mdev_us; /* medium deviation */
> > > u32 rtt_seq; /* sequence number to update rttvar */
> > > + u32 rcv_mwnd_seq; /* Maximum window sequence number (RFC 7323, section 2.4) */
> >
> > Nit: tab between ; and /* for consistency (I would personally prefer
> > the comment style as you see on 'highest_sack' but I don't think it's
> > enforced anymore).
>
> Thanks, I missed that.
>
> > Second nit: mentioning RFC 7323, section 2.4 could be a bit misleading
> > here because the relevant paragraph there covers a very specific case of
> > window retraction, caused by quantisation error from window scaling,
> > which is not the most common case here. I couldn't quickly find a better
> > reference though.
>
> I agree, but there is a part that, I think, is more generally
> applicable:
>
> 2.4. Addressing Window Retraction
>
> [ specific window retraction case introduction removed ]
> ... Implementations MUST ensure that they handle a shrinking
> window, as specified in Section 4.2.2.16 of [RFC1122].
>
> For the receiver, this implies that:
>
> 1) The receiver MUST honor, as in window, any segment that would
> have been in window for any <ACK> sent by the receiver.
>
> 2) When window scaling is in effect, the receiver SHOULD track the
> actual maximum window sequence number (which is likely to be
> greater than the window announced by the most recent <ACK>, if
> more than one segment has arrived since the application consumed
> any data in the receive buffer).
>
> There is no "When window scaling is in effect," on the first
> requirement. And it "happens" to be implementable by the second
> requirement (with or without window scaling).

Right, I saw that, but the first requirement doesn't mention the
"actual maximum sequence number" which this new field represents.

> I think an improvement could be to refer to the receiver requirements
> specifically here.

Ah, yes, that sounds like a good idea.

> > More importantly: do we need to restore this on a connection that's
> > being dumped and recreated using TCP_REPAIR, or will things still work
> > (even though sub-optimally) if we lose this value?
> >
> > Other window values that *need* to be dumped and restored are currently
> > available via TCP_REPAIR_WINDOW socket option, and they are listed in
> > do_tcp_getsockopt(), net/ipv4/tcp.c:
> >
> > opt.snd_wl1 = tp->snd_wl1;
> > opt.snd_wnd = tp->snd_wnd;
> > opt.max_window = tp->max_window;
> > opt.rcv_wnd = tp->rcv_wnd;
> > opt.rcv_wup = tp->rcv_wup;
> >
> > CRIU uses it to checkpoint and restore established connections, and
> > passt uses it to migrate them to a different host:
> >
> > https://criu.org/TCP_connection
> >
> > https://passt.top/passt/tree/tcp.c?id=02af38d4177550c086bae54246fc3aaa33ddc018#n3063
> >
> > If it's strictly needed to preserve functionality, we would need to add
> > it to struct tcp_repair_window, notify CRIU maintainers (or send them a
> > patch), and add this in passt as well (I can take care of it).
>
> Thanks for the pointer, I missed that tp->rcv_wnd update. Could the
> following happen when checkpointing/restoring?
>
> 1. A client app opens a connection and writes (blocking) a specific amount
> of data before doing any reads. (Not very clever, but this is
> supposed to work; this is what caused the problem in the Valkey
> tests.)
> 2. The traffic pattern causes an out-of-memory condition for the
> receive buffer; we see the RWIN 0 segments that do not ack the
> last data segment(s).
> 3. TCP connection is checkpointed and restored (on the client side) without
> restoring rcv_mwnd_seq.
> 4. If the receive buffer is still full at the new location, the
> acceptable sequence numbers in the receive window will not change
> (restored client is still blocked on write) and we no longer have
> the larger max receive window -> the client's kernel will reject
> all incoming packets and the connection is stuck.
>
> If this scenario is possible, I'd argue that rcv_mwnd_seq is
> necessary.

It really sounds like a corner case, especially 1. in combination with
2., but the outcome would be pretty bad, and I think it's possible.

Typically, once the connection is restored (with TCP_REPAIR_OFF, not
with TCP_REPAIR_OFF_NO_WP), the kernel sends out an empty segment as a
window probe / keepalive, but as far as I understand that wouldn't be
enough to fix the situation. And even if it did, we still have the
TCP_REPAIR_OFF_NO_WP case, even though I'm not aware of any usage.

> > Strictly speaking, in case, this could be considered a breaking change
> > for userspace, but I don't see how to avoid it, so I'd just make sure
> > it doesn't impact users as TCP_REPAIR has just a couple of (known!)
> > projects relying on it.
> >
> > An alternative would be to have a special, initial value representing
> > the fact that this value was lost, but it looks really annoying to not
> > be able to use a u32 for it.
>
> Do we need a dedicated value indicating that rcv_mwnd_seq is not
> present, or is it enough to choose an initial rcv_mwnd_seq based on
> the size of the struct passed? Both seem doable to me:
>
> Missing: Initialize rcv_mwnd_seq = rcv_wup + rcv_wnd (possibly
> leading to the problem described above, of course)

Well, if we might run into the problem described above, we'd need to
dump / restore rcv_mwnd_seq in any case, so we wouldn't have that
issue at all.

Except for a compatibility issue, but what you describe looks like a
reasonable fallback.

> Default value 0: Store how much we retracted the window, i.e.
> rcv_mwnd_seq - (rcv_wup + rcv_wnd). 0 means the window was not
> retracted and could double as the "we don't know" value.
>
> For the time being, I will just initialize rcv_mwnd_seq to rcv_wup +
> rcv_wnd in tcp_repair_set_window() to keep status quo. Of course,
> I am happy to discuss enhancements.

That makes sense to me at a glance, but I should still review / test it
as a whole.

> > Disregard all this if the correct value is not strictly needed for
> > functionality, of course. I haven't tested things (not yet, at least).
> >
> > > u64 tcp_wstamp_ns; /* departure time for next sent data packet */
> > > u64 accecn_opt_tstamp; /* Last AccECN option sent timestamp */
> > > struct list_head tsorted_sent_queue; /* time-sorted sent but un-SACKed skbs */
> > > diff --git a/include/net/tcp.h b/include/net/tcp.h
> > > index 40e72b9cb85f08714d3f458c0bd1402a5fb1eb4e..e1944d504823d5f8754d85bfbbf3c9630d2190ac 100644
> > > --- a/include/net/tcp.h
> > > +++ b/include/net/tcp.h
> > > @@ -912,6 +912,20 @@ static inline u32 tcp_receive_window(const struct tcp_sock *tp)
> > > return (u32) win;
> > > }
> > >
> > > +/* Compute the maximum receive window we ever advertised.
> > > + * Rcv_nxt can be after the window if our peer push more data
> >
> > s/push/pushes/
> >
> > s/Rcv_nxt/rcv_nxt/ (useful for grepping)
>
> tcp_max_receive_window() is an adapted copy of
> tcp_receive_window() above. But it makes sense to improve it.

Ah, sorry, I didn't notice.

> > > + * than the offered window.
> > > + */
> > > +static inline u32 tcp_max_receive_window(const struct tcp_sock *tp)
> > > +{
> > > + s32 win = tp->rcv_mwnd_seq - tp->rcv_nxt;
> > > +
> > > + if (win < 0)
> > > + win = 0;
> >
> > I must be missing something but... if the sequence is about to wrap,
> > we'll return 0 here. Is that intended?
> >
> > Doing the subtraction unsigned would have looked more natural to me,
> > but I didn't really think it through.
>
> The subtraction is unsigned and the outcome is interpreted as
> signed. And as mentioned, it is copied with pride ;-)

Oh, wow, I mean, "of course"! How could anybody ever miss that! Pride,
you say. :) ...but sure, if it's taken from there, it makes sense to
keep it like that I guess.

> > > + return (u32) win;
> >
> > Kernel coding style doesn't usually include a space between cast and
> > identifier.
>
> Yes, same reason as above and I will change it.

--
Stefano