Re: [RFC PATCH net-next] tcp: Add net.ipv4.tcp_purge_receive_queue sysctl

From: Eric Dumazet

Date: Wed Feb 25 2026 - 03:32:27 EST


On Wed, Feb 25, 2026 at 8:46 AM Leon Hwang <leon.huangfu@xxxxxxxxxx> wrote:
>
> Introduce a new sysctl knob, net.ipv4.tcp_purge_receive_queue, to
> address a memory leak scenario related to TCP sockets.

We use the term "memory leak" for a persistent loss of memory (until reboot)

Lets not abuse it and confuse various AI/human agents which will
declare emergency situations
caused by an inexistent fatal error.

>
> Issue:
> When a TCP socket in the CLOSE_WAIT state receives a RST packet, the
> current implementation does not clear the socket's receive queue. This
> causes SKBs in the queue to remain allocated until the socket is
> explicitly closed by the application. As a consequence:
>
> 1. The page pool pages held by these SKBs are not released.

This situation also applies for normal TCP_ESTABLISHED sockets, when
applications
do not drain the receive queue.

As long the application has not called close(), kernel should not
assume the application
will _not_ read the data that was received.


> 2. The associated page pool cannot be freed.
>
> RFC 9293 Section 3.10.7.4 specifies that when a RST is received in
> CLOSE_WAIT state, "all segment queues should be flushed." However, the
> current implementation does not flush the receive queue.

Some buggy stacks send RST anyway after FIN. I think that forcingly
purging good data
received before the RST would add many surprises.

>
> Solution:
> Add a per-namespace sysctl (net.ipv4.tcp_purge_receive_queue) that,
> when enabled, causes the kernel to purge the receive queue when a RST
> packet is received in CLOSE_WAIT state. This allows immediate release
> of SKBs and their associated memory resources.
>
> The feature is disabled by default to maintain backward compatibility
> with existing behavior.
>
> Signed-off-by: Leon Hwang <leon.huangfu@xxxxxxxxxx>
> ---
> Documentation/networking/ip-sysctl.rst | 18 ++++++++++++++++++
> .../net_cachelines/netns_ipv4_sysctl.rst | 1 +
> include/net/netns/ipv4.h | 1 +
> net/ipv4/sysctl_net_ipv4.c | 9 +++++++++
> net/ipv4/tcp_input.c | 16 ++++++++++++++++
> 5 files changed, 45 insertions(+)
>
> diff --git a/Documentation/networking/ip-sysctl.rst b/Documentation/networking/ip-sysctl.rst
> index d1eeb5323af0..71a529462baa 100644
> --- a/Documentation/networking/ip-sysctl.rst
> +++ b/Documentation/networking/ip-sysctl.rst
> @@ -1441,6 +1441,24 @@ tcp_rto_max_ms - INTEGER
>
> Default: 120,000
>
> +tcp_purge_receive_queue - BOOLEAN
> + When a socket in the TCP_CLOSE_WAIT state receives a RST packet, the
> + default behavior is to not clear its receive queue. As a result,
> + any SKBs in the queue are not freed until the socket is closed.
> + Consequently, the pages held by these SKBs are not released, which
> + can also prevent the associated page pool from being freed.
> +
> + If enabled, the receive queue is purged upon receiving the RST,
> + allowing the SKBs and their associated memory to be released
> + promptly.
> +
> + Possible values:
> +
> + - 0 (disabled)
> + - 1 (enabled)
> +
> + Default: 0 (disabled)
> +
> UDP variables
> =============
>
> diff --git a/Documentation/networking/net_cachelines/netns_ipv4_sysctl.rst b/Documentation/networking/net_cachelines/netns_ipv4_sysctl.rst
> index beaf1880a19b..f2c42e7d84a9 100644
> --- a/Documentation/networking/net_cachelines/netns_ipv4_sysctl.rst
> +++ b/Documentation/networking/net_cachelines/netns_ipv4_sysctl.rst
> @@ -123,6 +123,7 @@ unsigned_long sysctl_tcp_comp_sack_delay_ns
> unsigned_long sysctl_tcp_comp_sack_slack_ns __tcp_ack_snd_check
> int sysctl_max_syn_backlog
> int sysctl_tcp_fastopen
> +u8 sysctl_tcp_purge_receive_queue
> struct_tcp_congestion_ops tcp_congestion_control init_cc
> struct_tcp_fastopen_context tcp_fastopen_ctx
> unsigned_int sysctl_tcp_fastopen_blackhole_timeout
> diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h
> index 8e971c7bf164..ab973f30f502 100644
> --- a/include/net/netns/ipv4.h
> +++ b/include/net/netns/ipv4.h
> @@ -220,6 +220,7 @@ struct netns_ipv4 {
> u8 sysctl_tcp_nometrics_save;
> u8 sysctl_tcp_no_ssthresh_metrics_save;
> u8 sysctl_tcp_workaround_signed_windows;
> + u8 sysctl_tcp_purge_receive_queue;
> int sysctl_tcp_challenge_ack_limit;
> u8 sysctl_tcp_min_tso_segs;
> u8 sysctl_tcp_reflect_tos;
> diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
> index 643763bc2142..da30970bb5d5 100644
> --- a/net/ipv4/sysctl_net_ipv4.c
> +++ b/net/ipv4/sysctl_net_ipv4.c
> @@ -1641,6 +1641,15 @@ static struct ctl_table ipv4_net_table[] = {
> .extra1 = SYSCTL_ONE_THOUSAND,
> .extra2 = &tcp_rto_max_max,
> },
> + {
> + .procname = "tcp_purge_receive_queue",
> + .data = &init_net.ipv4.sysctl_tcp_purge_receive_queue,
> + .maxlen = sizeof(u8),
> + .mode = 0644,
> + .proc_handler = proc_dou8vec_minmax,
> + .extra1 = SYSCTL_ZERO,
> + .extra2 = SYSCTL_ONE,
> + },
> };
>
> static __net_init int ipv4_sysctl_init_net(struct net *net)
> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> index 6c3f1d031444..43f32fb5831d 100644
> --- a/net/ipv4/tcp_input.c
> +++ b/net/ipv4/tcp_input.c
> @@ -4895,6 +4895,7 @@ EXPORT_IPV6_MOD(tcp_done_with_error);
> /* When we get a reset we do this. */
> void tcp_reset(struct sock *sk, struct sk_buff *skb)
> {
> + const struct net *net = sock_net(sk);
> int err;
>
> trace_tcp_receive_reset(sk);
> @@ -4911,6 +4912,21 @@ void tcp_reset(struct sock *sk, struct sk_buff *skb)
> err = ECONNREFUSED;
> break;
> case TCP_CLOSE_WAIT:
> + /* RFC9293 3.10.7.4. Other States
> + * Second, check the RST bit:
> + * CLOSE-WAIT STATE
> + *
> + * If the RST bit is set, then any outstanding RECEIVEs and
> + * SEND should receive "reset" responses. All segment queues
> + * should be flushed. Users should also receive an unsolicited
> + * general "connection reset" signal. Enter the CLOSED state,
> + * delete the TCB, and return.
> + *
> + * If net.ipv4.tcp_purge_receive_queue is enabled,
> + * sk_receive_queue will be flushed too.
> + */
> + if (unlikely(net->ipv4.sysctl_tcp_purge_receive_queue))
> + skb_queue_purge(&sk->sk_receive_queue);
> err = EPIPE;
> break;
> case TCP_CLOSE:
> --
> 2.52.0
>

Please prepare a packetdrill test.