Re: [RFC PATCH net-next] tcp: Add net.ipv4.tcp_purge_receive_queue sysctl

From: Eric Dumazet

Date: Tue Mar 03 2026 - 04:02:00 EST

On Tue, Mar 3, 2026 at 9:54 AM Leon Hwang <leon.hwang@xxxxxxxxx> wrote:
>
>
>
> On 3/3/26 16:17, Eric Dumazet wrote:
> > On Tue, Mar 3, 2026 at 8:55 AM Leon Hwang <leon.hwang@xxxxxxxxx> wrote:
> >>
> >>
> >>
> >> On 3/3/26 14:26, Leon Hwang wrote:
> >>>
> >>>
> >>> On 3/3/26 11:55, Eric Dumazet wrote:
> >>>> On Tue, Mar 3, 2026 at 3:12 AM Leon Hwang <leon.hwang@xxxxxxxxx> wrote:
> >>>>>
> >>>>>
> >>>>>
> >>>>> On 3/3/26 08:22, Jakub Kicinski wrote:
> >>>>>> On Mon, 2 Mar 2026 17:55:59 +0800 Leon Hwang wrote:
> >>>>>>> On 26/2/26 09:43, Jakub Kicinski wrote:
> >>>>>>>> On Wed, 25 Feb 2026 15:46:33 +0800 Leon Hwang wrote:
> >>>>>>>>> Issue:
> >>>>>>>>> When a TCP socket in the CLOSE_WAIT state receives a RST packet, the
> >>>>>>>>> current implementation does not clear the socket's receive queue. This
> >>>>>>>>> causes SKBs in the queue to remain allocated until the socket is
> >>>>>>>>> explicitly closed by the application. As a consequence:
> >>>>>>>>>
> >>>>>>>>> 1. The page pool pages held by these SKBs are not released.
> >>>>>>>>
> >>>>>>>> On what kernel version and driver are you observing this?
> >>>>>>>
> >>>>>>> # uname -r
> >>>>>>> 6.19.0-061900-generic
> >>>>>>>
> >>>>>>> # ethtool -i eth0
> >>>>>>> driver: mlx5_core
> >>>>>>> version: 6.19.0-061900-generic
> >>>>>>> firmware-version: 26.43.2566 (MT_0000000531)
> >>>>>>
> >>>>>> Okay... this kernel + driver should just patiently wait for the page
> >>>>>> pool to go away.
> >>>>>>
> >>>>>> What is the actual, end user problem that you're trying to solve?
> >>>>>> A few kB of data waiting to be freed is not a huge problem..
> >>>>>
> >>>>> Yes, it is not a huge problem.
> >>>>>
> >>>>> The actual end-user issue was discussed in
> >>>>> "page_pool: Add page_pool_release_stalled tracepoint" [1].
> >>>>>
> >>>>> I think it would be useful to provide a way for SREs to purge the
> >>>>> receive queue when CLOSE_WAIT TCP sockets receive RST packets. If the
> >>>>> NIC, e.g., Mellanox, flaps, the underlying page pool and pages can be
> >>>>> released at the same time.
> >>>>>
> >>>>> Links:
> >>>>> [1]
> >>>>> https://lore.kernel.org/netdev/b676baa0-2044-4a74-900d-f471620f2896@xxxxxxxxx/
> >>>>
> >>>> Perhaps SRE could use this in an emergency?
> >>>>
> >>>> ss -t -a state close-wait -K
> >>>
> >>> This ss command is acceptable in an emergency.
> >>>
> >>
> >> However, once a CLOSE_WAIT TCP socket receives an RST packet, it
> >> transitions to the CLOSE state. A socket in the CLOSE state cannot be
> >> killed using the ss approach.
> >>
> >> The SKBs remain in the receive queue of the CLOSE socket until it is
> >> closed by the user-space application.
> >
> > Why does the user-space application not drain the receive queue?
> >
> > Is there a missing EPOLLIN or something?
>
> The user-space application uses a TCP connection pool. It establishes
> several TCP connections at startup and keeps them in the pool.
>
> However, the application does not always drain their receive queues.
> Instead, it selects one connection from the pool using a hash algorithm
> for communication with the TCP server. When it attempts to write data
> through a socket in the CLOSE state, it receives -EPIPE and then closes
> it. As a result, TCP connections whose underlying socket state is CLOSE
> may retain an SKB in their receive queues if they are not selected for
> communication.
>
> I proposed a solution to address this issue: close the TCP connection if
> the underlying sk_err is non-zero.

Okay, makes sense to fix the root cause. Applications can be fixed in
a matter of hours, while kernels can stick to hosts for years.
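
For illustration only, the application-side check described above (close
a pooled connection once its socket reports an error) could look roughly
like the sketch below. The helper name conn_should_close() is
hypothetical and is not taken from the application or from any patch in
this thread. It assumes the pool can run a cheap, non-blocking check on
idle connections, and it relies on poll() reporting POLLERR/POLLHUP and
on getsockopt(SO_ERROR) returning the pending socket error.

```c
#include <errno.h>
#include <poll.h>
#include <stdbool.h>
#include <sys/socket.h>

/*
 * Hypothetical helper: returns true if a pooled connection should be
 * closed instead of being kept idle. It never blocks, because the
 * poll() timeout is 0.
 */
static bool conn_should_close(int fd)
{
	struct pollfd pfd = { .fd = fd, .events = POLLIN };
	int err = 0;
	socklen_t len = sizeof(err);

	/* On a poll() failure other than EINTR, give up on the socket. */
	if (poll(&pfd, 1, 0) < 0)
		return errno != EINTR;

	/* POLLERR, POLLHUP and POLLNVAL are reported even if not requested. */
	if (pfd.revents & (POLLERR | POLLHUP | POLLNVAL))
		return true;

	/* SO_ERROR fetches and clears the pending socket error. */
	if (getsockopt(fd, SOL_SOCKET, SO_ERROR, &err, &len) < 0)
		return true;

	return err != 0;
}
```

A pool could call this on each idle connection from a periodic sweep and
close() every fd for which it returns true, so that the receive queue is
released without waiting for a later write to fail with -EPIPE.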