Re: [PATCH] net/9p: fix infinite loop in p9_client_rpc on fatal signal
From: Vasiliy Kovalev
Date: Thu Apr 16 2026 - 08:51:21 EST
Hi Dominique,
On 4/16/26 04:41, Dominique Martinet wrote:
Vasiliy Kovalev wrote on Wed, Apr 15, 2026 at 06:52:37PM +0300:
The same defect is present in stable kernels back to 5.4. On those
kernels the infinite loop is broken earlier by a second SIGKILL from
the parent process (e.g. kill_and_wait() retrying after a timeout),
resulting in a zombie process and a shutdown delay rather than a
permanent D-state hang, but the underlying flaw is the same.
Found by Linux Verification Center (linuxtesting.org) with Syzkaller.
It's unfortunately not that simple, just leaving here will leak things.
Following up on your concern about leaking things (as mentioned in the
thread with Shigeru Yoshida's patch) - I did some tracing with `printk`
inside `p9_req_put` and `p9_tag_remove` to see exactly what happens to
the request lifecycle with my proposed patch vs Shigeru's approach:
diff --git a/net/9p/client.c b/net/9p/client.c
index 748b92d3f0c1..4e8f4e0195b4 100644
--- a/net/9p/client.c
+++ b/net/9p/client.c
@@ -274,10 +274,13 @@ static void p9_tag_remove(struct p9_client *c, struct p9_req_t *r)
spin_lock_irqsave(&c->lock, flags);
idr_remove(&c->reqs, tag);
spin_unlock_irqrestore(&c->lock, flags);
+ printk("tag %d removed from IDR\n", tag);
}
int p9_req_put(struct p9_client *c, struct p9_req_t *r)
{
+ printk("p9_req_put req %p tag %d refcount before=%d\n",
+ r, r->tc.tag, refcount_read(&r->refcount));
if (refcount_dec_and_test(&r->refcount)) {
p9_tag_remove(c, r);
There is a significant difference.
If we use Shigeru's approach [1] (preventing the flush from being
sent/processed using `!fatal_signal_pending(current)` in the condition),
the original request (tag 0) is never freed from the IDR. It creates a
definitive memory and tag leak:
[root@localhost]# ./repro
executing program
[ 50.613200] p9_req_put req ff11000110157cb0 tag 65535 refcount before=3
[ 50.663513] p9_req_put req ff11000110157cb0 tag 65535 refcount before=2
[ 50.665156] p9_req_put req ff11000110157cb0 tag 65535 refcount before=1
[ 50.666593] tag 65535 removed from IDR
executing program
[ 50.688232] p9_req_put req ff11000110157ba0 tag 65535 refcount before=3
[ 50.739634] p9_req_put req ff11000110157ba0 tag 65535 refcount before=2
[ 50.741170] p9_req_put req ff11000110157ba0 tag 65535 refcount before=1
[ 50.742587] tag 65535 removed from IDR
executing program
[ 50.775708] p9_req_put req ff11000110157a90 tag 65535 refcount before=3
[ 50.825104] p9_req_put req ff11000110157a90 tag 65535 refcount before=2
[ 50.827719] p9_req_put req ff11000110157a90 tag 65535 refcount before=1
[ 50.830432] tag 65535 removed from IDR
executing program
...
[1] https://lore.kernel.org/all/20260322111848.2837057-1-syoshida@xxxxxxxxxx/
However, with my patch (jumping to `recalc_sigpending` *inside* the
`P9_TFLUSH` wait loop), the logic works cleanly due to the `refcount`
mechanism:
[root@localhost]# ./repro
executing program
[ 48.875093] p9_req_put req ff110001056fdcb0 tag 65535 refcount before=3
[ 48.925981] p9_req_put req ff110001056fdba0 tag 0 refcount before=3
[ 53.873969] p9_req_put req ff110001056fdba0 tag 0 refcount before=2
[ 53.876638] p9_req_put req ff110001056fdcb0 tag 65535 refcount before=2
[ 53.879882] p9_req_put req ff110001056fdba0 tag 0 refcount before=1
[ 53.882505] tag 0 removed from IDR
[ 53.884184] p9_req_put req ff110001056fdcb0 tag 65535 refcount before=1
[ 53.886905] tag 65535 removed from IDR
executing program
[ 53.914764] p9_req_put req ff1100010ca97a90 tag 65535 refcount before=3
[ 53.967191] p9_req_put req ff1100010ca97980 tag 0 refcount before=3
[ 58.909626] p9_req_put req ff1100010ca97980 tag 0 refcount before=2
[ 58.912420] p9_req_put req ff1100010ca97a90 tag 65535 refcount before=2
[ 58.915358] p9_req_put req ff1100010ca97980 tag 0 refcount before=1
[ 58.918478] tag 0 removed from IDR
[ 58.920730] p9_req_put req ff1100010ca97a90 tag 65535 refcount before=1
[ 58.923715] tag 65535 removed from IDR
executing program
[ 58.946905] p9_req_put req ff1100010c5a3870 tag 65535 refcount before=3
[ 58.998260] p9_req_put req ff1100010c5a3760 tag 0 refcount before=3
[ 63.945960] p9_req_put req ff1100010c5a3760 tag 0 refcount before=2
[ 63.948549] p9_req_put req ff1100010c5a3870 tag 65535 refcount before=2
[ 63.951153] p9_req_put req ff1100010c5a3760 tag 0 refcount before=1
[ 63.953651] tag 0 removed from IDR
[ 63.957254] p9_req_put req ff1100010c5a3870 tag 65535 refcount before=1
[ 63.960655] tag 65535 removed from IDR
executing program
...
1. The dying thread breaks out of the loop and calls `p9_req_put()`.
The refcount correctly drops from 2 to 1.
2. The tag is NOT removed from the IDR yet, preventing any Use-After-Free.
The request is safely "orphaned".
3. Shortly after, when the parent process closes the pipe/socket (or the
transport is torn down due to process exit), the transport layer cleans up
the pending requests. It calls the final `p9_req_put()`, refcount hits 0,
and the tag is properly removed from the IDR.
My logs clearly show `refcount before=1 -> tag 0 removed from IDR` upon
transport teardown, proving there is no permanent leak when breaking out
of the `TFLUSH` loop this way.
Since this safely resolves the permanent D-state hang during
`coredump_wait()` without causing the UAF or leaks that plagued earlier
kernels (pre-`728356dedeff` ("9p: Add refcount to p9_req_t")), do you
think this pragmatic fix is acceptable for mainline and stable?
There was another patch last month, help is welcome,
please see https://lore.kernel.org/r/ab_b-95Ok9zLQCdn@xxxxxxxxxxxxx
While the ideal long-term goal is the asynchronous implementation (as
seen in your 9p-async-v2 branch [2]), this patch serves as a reliable
intermediate solution for a critical regression.
[2] https://github.com/martinetd/linux/commits/9p-async-v2
(By the way, what's with the rekindled interest on this?...)
The issue was identified by Syzkaller, and I have a stable C reproducer
that consistently triggers the hang. There is also a similar report on
Syzbot with a nearly identical reproducer:
https://syzkaller.appspot.com/text?tag=ReproC&x=156aa534580000
--
Thanks,
Vasiliy