Re: [PATCH] net/9p: fix infinite loop in p9_client_rpc on fatal signal
From: Vasiliy Kovalev
Date: Sun Jun 21 2026 - 18:09:04 EST
On 6/21/26 16:00, Dominique Martinet wrote:
Dominique Martinet wrote on Fri, Apr 17, 2026 at 07:52:52AM +0900:
While the ideal long-term goal is the asynchronous implementation (as
seen in your 9p-async-v2 branch [2]), this patch serves as a reliable
intermediate solution for a critical regression.
[2] https://github.com/martinetd/linux/commits/9p-async-v2
iirc one of the problems with the async branch is that the process would
quit immediately on, say, ^C, before the IO has completed, but it's
possible for the server to process the IO (and not the flush) afterwards
and you'd get something that's not supposed to happen, e.g.:
p1                      p2
write(1)
^C/sigkill
flush sent but process exit without waiting for server ack
1 not written yet
                        write(2) in same spot
                        write(2) done
write(1) completes
data isn't 2 as expected after p2 completed
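The ordering above can be replayed with a toy user-space model. This is only a sketch: struct file_cell, server_process_write() and replay_race() are hypothetical names, and the "server" is reduced to a single integer that takes whatever value it processes last.

```c
/* Toy model of the timeline: the server applies writes in the order it
 * processes them, whether or not the issuing client is still alive.
 * All names here are hypothetical and exist only for illustration. */
struct file_cell {
	int data;
};

void server_process_write(struct file_cell *f, int value)
{
	f->data = value;
}

int replay_race(void)
{
	struct file_cell f = { 0 };

	/* p1 sent write(1), got killed, sent a flush and exited without
	 * waiting for the server ack; write(1) is not processed yet. */

	/* p2 writes to the same spot and its write completes. */
	server_process_write(&f, 2);

	/* The server now processes p1's stale write(1). */
	server_process_write(&f, 1);

	return f.data;
}
```

In this model replay_race() returns 1 rather than 2, which is the surprising state described in the timeline.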
So it's quite possible async isn't the way to go, but that there is no
good solution for this
(given this is true even without async on sigkill: if we have something
that works safely, there's no reason to wait only for non-fatal signals...)
Sorry to come back to this after two months but I'm still a bit worried
about this patch, and just came back to it as I'm about to send the PR
to Linus...
And I'm still thinking about the problem above, or rather possible
variants involving cache (e.g. write going through the server, but
client believing it didn't because the response didn't make it in time)
.. But the thing is, I couldn't actually hit the `if
(fatal_signal_pending(current))` you added (adding some print
statement):
- if cache is enabled, the actual I/Os are done by the vfs in the
background, so any kill to user processes won't have any impact (and
thus I guess my main worry about cache is alleviated there)
- with cache=none I'm not sure why I can't hit it, I tried with an
external server, breaking on the write() call while running dd, and
killing dd with SIGKILL a few times but that doesn't appear to be
enough? (task still stuck in write > rpc > flush > rpc, but it doesn't
appear to ever get out of io_wait_event_killable() even when I hammer it
with more signals?)
So, given that my worry with cache is irrelevant (runs in background &
won't ever hit this), I can't seem to hit this with what I consider
to be normal workloads, and assuming it does fix your problems given you
were able to test it... I'll leave it in and send to Linus now but I'd
appreciate clarifications on how to test this more thoroughly as time
permits...
(I honestly probably should drop the patch at this point, but it'll
still be time to revert if I figure something out in the next few weeks
given it's been in -next for almost 2 months already)
Thanks,
Quoting myself from April: "Severity is low and likely unreachable in
production, but it slows down syzkaller — the hung process ties up
a worker slot until the harness kills it by timeout (143s on our
setup)."
The deterministic path is the syzkaller C reproducer:
https://syzkaller.appspot.com/x/repro.c?x=156aa534580000
What it does:
1) mounts 9p with trans=fd, rfdno/wfdno pointing to open fds
with nothing speaking the 9p protocol on the other side
- RFLUSH can never arrive;
2) the 9p rpc from mount parks a thread in io_wait_event_killable;
3) another thread triggers SIGSEGV via prctl(PR_SET_MM) + brk()
corruption -> coredump_wait;
4) the harness's kill_and_wait() fires 5s later.
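Step 1 can be approximated by hand. The following is a rough sketch, not the syzkaller reproducer: build_9p_fd_opts() and mount_silent_9p() are hypothetical helpers, the two pipes standing in for the silent peer are an assumption, and only the trans=fd, rfdno and wfdno mount options come from the description above. The mount call needs root and is expected to block, since nothing ever answers the first request.

```c
#include <stdio.h>
#include <sys/mount.h>
#include <unistd.h>

/* Hypothetical helper: format the 9p mount options for trans=fd.
 * Returns the string length, or -1 if the buffer is too small. */
int build_9p_fd_opts(char *buf, size_t len, int rfd, int wfd)
{
	int n = snprintf(buf, len, "trans=fd,rfdno=%d,wfdno=%d", rfd, wfd);

	if (n < 0 || (size_t)n >= len)
		return -1;
	return n;
}

/* Hypothetical helper: mount 9p over two pipes that nobody services.
 * The kernel reads from to_kernel[0] (never written to) and writes to
 * from_kernel[1] (never read from). All four ends stay open so the
 * transport sees neither EOF nor EPIPE. Expected to block in mount(). */
int mount_silent_9p(const char *target)
{
	int to_kernel[2], from_kernel[2];
	char opts[64];

	if (pipe(to_kernel) < 0 || pipe(from_kernel) < 0)
		return -1;
	if (build_9p_fd_opts(opts, sizeof(opts),
			     to_kernel[0], from_kernel[1]) < 0)
		return -1;
	return mount("none", target, "9p", 0, opts);
}
```

Calling mount_silent_9p() from one thread and then delivering a fatal signal to the process would be the hand-made analogue of steps 2 to 4, but I have not verified that it reaches the same branch as the reproducer.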
To make both branches visible, debug diff on top of the patch:
diff --git a/net/9p/client.c b/net/9p/client.c
--- a/net/9p/client.c
+++ b/net/9p/client.c
@@ -600,8 +600,12 @@ p9_client_rpc(...)
 	if (err == -ERESTARTSYS && c->status == Connected &&
 	    type == P9_TFLUSH) {
-		if (fatal_signal_pending(current))
+		pr_info("9p-dbg: TFLUSH retry hit, fatal=%d\n",
+			fatal_signal_pending(current));
+		if (fatal_signal_pending(current)) {
+			pr_info("9p-dbg: bailing out via recalc_sigpending\n");
 			goto recalc_sigpending;
+		}
 		sigpending = 1;
 		clear_thread_flag(TIF_SIGPENDING);
 		goto again;
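The decision this hunk instruments can be modelled in user space. A minimal sketch, assuming the condition shown in the diff is the whole story: enum flush_action and flush_wait_action() are hypothetical names, and the constants are local stand-ins redefined so the snippet builds outside the kernel.

```c
#include <stdbool.h>

/* Local stand-ins for kernel definitions, redefined for user space. */
#define ERESTARTSYS	512	/* kernel-internal errno, not in <errno.h> */
#define P9_TFLUSH	108	/* 9P Tflush message type */
#define P9_TWRITE	118	/* 9P Twrite message type */

/* Hypothetical labels for the three outcomes of the check. */
enum flush_action {
	FLUSH_NOT_APPLICABLE,	/* condition not met, branch not taken */
	FLUSH_RETRY_WAIT,	/* "goto again": keep waiting for the reply */
	FLUSH_BAIL_OUT,		/* "goto recalc_sigpending": give up */
};

/* Hypothetical model of the branch in p9_client_rpc() after the
 * killable wait returns; "connected" stands for
 * c->status == Connected. */
enum flush_action flush_wait_action(int err, bool connected, int type,
				    bool fatal_pending)
{
	if (err != -ERESTARTSYS || !connected || type != P9_TFLUSH)
		return FLUSH_NOT_APPLICABLE;
	if (fatal_pending)
		return FLUSH_BAIL_OUT;
	return FLUSH_RETRY_WAIT;
}
```

With a fatal signal pending, the model returns FLUSH_BAIL_OUT for an interrupted TFLUSH wait; without the fatal-signal check, the same inputs would always map to FLUSH_RETRY_WAIT, which matches the hang seen on the unpatched kernel.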
In the VM:
# gcc repro.c -o repro
# ./repro
dmesg fires on every iteration:
[root@localhost repro]# ./repro
executing program
[ 126.254054] repro[363]: segfault at 558a42e9ff30 ip 0000558a42e9ff30 sp 00007f3225ee4e80 error 14 likely on CPU 0 (core 0, socket 0)
[ 126.258095] Code: Unable to access opcode bytes at 0x558a42e9ff06.
[ 131.199937] 9pnet: 9p-dbg: TFLUSH retry hit, fatal=1
[ 131.201868] 9pnet: 9p-dbg: bailing out via recalc_sigpending
executing program
[ 131.270955] repro[366]: segfault at 558a42e9ff30 ip 0000558a42e9ff30 sp 00007f3225ee4e80 error 14 likely on CPU 3 (core 3, socket 0)
[ 131.275131] Code: Unable to access opcode bytes at 0x558a42e9ff06.
[ 136.219066] 9pnet: 9p-dbg: TFLUSH retry hit, fatal=1
[ 136.221359] 9pnet: 9p-dbg: bailing out via recalc_sigpending
executing program
[ 136.290772] repro[369]: segfault at 558a42e9ff30 ip 0000558a42e9ff30 sp 00007f3225ee4e80 error 14 likely on CPU 2 (core 2, socket 0)
[ 136.295901] Code: Unable to access opcode bytes at 0x558a42e9ff06.
[ 141.237955] 9pnet: 9p-dbg: TFLUSH retry hit, fatal=1
[ 141.239800] 9pnet: 9p-dbg: bailing out via recalc_sigpending
executing program
...
Without the patch the second pr_info never appears and the task
hangs in D-state.
On a real server I couldn't reproduce this by hand. The reproducer
hits the branch deterministically (logs above); why hand-issued
SIGKILLs don't get there is a kernel signal-delivery question
outside the path this patch touches, and I didn't dig into it.
Feel free to revert if anything turns up in the next few weeks.
--
Thanks,
Vasiliy