Re: [syzbot] [netfs?] INFO: task hung in netfs_unbuffered_write_iter

From: K Prateek Nayak
Date: Thu Mar 27 2025 - 13:46:59 EST


Hello all,

On 3/25/2025 8:28 PM, Dominique Martinet wrote:
> Thanks for the traces.
>
> K Prateek Nayak wrote on Tue, Mar 25, 2025 at 08:19:26PM +0530:
>> w/ revert
>> kworker/100:1-1803 [100] ..... 286.618822: p9_fd_poll: p9_fd_poll rd poll
>> kworker/100:1-1803 [100] ..... 286.618822: p9_fd_poll: p9_fd_request wr poll
>> kworker/100:1-1803 [100] ..... 286.618823: p9_read_work: Data read wait 7
>>
>> new behavior
>> repro-4076 [031] ..... 95.011394: p9_fd_poll: p9_fd_poll rd poll
>> repro-4076 [031] ..... 95.011394: p9_fd_poll: p9_fd_request wr poll
>> repro-4076 [031] ..... 99.731970: p9_client_rpc: Wait event killable (-512)
>
> For me the problem isn't so much that this gets ERESTARTSYS but that it
> never gets to read the 7 bytes that are available?
I'm back after getting distracted for a bit. So here it goes:

On the mainline, the signal is actually a SIGKILL sent by the parent
thread of the repro after a timeout:

repro-4084 [112] d..2. 233.654264: signal_generate: sig=9 errno=0 code=0 comm=repro pid=4085 grp=1 res=0
repro-4084 [112] d..1. 233.654272: signal_generate: sig=9 errno=0 code=0 comm=repro pid=4085 grp=1 res=1
repro-4085 [125] ..... 233.654304: p9_client_rpc: Wait event killable (-512) (1) (0) (118)
repro-4085 [125] ..... 233.654308: p9_client_rpc: Flushed (-512)
repro-4085 [125] ..... 233.654312: p9_client_rpc: Final error (-512)
repro-4085 [125] ..... 233.654313: p9_client_write: p9_client_rpc done
repro-4085 [125] ..... 233.654313: p9_client_write: p9_client_rpc error (-512)
repro-4085 [125] ..... 233.654315: v9fs_issue_write: Issue write done 2 err(-512)

Specifically, this part of the reproducer:

	for (;;) {
		sleep_ms(10);
		if (waitpid(-1, &status, WNOHANG | WAIT_FLAGS) == pid)
			break;
		if (current_time_ms() - start < 5000)
			continue;
		kill_and_wait(pid, &status); /* <------- here */
		break;
	}

Now for the 7 bytes that were not read: if you look at the traces,
every time there is a valid trans_mod->request(), there is a kworker
wakeup:

repro-4102 [100] ..... 286.618772: p9_client_write: p9_client_rpc
repro-4102 [100] ..... 286.618780: p9_fd_request: p9_fd_request
repro-4102 [100] ..... 286.618781: p9_fd_request: p9_fd_request EPOLL
repro-4102 [100] ..... 286.618781: p9_fd_request: p9_fd_request schedule work <----- this
kworker/100:1-1803 [100] ..... 286.618784: p9_write_work: Data write wait 32770

However, for that last 7-byte read, there is in fact no wakeup:

repro-4102 [100] ..... 286.618805: p9_client_write: p9_client_rpc
repro-4102 [100] ..... 286.618817: p9_fd_request: p9_fd_request
repro-4102 [100] ..... 286.618818: p9_fd_request: p9_fd_request EPOLL
repro-4102 [100] ..... 286.618818: p9_fd_poll: p9_fd_poll rd poll
repro-4102 [100] ..... 286.618818: p9_fd_poll: p9_fd_request wr poll <--- No wakeup after
kworker/100:1-1803 [100] ..... 286.618822: p9_fd_poll: p9_fd_poll rd poll
kworker/100:1-1803 [100] ..... 286.618822: p9_fd_poll: p9_fd_request wr poll
kworker/100:1-1803 [100] ..... 286.618823: p9_read_work: Data read wait 7
kworker/100:1-1803 [100] ..... 286.618825: p9_read_work: Data read 7

The kworker then wakes up spuriously, reads the data, realizes it
wasn't supposed to read it, and disconnects the connection, waking up
all the waiters.

But if we haven't sent a request, why are we waiting at all?


> If the repro has already written the bytes in both cases then there's no
> reason to wait 5 seconds here...
>
> OTOH syzbot 9p code is silly and might have been depending on something
> that's not supposed to work e.g. they might be missing a flush or
> equivalent for all I know (I still haven't looked at the repro)


I believe that if c->trans_mod->request(c, req) does not dispatch the
request to the worker for the RPC, the caller must not wait. So here
is something that got the benchmark going in my case; it might be
totally wrong, but it is worth a look:

#syz test: upstream aaec5a95d59615523db03dd53c2052f0a87beea7

diff --git a/net/9p/trans_fd.c b/net/9p/trans_fd.c
index 196060dc6138..148533c993f1 100644
--- a/net/9p/trans_fd.c
+++ b/net/9p/trans_fd.c
@@ -687,7 +687,11 @@ static int p9_fd_request(struct p9_client *client, struct p9_req_t *req)
 	else
 		n = p9_fd_poll(m->client, NULL, NULL);
 
-	if (n & EPOLLOUT && !test_and_set_bit(Wworksched, &m->wsched))
+	/* Request was not sent */
+	if (!(n & EPOLLOUT))
+		return -EIO;
+
+	if (!test_and_set_bit(Wworksched, &m->wsched))
 		schedule_work(&m->wq);
 
 	return 0;