Re: [syzbot] [netfs?] INFO: task hung in netfs_unbuffered_write_iter
From: asmadeus
Date: Fri Mar 28 2025 - 20:01:16 EST
Oleg Nesterov wrote on Fri, Mar 28, 2025 at 06:00:12PM +0100:
> As for the patches from me or Prateek (thanks again!), I think that
> the maintainers should take a look.
>
> But at this point I am mostly confident that the bisected commit aaec5a95d5961
> ("pipe_read: don't wake up the writer if the pipe is still full") is innocent,
> it just reveals yet another problem.
>
> I guess (I hope ;) Prateek agrees.
Right, so your patch sounds better than Prateek's initial blowing
up workaround, but it's a bit weird anyway so let me recap:
- that syz repro has this unnatural pattern where the replies are all
written before the requests are sent
- 9p_read_work() (read worker) has an optimization that if there is no
in fly request then there obviously must be nothing to read (9p is 100%
client initiated, there's no way the server should send something
first), so at this point the reader task is idle
(OTOH, we're checking for rx right at p9_conn_create() before anything
is sent, so it's not like we're consistent on this...)
- p9_fd_request() (sending a new request) has another optimization that
only checks for tx: at this point if another request was already in
flight then the rx task should have a poll going on for rx, and if there
were no in flight request yet then there should be no point in checking
for rx, so p9_fd_request() only kick in the tx worker if there is room
to send
- at this point I don't really get the logic that'll wake the rx thread
up either... p9_pollwake() will trigger p9_poll_workfn() (through
p9_poll_work) which will call p9_poll_mux() and get the reader kicking
again, but I don't know how the waitqueue mechanism is supposed to work
(see p9_pollwait())
I'd have expected the tx task to somehow nudge this on, but it
doesn't?...
- due to the new optimization (aaec5a95d59615 "pipe_read: don't wake up
the writer if the pipe is still full"), that 'if there is room to send'
check started failing and tx thread doesn't start? Because syzbot passed
us a pipe that was already full, or they messed with it after mount?
I'm not clear about this point, but I think it's the key here -- the 9p
"mount by fd" is great for local pseudo filesystems and things like that
but it's been abused by syzbot too much, and I don't want to spend too
much time making sure that any unholy things they do with these fd
works. If possible I'd like to mark that fd unusable by userspace but
I'm honestly doubtful it's possible (if e.g. it was dup'd or something
before mount for example...)
So, yeah, well, okay I don't mind the patch even if it doesn't make
sense with a regular server.
We don't really care about trans fd performance either so it's fine if
it's a bit slower, and the error Prateek added might happen in a real
case if tx queue is full of real requests so I think your approach is
good enough.
If we understand what's happening here I think it's as good as anything
else, but I'd just like it clear in the commit message what syzbot is
doing and why the regression happened
Thank you both for the thorough analysis and follow ups!
--
Dominique Martinet | Asmadeus