Re: BUG: corrupted list in p9_read_work

From: Dmitry Vyukov
Date: Wed Oct 10 2018 - 10:29:54 EST


On Tue, Oct 9, 2018 at 4:09 AM, Dominique Martinet
<asmadeus@xxxxxxxxxxxxx> wrote:
> syzbot wrote on Mon, Oct 08, 2018:
>> syzbot has found a reproducer for the following crash on:
>>
>> HEAD commit: 0854ba5ff5c9 Merge git://git.kernel.org/pub/scm/linux/kern..
>> git tree: upstream
>> console output: https://syzkaller.appspot.com/x/log.txt?x=1514ec06400000
>> kernel config: https://syzkaller.appspot.com/x/.config?x=88e9a8a39dc0be2d
>> dashboard link: https://syzkaller.appspot.com/bug?extid=2222c34dc40b515f30dc
>> compiler: gcc (GCC) 8.0.1 20180413 (experimental)
>> syz repro: https://syzkaller.appspot.com/x/repro.syz?x=10b91685400000
>>
>> IMPORTANT: if you fix the bug, please add the following tag to the commit:
>> Reported-by: syzbot+2222c34dc40b515f30dc@xxxxxxxxxxxxxxxxxxxxxxxxx
>>
>> list_del corruption, ffff88019ae36ee8->next is LIST_POISON1
>> (dead000000000100)
>> ------------[ cut here ]------------
>> [...]
>> list_del include/linux/list.h:125 [inline]
>> p9_read_work+0xab6/0x10e0 net/9p/trans_fd.c:379
>
> Hmm this looks very much like the report from
> syzbot+735d926e9d1317c3310c@xxxxxxxxxxxxxxxxxxxxxxxxx
> which should have been fixed by Tomas in 9f476d7c540cb
> ("net/9p/trans_fd.c: fix race by holding the lock")...
>
> It looks like another double list_del, looking at the code again there
> actually are other ways this could happen around connection errors.
> For example,
> - p9_read_work receives something and lookup works... meanwhile
> - p9_write_work fails to write and calls p9_conn_cancel, which deletes
> from the req_list without waiting for other works to finish (could also
> happen in p9_poll_mux)
> - p9_read_work finishes processing the read and deletes from list again
>
> For this one the simplest fix would probably be to just not
> list_del/call p9_client_cb at all if m->r?req->status isn't
> REQ_STATUS_ERROR in p9_read_work after the "got new packet" debug print,
> and frankly I think that's saner so I'll send a patch shortly doing
> that, but I have zero confidence there aren't similar bugs around, the
> tcp code is so messy... Most of the syzbot reports recently have been
> around trans_fd which I don't think is used much in real life, and this
> is not really motivating (i.e. I think it would probably need a more
> extensive rewrite but nobody cares) :/
>
>
> Dmitry, on that note, do you think syzbot could possibly test other
> transports somehow? rdma or virtio cannot be faked as easily as passing
> a fd around, but I'd be very interested in seeing these flayed a bit.
>
> (I'm also curious what logic is used to generate the syz tests, the
> write$P9_Rxx replies have nothing to do with what the client would
> expect so it probably doesn't test very far; this test in particular
> does not even get past the initial P9_TVERSION that the client would
> expect immediately after mount, so it's basically only testing logic
> around packet handling on error... Or if we're accepting a RREADDIR in
> reply to TVERSION we have bigger problems, and now I'm looking at it I
> think we just might never check that....... I'll look at that for the
> next cycle)
>
>
> Back to the current patch, since as I said I am not confident this is a
> good enough fix for the current bug, will I get notified if the bug
> happens again once the patch hits linux-next with the Reported-by tag ?
> (I don't have the setup necessary to run a syz repro as there is no C
> repro, and won't have much time to do that setup sorry)

Yes, the bug will be reported again if it still happens after the
patch is merged (not just into linux-next, but into all tested trees,
but it does not matter much). So marking bugs as fixed tentatively is
fine if that's our best guess.

But note that syzbot can test fixes itself on request. It boils down
to just giving it the patch and the base tree:
https://github.com/google/syzkaller/blob/master/docs/syzbot.md#testing-patches