Re: BUG: corrupted list in p9_read_work

From: Dmitry Vyukov
Date: Wed Oct 10 2018 - 10:04:21 EST


On Tue, Oct 9, 2018 at 4:09 AM, Dominique Martinet
<asmadeus@xxxxxxxxxxxxx> wrote:
> syzbot wrote on Mon, Oct 08, 2018:
>> syzbot has found a reproducer for the following crash on:
>>
>> HEAD commit: 0854ba5ff5c9 Merge git://git.kernel.org/pub/scm/linux/kern..
>> git tree: upstream
>> console output: https://syzkaller.appspot.com/x/log.txt?x=1514ec06400000
>> kernel config: https://syzkaller.appspot.com/x/.config?x=88e9a8a39dc0be2d
>> dashboard link: https://syzkaller.appspot.com/bug?extid=2222c34dc40b515f30dc
>> compiler: gcc (GCC) 8.0.1 20180413 (experimental)
>> syz repro: https://syzkaller.appspot.com/x/repro.syz?x=10b91685400000
>>
>> IMPORTANT: if you fix the bug, please add the following tag to the commit:
>> Reported-by: syzbot+2222c34dc40b515f30dc@xxxxxxxxxxxxxxxxxxxxxxxxx
>>
>> list_del corruption, ffff88019ae36ee8->next is LIST_POISON1
>> (dead000000000100)
>> ------------[ cut here ]------------
>> [...]
>> list_del include/linux/list.h:125 [inline]
>> p9_read_work+0xab6/0x10e0 net/9p/trans_fd.c:379
>
> Hmm this looks very much like the report from
> syzbot+735d926e9d1317c3310c@xxxxxxxxxxxxxxxxxxxxxxxxx
> which should have been fixed by Tomas in 9f476d7c540cb
> ("net/9p/trans_fd.c: fix race by holding the lock")...
>
> It looks like another double list_del, looking at the code again there
> actually are other ways this could happen around connection errors.
> For example,
> - p9_read_work receives something and lookup works... meanwhile
> - p9_write_work fails to write and calls p9_conn_cancel, which deletes
> from the req_list without waiting for other works to finish (could also
> happen in p9_poll_mux)
> - p9_read_work finishes processing the read and deletes from list again
>
> For this one the simplest fix would probably be to just not
> list_del/call p9_client_cb at all if m->r?req->status isn't
> REQ_STATUS_ERROR in p9_read_work after the "got new packet" debug print,
> and frankly I think that's saner so I'll send a patch shortly doing
> that, but I have zero confidence there aren't similar bugs around, the
> tcp code is so messy... Most of the syzbot reports recently have been
> around trans_fd which I don't think is used much in real life, and this
> is not really motivating (i.e. I think it would probably need a more
> extensive rewrite but nobody cares) :/
>
>
> Dmitry, on that note, do you think syzbot could possibly test other
> transports somehow? rdma or virtio cannot be faked as easily as passing
> a fd around, but I'd be very interested in seeing these flayed a bit.

Hi Dominique,

How can they be faked?
If we could create a private rdma/virtio stub instance per test
process, then we could I think easily use that instance for 9p. But is
it possible?
Testing on real hardware is mostly outside of our priorities at the
moment. I mean syzkaller itself can be run on anything, and one could
extend descriptions to use a known rdma interface and run on a real
hardware. But we can't afford this at the moment.
As far as I understand RDMA maintainers run syzkaller on real
hardware, but I don't know if they are up to including 9p into
testing. +Leon