Re: [PATCH] block: fix blk_get_backing_dev_info() crash, use bdev->bd_queue

From: Dan Williams
Date: Sun Jan 08 2017 - 15:50:35 EST


On Sun, Jan 8, 2017 at 11:46 AM, Jan Kara <jack@xxxxxxx> wrote:
> On Fri 06-01-17 09:45:45, Dan Williams wrote:
>> On Fri, Jan 6, 2017 at 2:23 AM, Jan Kara <jack@xxxxxxx> wrote:
>> > On Thu 05-01-17 17:17:55, Dan Williams wrote:
>> >> The ->bd_queue member of struct block_device was added in commit
>> >> 87192a2a49c4 ("vfs: cache request_queue in struct block_device") in
>> >> v3.3. However, blk_get_backing_dev_info() has been using
>> >> bdev_get_queue() and grabbing the request_queue through the gendisk
>> >> since before the git era.
>> >>
>> >> At final __blkdev_put() time ->bd_disk is cleared while ->bd_queue is
>> >> not. The queue remains valid until the final put of the parent disk.
>> >>
>> >> The following crash signature results from blk_get_backing_dev_info()
>> >> trying to lookup the queue through ->bd_disk after the final put of the
>> >> block device. Simply switch bdev_get_queue() to use ->bd_queue directly
>> >> which is guaranteed to still be valid at invalidate_partition() time.
>> >>
>> >> BUG: unable to handle kernel NULL pointer dereference at 0000000000000568
>> >> IP: blk_get_backing_dev_info+0x10/0x20
>> >> [..]
>> >> Call Trace:
>> >> __inode_attach_wb+0x3a7/0x5d0
>> >> __filemap_fdatawrite_range+0xf8/0x100
>> >> filemap_write_and_wait+0x40/0x90
>> >> fsync_bdev+0x54/0x60
>> >> ? bdget_disk+0x30/0x40
>> >> invalidate_partition+0x24/0x50
>> >> del_gendisk+0xfa/0x230
>> >
>> > So we have a similar reports of the same problem. E.g.:
>> >
>> > http://www.spinics.net/lists/linux-fsdevel/msg105153.html
>> >
>> > However I kind of miss how your patch would fix all those cases. The
>> > principial problem is that inode_to_bdi() called on block device inode
>> > wants to get the backing_dev_info however on last close of a block device
>> > we do put_disk() and thus the request queue containing backing_dev_info
>> > does not have to be around at that time. In your case you are lucky enough
>> > to have the containing disk still around but that's not the case for all
>> > inode_to_bdi() users (see e.g. the report I referenced) and your patch
>> > would change relatively straightforward NULL pointer dereference to rather
>> > subtle use-after-free issue
>>
>> True. If there are other cases that don't hold their own queue
>> reference this patch makes things worse.
>>
>> > so I disagree with going down this path.
>>
>> I still think this patch is the right thing to do, but it needs to
>> come after the wider guarantee that having an active bdev reference
>> guarantees that the queue and backing_dev_info are still allocated.
>>
>> > So what I think needs to be done is that we make backing_dev_info
>> > independently allocated structure with different lifetime rules to gendisk
>> > or request_queue - definitely I want it to live as long as block device
>> > inode exists. However it needs more thought what the exact lifetime rules
>> > will be.
>>
>> Hmm, why does it need to be separately allocated?
>>
>> Something like this, passes the libnvdimm unit tests: (non-whitespace
>> damaged version attached)
>
> So the problem with this approach is that request queue will be pinned while
> bdev inode exists. And how long that is is impossible to predict or influence
> from userspace so e.g. you cannot remove device driver from memory and even
> unplugging USB after it has been unmounted would suddently go via a path of
> "device removed while it is used" which can have unexpected consequences. I
> guess Jens or Christoph will know more about the details...

We do have the "block, fs: reliably communicate bdev end-of-life"
effort that I need to revisit:

http://www.spinics.net/lists/linux-fsdevel/msg93312.html

...but I don't immediately see how keeping the request_queue around
longer makes the situation worse?

> I have prototyped patches which split backing_dev_info from request_queue
> and it was not even that difficult in the end. I'll give those patches some
> testing and post them for comments...

As long as the crashes are addressed I don't much care which solution
goes upstream.

That said I think it is unfortunate that we introduced ->bd_queue in
v3.3, but most users are still pointer chasing through
->bd_disk->queue. I'll see if the 0day robot can find any performance
benefit to this change outside of trying to fix a bdev-unplug crash.