Re: [PATCH] dax: allow DAX to look up an inode's block device

From: Dan Williams
Date: Tue Feb 02 2016 - 18:39:22 EST


[ adding btrfs, resend with the correct list address ]

On Tue, Feb 2, 2016 at 3:19 PM, Al Viro <viro@xxxxxxxxxxxxxxxxxx> wrote:
> On Tue, Feb 02, 2016 at 04:11:42PM -0700, Ross Zwisler wrote:
>
>> However, for raw block devices and for XFS with a real-time device, the
>> value in inode->i_sb->s_bdev is not correct. With the code as it is
>> currently written, an fsync or msync to a DAX enabled raw block device will
>> cause a NULL pointer dereference kernel BUG. For this to work correctly we
>> need to ask the block device or filesystem what struct block_device is
>> appropriate for our inode.
>>
>> To that end, add a get_bdev(struct inode *) entry point to struct
>> super_operations. If this function pointer is non-NULL, this notifies DAX
>> that it needs to use it to look up the correct block_device. If
>> i_sb->s_op->get_bdev is NULL, DAX will default to inode->i_sb->s_bdev.
>
> Umm... It assumes that bdev will stay pinned for as long as inode is
> referenced, presumably? If so, that needs to be documented (and verified
> for existing fs instances). In principle, multi-disk fs might want to
> support things like "silently move the inodes backed by that disk to other
> ones"...

I assume btrfs is the only fs we have that might reassign the bdev for
a given inode on the fly? Hopefully we don't need anything stronger
than rcu_read_lock() to pin the result as valid.

At least in this case, the initial user is dax-fsync, where the
->get_bdev() answer should be static for the life of the inode, and
btrfs does not currently interface with DAX. But yes, we need to spell
out the expected semantics.
