Re: [PATCH] blkdev: Fix blkdev_open to release the bdev on error

From: Al Viro
Date: Tue Dec 08 2015 - 02:59:04 EST


On Mon, Dec 07, 2015 at 10:49:05AM -0800, Linus Torvalds wrote:
> On Mon, Dec 7, 2015 at 10:05 AM, Suzuki K. Poulose
> <suzuki.poulose@xxxxxxx> wrote:
> > blkdev_open() doesn't release the bdev, it attached to a given
> > inode, if blkdev_get() fails (e.g, due to absence of a device).
> > This can cause kernel crashes when the original filesystem
> > tries to flush the data during evict_inode.
>
> Ugh. This code is a mess. Al, can you please comment?
>
> So what happens is that when "blkdev_get()" fails, it will do a
> bdput() on the bdev.

Yes.

> But blkdev_open() hasn't done a bdget(). It's done a bd_acquire().
> Which will do the whole "add inodes to bd_inodes".

Yes.

> And yes,
> bd_forget() will undo that.

It would, but there's no reason to drop the cached pointer to bdev.

> IOW, the path looks simple and apparently fixes an oops, but I'd like
> much more of an explanation for what happens, because it all feels
> wrong to me. Why doesn't the bdput() end up undoing the bd_acquire()
> properly?

Because it doesn't work that way. ->i_bdev is just a cached result of
lookup by device number. Once found, it stays there for as long as
neither the struct inode nor struct block_device are freed.

It does *NOT* pin struct block_device. Note that we have two kinds of block
device inodes - ones coallocated with struct block_device (those are unique
per major/minor, live on bdevfs and can't be seen directly) and ones aliasing
the first kind. Those live on normal filesystems.

Pagecache lives in ->i_data of the bdevfs inode; aliasing ones have their
->i_data empty and ->i_mapping pointing to ->i_data of the bdevfs inode.
That guarantees the cache coherency between those guys.

Now, simply having ->i_bdev point to struct block_device does not affect
the lifetime of the latter in any way. All aliases are dissociated from
block_device when bdevfs inode is evicted; block_device is dissociated
from aliasing inode when that aliasing inode is evicted. bdev_lock
provides the atomicity there.

_Opening_ an alias (any of them) does pin block_device down. So when an
aliasing inode is open, we can safely use its ->i_mapping in normal
pagecache-related code and have everything work correctly. Accessing
->i_mapping when inode isn't open is valid only if filesystem code is
sure it's pointing to its own ->i_data (and pointless in any case).

And that's what 9p ->evict_inode() is doing - it's trying to evict not
the pages in its ->i_data (which would be empty for block device), but
the pages in its ->i_mapping. IOW, the pagecache shared by all aliasing
inodes. Which is obviously bogus, regardless of lifetime rules violation -
mknod /tmp/foo b 8 1 && dd count=1 </tmp/foo >/dev/null && rm /tmp/foo
should not blow the cache of /dev/sda1, no matter which fs type we happen to
use for /tmp. And 9p will try to do just that.

Fortunately, no other ->evict_inode() instance is doing anything of that sort,
so we just need to fix that bogosity in 9p one.

As for the bdev eviction, bdput() acts exactly like iput(). In fact, it is
iput() of the coallocated bdevfs inode. It can stay around with zero
refcount; same as any other inode, memory pressure would eventually push
them out. It does *NOT* pin the driver when not opened, BTW.

Anyway, the fix for 9p bogosity follows; it definitely fixes a bug there,
and I'm fairly sure that it fixes the bug that had been reported.
A confirmation would be nice, of course...

Signed-off-by: Al Viro <viro@xxxxxxxxxxxxxxxxxx>
---
diff --git a/fs/9p/vfs_inode.c b/fs/9p/vfs_inode.c
index 699941e..5110785 100644
--- a/fs/9p/vfs_inode.c
+++ b/fs/9p/vfs_inode.c
@@ -451,9 +451,9 @@ void v9fs_evict_inode(struct inode *inode)
{
struct v9fs_inode *v9inode = V9FS_I(inode);

- truncate_inode_pages_final(inode->i_mapping);
+ truncate_inode_pages_final(&inode->i_data);
clear_inode(inode);
- filemap_fdatawrite(inode->i_mapping);
+ filemap_fdatawrite(&inode->i_data);

v9fs_cache_inode_put_cookie(inode);
/* clunk the fid stashed in writeback_fid */
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/