Re: [PATCH] block: flush writeback dwork before detaching a bdev inode from it

From: Jan Kara
Date: Mon Jun 20 2016 - 09:32:00 EST


Hi,

On Fri 17-06-16 12:04:05, Tejun Heo wrote:
> 43d1c0eb7e11 ("block: detach bdev inode from its wb in
> __blkdev_put()") detached bdev inode from its wb as the bdev inode may
> outlive the underlying bdi and thus the wb. This is accomplished by
> invoking inode_detach_wb() from __blkdev_put(); however, while the
> inode can't be dirtied by the time control reaches there, that doesn't
> guarantee that writeback isn't already in progress on the inode. This
> can lead to the inode being disassociated from its wb while writeback
> operation is in flight causing oopses like the following.

<snip>

> Fix it by flushing the wb dwork before detaching the inode. Combined
> with the fact that the inode can no longer be dirtied, this guarantees
> that no writeback operation can be in flight or initiated.

Sorry for the late reply but now when thinking about the patch I don't
think it is quite right. Writeback can happen from other contexts than just
the worker one (e.g. kswapd can do writeback, or in some out-of-memory
situations we may punt to doing writeback directly instead of calling the
worker, or sync(2) calls fdatawrite() for block device inode directly when
iterating through blockdev superblock). So flushing the workqueue IMHO is
not covering 100% of cases.

I wanted to suggest to use inode_wait_for_writeback() which will make sure
writeback is done with the inode. However we effectively already do this
by calling bdev_write_inode() which calls writeback_single_inode() in
WB_SYNC_ALL mode. So by the time that call completes we are sure writeback
code is not looking at the inode. *However* what I think is happening is
that sync_blockdev() writes all the dirty pages, bdev_write_inode() writes
the inode and clears all dirty bits, however the inode still stays in the
b_dirty / b_io list of the wb because it has I_DIRTY_TIME set. Subsequently
once flusher work runs, it finds the inode, looks at it and boom. And the
problem seems to be that write_inode_now(inode, true) does not result in
I_DIRTY_TIME being cleared.

Attached patch should fix this issue - it is compile-tested only. Dmitry,
can you check whether this patch fixes the issue for you as well?

Honza
--
Jan Kara <jack@xxxxxxxx>
SUSE Labs, CR