Re: Writeback bug causing writeback stalls

From: Jan Kara
Date: Fri May 22 2020 - 10:41:05 EST


Hi!

On Fri 22-05-20 11:57:42, Martijn Coenen wrote:
<snip>

> So, the sequence of events is something like this. Let's assume the inode is
> already on b_dirty_time for valid reasons. Then:
>
> CPU1 CPU2
> fuse_flush()
> write_inode_now()
> writeback_single_inode()
> sets I_SYNC
> __writeback_single_inode()
> writes back data
> clears inode dirty flags
> unlocks inode
> calls mark_inode_dirty_sync()
> sets I_DIRTY_SYNC, but doesn't
> update wb list because I_SYNC is
> still set
> write() // somebody else writes
> mark_inode_dirty(I_DIRTY_PAGES)
> sets I_DIRTY_PAGES on i_state
> doesn't update wb list,
> because I_SYNC set
> locks inode again
> sees inode is still dirty,
> doesn't touch WB list
> clears I_SYNC
>
> So now we have an inode on b_dirty_time with I_DIRTY_PAGES | I_DIRTY_SYNC set,
> and subsequent calls to mark_inode_dirty() with either I_DIRTY_PAGES or
> I_DIRTY_SYNC will do nothing to change that. The flusher won't touch
> the inode either,
> because it's not on a b_dirty or b_io list.

Thanks for the detailed analysis and explanation! I agree that what you
describe is a bug in the writeback code.

> The easiest way to fix this, I think, is to call requeue_inode() at the end of
> writeback_single_inode(), much like it is called from writeback_sb_inodes().
> However, requeue_inode() has the following ominous warning:
>
> /*
> * Find proper writeback list for the inode depending on its current state and
> * possibly also change of its state while we were doing writeback. Here we
> * handle things such as livelock prevention or fairness of writeback among
> * inodes. This function can be called only by flusher thread - noone else
> * processes all inodes in writeback lists and requeueing inodes behind flusher
> * thread's back can have unexpected consequences.
> */
>
> Obviously this is very critical code both from a correctness and a performance
> point of view, so I wanted to run this by the maintainers and folks who have
> contributed to this code first.

Sadly, the fix won't be so easy. The main problem with calling
requeue_inode() from writeback_single_inode() is that if there's parallel
sync(2) call, inode->i_io_list is used to track all inodes that need writing
before sync(2) can complete. So requeueing inodes in parallel while sync(2)
runs can result in breaking data integrity guarantees of it. But I agree
we need to find some mechanism to safely move inode to appropriate dirty
list reasonably quickly.

Probably I'd add an inode state flag telling that inode is queued for
writeback by flush worker and we won't touch dirty lists in that case,
otherwise we are safe to update current writeback list as needed. I'll work
on fixing this as when I was reading the code I've noticed there are other
quirks in the code as well. Thanks for the report!

Honza

--
Jan Kara <jack@xxxxxxxx>
SUSE Labs, CR