Re: [RFC PATCH 2/2] f2fs: fix allocation failure

From: Jaegeuk Kim
Date: Fri Oct 14 2016 - 16:38:10 EST


On Fri, Oct 14, 2016 at 10:09:29PM +0800, Chao Yu wrote:
...

> >> Finally it needs to update the dirty time of inode into inode page,
> >> and writeback the page, however, before that, we didn't count the inode
> >> as imeta data. So f2fs won't be aware of dirty metadata page count is
> >> exceeded watermark of GC, result in encountering panic when allocating
> >> free segment.
> >>
> >> There is an easy way to produce this bug:
> >> 1. mount with lazytime option
> >> 2. fragment space
> >> 3. touch all files in the image
> >> 4. umount
> >
> > I think modifying has_not_enough_secs() is enough like this.
>
> Seems it won't solve this problem as I tested, the root cause here is that if
> huge number of inode updates due to time changes, actually inodes won't be set
> dirty as we return directly if flags is I_DIRTY_TIME in f2fs_dirty_inode, then
> once inode cache is been shrunk, inodes in lru list will be set dirty in iput:
>
> In iput()
> if (inode->i_nlink && (inode->i_state & I_DIRTY_TIME)) {
> atomic_inc(&inode->i_count);
> inode->i_state &= ~I_DIRTY_TIME;
> spin_unlock(&inode->i_lock);
> trace_writeback_lazytime_iput(inode);
> mark_inode_dirty_sync(inode);
> goto retry;
>
> After that once someone calls write_checkpoint(), if number of dirty imeta data
> is exceeded remain blocks in free segments, we will encounter this bug.
>
> In order to fix this bug, I try to account these delayed dirtied inodes to
> detect actual dirty metadata number, by this way we can set delayed dirtied
> inode dirty and flush them in advance to avoid the dirty metadata number
> exceeding blocks number in free segments, finally allocation failure issue can
> be solved.

Yup, I did understand like that before. But actually, I tested this patch and
unfortunately I got a deadlock when running fsstress quickly. So, this makes me
rethink the whole design in terms of how to reuse the existing codes.

I think we're able to consider inode cache apart from FS consistency. The
IMETA list was actually keeping all the dirty inodes for FS consistency, but
it seems we don't need to store all of them which will consume free segments
dramatically during checkpoint.

I wrote a patch to add dirty inodes into IMETA list selectively.
The goal is to avoid adding dirty inodes by mark_inode_dirty_sync() called by
vfs, not by f2fs. Instead, those changes will be done by f2fs_write_inode().

Could you take a look at this?