Re: [PATCH 7/8] bdi: add a new writeback list for sync V3

From: Jan Kara
Date: Wed Apr 01 2015 - 04:44:33 EST


Hi Josef,

On Fri 20-03-15 14:49:18, Josef Bacik wrote:
> From: Dave Chinner <dchinner@xxxxxxxxxx>
>
> wait_sb_inodes() current does a walk of all inodes in the filesystem
> to find dirty one to wait on during sync. This is highly
> inefficient and wastes a lot of CPU when there are lots of clean
> cached inodes that we don't need to wait on.
>
> To avoid this "all inode" walk, we need to track inodes that are
> currently under writeback that we need to wait for. We do this by
> adding inodes to a writeback list on the bdi when the mapping is
> first tagged as having pages under writeback. wait_sb_inodes() can
> then walk this list of "inodes under IO" and wait specifically just
> for the inodes that the current sync(2) needs to wait for.
>
> To avoid needing all the realted locking to be safe against
> interrupts, Jan Kara suggested that we be lazy about removal from
> the writeback list. That is, we don't remove inodes from the
> writeback list on IO completion, but do it directly during a
> wait_sb_inodes() walk.
>
> This means that the a rare sync(2) call will have some work to do
> skipping clean inodes However, in the current problem case of
> concurrent sync workloads, concurrent wait_sb_inodes() calls only
> walk the very recently dispatched inodes and hence should have very
> little work to do.
>
> This also means that we have to remove the inodes from the writeback
> list during eviction. Do this in inode_wait_for_writeback() once
> all writeback on the inode is complete.
>
> Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx>
> Signed-off-by: Josef Bacik <jbacik@xxxxxx>
> ---
> V2->V3:
> -noticed a lockdep warning because of mapping->tree_lock-->wb.list_lock
> dependancy and the need for tree_lock to be under IRQ. I re-arranged it so the
> dependancy is now wb.list_lock-->mapping->tree_lock and it seems to be ok, Jan
> could you look hard at this?
>
...
> @@ -1299,7 +1337,9 @@ EXPORT_SYMBOL(__mark_inode_dirty);
> */
> static void wait_sb_inodes(struct super_block *sb)
> {
> - struct inode *inode, *old_inode = NULL;
> + struct backing_dev_info *bdi = sb->s_bdi;
> + struct inode *old_inode = NULL;
> + LIST_HEAD(sync_list);
>
> /*
> * We need to be protected against the filesystem going from
> @@ -1307,28 +1347,59 @@ static void wait_sb_inodes(struct super_block *sb)
> */
> WARN_ON(!rwsem_is_locked(&sb->s_umount));
>
> - mutex_lock(&sb->s_sync_lock);
> - spin_lock(&sb->s_inode_list_lock);
> -
> /*
> - * Data integrity sync. Must wait for all pages under writeback,
> - * because there may have been pages dirtied before our sync
> - * call, but which had writeout started before we write it out.
> - * In which case, the inode may not be on the dirty list, but
> - * we still have to wait for that writeout.
> + * Data integrity sync. Must wait for all pages under writeback, because
> + * there may have been pages dirtied before our sync call, but which had
> + * writeout started before we write it out. In which case, the inode
> + * may not be on the dirty list, but we still have to wait for that
> + * writeout.
> + *
> + * To avoid syncing inodes put under IO after we have started here,
> + * splice the io list to a temporary list head and walk that. Newly
> + * dirtied inodes will go onto the primary list so we won't wait for
> + * them. This is safe to do as we are serialised by the s_sync_lock,
> + * so we'll complete processing the complete list before the next
> + * sync operation repeats the splice-and-walk process.
> + *
> + * Inodes that have pages under writeback after we've finished the wait
> + * may or may not be on the primary list. They had pages under put IO
> + * after we started our wait, so we need to make sure the next sync or
> + * IO completion treats them correctly, Move them back to the primary
> + * list and restart the walk.
> + *
> + * Inodes that are clean after we have waited for them don't belong on
> + * any list, so simply remove them from the sync list and move onwards.
> */
> - list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
> + mutex_lock(&sb->s_sync_lock);
> + spin_lock(&bdi->wb.list_lock);
> + list_splice_init(&bdi->wb.b_writeback, &sync_list);
> +
> + while (!list_empty(&sync_list)) {
> + struct inode *inode = list_first_entry(&sync_list, struct inode,
> + i_wb_list);
> struct address_space *mapping = inode->i_mapping;
>
> + /*
> + * We are lazy on IO completion and don't remove inodes from the
> + * list when they are clean. Detect that immediately and skip
> + * inodes we don't ahve to wait on.
> + */
> + if (!mapping_tagged(mapping, PAGECACHE_TAG_WRITEBACK)) {
> + list_del_init(&inode->i_wb_list);
> + cond_resched_lock(&bdi->wb.list_lock);
> + continue;
> + }
> +
> spin_lock(&inode->i_lock);
> - if ((inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) ||
> - (mapping->nrpages == 0)) {
> + if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) {
> + list_move(&inode->i_wb_list, &bdi->wb.b_writeback);
> spin_unlock(&inode->i_lock);
> + cond_resched_lock(&bdi->wb.list_lock);
> continue;
> }
> __iget(inode);
> spin_unlock(&inode->i_lock);
> - spin_unlock(&sb->s_inode_list_lock);
> + spin_unlock(&bdi->wb.list_lock);
Hum, so you seem to keep that dance with inode references around (keeping
refs to current and prev inode). However that isn't necessary anymore once
you always pull the first inode from the list. So just grab reference to
the inode before dropping i_lock and put it when you are done with it...

> @@ -1345,9 +1416,21 @@ static void wait_sb_inodes(struct super_block *sb)
>
> cond_resched();
>
> - spin_lock(&sb->s_inode_list_lock);
> + /*
> + * the inode has been written back now, so check whether we
> + * still have pages under IO and move it back to the primary
> + * list if necessary.
> + */
> + spin_lock(&bdi->wb.list_lock);
> + spin_lock_irq(&mapping->tree_lock);
Please comment that tree_lock is really necessary here. We need it
because bdi_mark_inode_writeback() needn't have bothered with grabbing
list_lock and may have exited without doing anything when it saw inode on
our spliced list.

Honza
--
Jan Kara <jack@xxxxxxx>
SUSE Labs, CR
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/