Re: [PATCH 17/19] btrfs: shrink delayed allocation size in HMZONED mode

From: Josef Bacik
Date: Thu Jun 13 2019 - 11:07:18 EST


On Fri, Jun 07, 2019 at 10:10:23PM +0900, Naohiro Aota wrote:
> In a write-heavy workload, the following scenario can occur:
>
> 1. Pages #0 to #2 (and their corresponding extent region) are marked
>    dirty and become candidates for delayed allocation
>
>    pages          0 1 2 3 4
>    dirty          o o o - -
>    towrite        - - - - -
>    delayed alloc  o o o - -
>
> 2. extent_write_cache_pages() marks the dirty pages as TOWRITE
>
>    pages          0 1 2 3 4
>    dirty          o o o - -
>    towrite        o o o - -
>    delayed alloc  o o o - -
>
> 3. Meanwhile, another write dirties page #3 and page #4
>
>    pages          0 1 2 3 4
>    dirty          o o o o o
>    towrite        o o o - -
>    delayed alloc  o o o o o
>
> 4. find_lock_delalloc_range() decides to allocate a region covering
>    page #0 to page #4
> 5. However, extent_write_cache_pages() only initiates writes for the
>    TOWRITE-tagged pages (#0 to #2)
>
> So the above process leaves page #3 and page #4 behind. Usually, the
> periodic dirty flush kicks off write IOs for pages #3 and #4. However, if
> a subvolume is mounted at this point, the mount process takes the s_umount
> write lock, which blocks the periodic flush from running.
>
> To deal with the problem, shrink the delayed allocation region so that it
> covers only the pages that are expected to be written.
>
> Signed-off-by: Naohiro Aota <naohiro.aota@xxxxxxx>
> ---
> fs/btrfs/extent_io.c | 27 +++++++++++++++++++++++++++
> 1 file changed, 27 insertions(+)
>
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index c73c69e2bef4..ea582ff85c73 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -3310,6 +3310,33 @@ static noinline_for_stack int writepage_delalloc(struct inode *inode,
>  			delalloc_start = delalloc_end + 1;
>  			continue;
>  		}
> +
> +		if (btrfs_fs_incompat(btrfs_sb(inode->i_sb), HMZONED) &&
> +		    (wbc->sync_mode == WB_SYNC_ALL || wbc->tagged_writepages) &&
> +		    ((delalloc_start >> PAGE_SHIFT) <
> +		     (delalloc_end >> PAGE_SHIFT))) {
> +			unsigned long i;
> +			unsigned long end_index = delalloc_end >> PAGE_SHIFT;
> +
> +			for (i = delalloc_start >> PAGE_SHIFT;
> +			     i <= end_index; i++)
> +				if (!xa_get_mark(&inode->i_mapping->i_pages, i,
> +						 PAGECACHE_TAG_TOWRITE))
> +					break;
> +
> +			if (i <= end_index) {
> +				u64 unlock_start = (u64)i << PAGE_SHIFT;
> +
> +				if (i == delalloc_start >> PAGE_SHIFT)
> +					unlock_start += PAGE_SIZE;
> +
> +				unlock_extent(tree, unlock_start, delalloc_end);
> +				__unlock_for_delalloc(inode, page, unlock_start,
> +						      delalloc_end);
> +				delalloc_end = unlock_start - 1;
> +			}
> +		}
> +

Helper please. Really, I want all this hmzoned stuff segregated as much as
possible, so that when I'm debugging or cleaning other stuff up I can easily
say "oh this is for zoned devices, it doesn't matter." Thanks,

Josef
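
As an illustration of the kind of helper being asked for, the range-trimming
logic from the hunk above can be modelled in user space. This is only a sketch
under stated assumptions, not code from the patch series:
shrink_delalloc_to_towrite(), MODEL_PAGE_SHIFT and MODEL_PAGE_SIZE are
hypothetical names, a plain bool array stands in for the
PAGECACHE_TAG_TOWRITE marks that the patch reads with xa_get_mark(), and the
extent and page unlocking is left out. The caller would still be responsible
for the HMZONED and sync_mode checks. The function returns the new inclusive
end of the delalloc range.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for the kernel's PAGE_SHIFT / PAGE_SIZE (4 KiB pages). */
#define MODEL_PAGE_SHIFT 12
#define MODEL_PAGE_SIZE  (1ULL << MODEL_PAGE_SHIFT)

/*
 * Hypothetical helper modelling the hunk under review.
 *
 * towrite[i] is true when page i carries the TOWRITE tag (assumption: a
 * bool array replaces the page cache xarray). delalloc_start and
 * delalloc_end are inclusive byte offsets, as in writepage_delalloc().
 *
 * Returns the (possibly shrunk) inclusive end of the delalloc range.
 */
uint64_t shrink_delalloc_to_towrite(const bool *towrite,
				    uint64_t delalloc_start,
				    uint64_t delalloc_end)
{
	uint64_t first = delalloc_start >> MODEL_PAGE_SHIFT;
	uint64_t last = delalloc_end >> MODEL_PAGE_SHIFT;
	uint64_t unlock_start;
	uint64_t i;

	assert(delalloc_start <= delalloc_end);

	/* A range within a single page is left alone, as in the patch. */
	if (first >= last)
		return delalloc_end;

	/* Find the first page in the range that is not tagged TOWRITE. */
	for (i = first; i <= last; i++)
		if (!towrite[i])
			break;

	/* Every page is tagged: nothing to trim. */
	if (i > last)
		return delalloc_end;

	unlock_start = i << MODEL_PAGE_SHIFT;

	/* Mirrors the patch: the first page of the range is always kept. */
	if (i == first)
		unlock_start += MODEL_PAGE_SIZE;

	/* The kernel code would unlock [unlock_start, delalloc_end] here. */
	return unlock_start - 1;
}
```

With the scenario from the commit message (pages #0 to #2 tagged TOWRITE, a
delalloc range covering pages #0 to #4), calling the function with start 0 and
end 5 * MODEL_PAGE_SIZE - 1 yields 3 * MODEL_PAGE_SIZE - 1, so pages #3 and #4
drop out of the allocation.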