Re: fiemap is slow on btrfs on files with multiple extents

From: Filipe Manana
Date: Wed Sep 21 2022 - 05:01:31 EST


On Wed, Sep 21, 2022 at 8:30 AM Dominique MARTINET
<dominique.martinet@xxxxxxxxxxxxxxxxx> wrote:
>
> Filipe Manana wrote on Thu, Sep 01, 2022 at 02:25:12PM +0100:
> > It took me a bit more than I expected, but here is the patchset to make fiemap
> > (and lseek) much more efficient on btrfs:
> >
> > https://lore.kernel.org/linux-btrfs/cover.1662022922.git.fdmanana@xxxxxxxx/
> >
> > And also available in this git branch:
> >
> > https://git.kernel.org/pub/scm/linux/kernel/git/fdmanana/linux.git/log/?h=lseek_fiemap_scalability
>
> Thanks a lot!
> Sorry for the slow reply, it took me a while to find time to get back to
> my test setup.
>
> There's still this weird behaviour that later calls to cp are slower
> than the first, but the improvement is so good that it doesn't matter
> quite as much -- I haven't been able to reproduce the rcu stalls in qemu
> so I can't say for sure but they probably won't be a problem anymore.
>
> From a quick look with perf record/report the difference still seems to
> stem from fiemap (time spent there goes from 4.13 to 45.20%), so there
> is still more processing once the file is (at least partially) in cache,
> but it has gotten much better.
>
>
> (tests run on a laptop so assume some inconsistency with thermal
> throttling etc)
>
> /mnt/t/t # compsize bigfile
> Processed 1 file, 194955 regular extents (199583 refs), 0 inline.
> Type       Perc     Disk Usage   Uncompressed Referenced
> TOTAL       15%         3.7G          23G          23G
> none       100%         477M         477M         514M
> zstd        14%         3.2G          23G          23G
> /mnt/t/t # time cp bigfile /dev/null
> real 0m 44.52s
> user 0m 0.49s
> sys 0m 32.91s
> /mnt/t/t # time cp bigfile /dev/null
> real 0m 46.81s
> user 0m 0.55s
> sys 0m 35.63s
> /mnt/t/t # time cp bigfile /dev/null
> real 1m 13.63s
> user 0m 0.55s
> sys 1m 1.89s
> /mnt/t/t # time cp bigfile /dev/null
> real 1m 13.44s
> user 0m 0.53s
> sys 1m 2.08s
>
>
> For comparison here's how it was on 6.0-rc2 your branch is based on:
> /mnt/t/t # time cp atde-test /dev/null
> real 0m 46.17s
> user 0m 0.60s
> sys 0m 33.21s
> /mnt/t/t # time cp atde-test /dev/null
> real 5m 35.92s
> user 0m 0.57s
> sys 5m 24.20s
>
>
>
> If you're curious, the report blames set_extent_bit and
> clear_state_bit as follows; get_extent_skip_holes is completely gone, but
> I wouldn't necessarily say this needs much more time spent on it.

get_extent_skip_holes() no longer exists, so 0% of time spent there :)

Yes, I know. The reason you see so much time spent on lock_extent_bits()
is basically because cp does too many fiemap calls with a very small
extent buffer size.
I pointed that out here:

https://lore.kernel.org/linux-btrfs/CAL3q7H5NSVicm7nYBJ7x8fFkDpno8z3PYt5aPU43Bajc1H0h1Q@xxxxxxxxxxxxxx/

Making cp use a larger buffer (say 500 or 1000 extents) would make it
a lot better.
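To illustrate the buffer-size effect, here is a rough Python sketch that issues the FIEMAP ioctl with a configurable fm_extent_count, so the whole mapping is fetched in few round trips instead of many small ones. The helper name and batch size are mine, purely for illustration; cp itself is C and does not use this code:

```python
import fcntl
import os
import struct

FS_IOC_FIEMAP = 0xC020660B        # _IOWR('f', 11, struct fiemap)
FIEMAP_FLAG_SYNC = 0x00000001     # sync the file before mapping
FIEMAP_EXTENT_LAST = 0x00000001   # set on the final extent of the file
HDR_FMT = "<QQLLLL"               # fm_start, fm_length, fm_flags,
                                  # fm_mapped_extents, fm_extent_count, reserved
HDR_LEN = struct.calcsize(HDR_FMT)  # 32 bytes
EXTENT_LEN = 56                     # sizeof(struct fiemap_extent)

def count_extents(path, batch=512):
    """Count a file's extents, asking for `batch` extents per ioctl.

    Returns None when the filesystem does not support FIEMAP.
    A larger `batch` means fewer ioctl calls for heavily fragmented files."""
    total = 0
    start = 0
    fd = os.open(path, os.O_RDONLY)
    try:
        while True:
            buf = bytearray(HDR_LEN + batch * EXTENT_LEN)
            struct.pack_into(HDR_FMT, buf, 0, start,
                             0xFFFFFFFFFFFFFFFF - start,
                             FIEMAP_FLAG_SYNC, 0, batch, 0)
            try:
                fcntl.ioctl(fd, FS_IOC_FIEMAP, buf)
            except OSError:
                return None           # e.g. EOPNOTSUPP on tmpfs
            mapped = struct.unpack_from(HDR_FMT, buf, 0)[3]
            if mapped == 0:
                return total
            total += mapped
            last = HDR_LEN + (mapped - 1) * EXTENT_LEN
            logical, _phys, length = struct.unpack_from("<QQQ", buf, last)
            flags = struct.unpack_from("<L", buf, last + 40)[0]
            if flags & FIEMAP_EXTENT_LAST:
                return total
            start = logical + length  # resume after the last extent returned
    finally:
        os.close(fd)
```

With batch=1 a 200k-extent file needs ~200k ioctl calls, each taking and dropping the extent locks; with batch=512 it needs only a few hundred.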
But as I pointed out there, cp was changed last year to stop using
fiemap to detect holes; it now uses lseek with SEEK_HOLE. So over time,
everyone will get a cp version that no longer uses fiemap.
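The SEEK_HOLE/SEEK_DATA approach that newer cp versions use can be sketched roughly as below. The helper name is mine and cp's real implementation is C; this is only an illustration of the lseek-based hole detection:

```python
import os

def data_ranges(path):
    """Yield (start, end) byte offsets of the data segments of a file,
    skipping holes, by alternating SEEK_DATA and SEEK_HOLE."""
    fd = os.open(path, os.O_RDONLY)
    try:
        size = os.fstat(fd).st_size
        off = 0
        while off < size:
            try:
                data = os.lseek(fd, off, os.SEEK_DATA)
            except OSError:       # ENXIO: no data past off (trailing hole)
                return
            hole = os.lseek(fd, data, os.SEEK_HOLE)  # every file has a
            yield (data, hole)                       # virtual hole at EOF
            off = hole
    finally:
        os.close(fd)
```

On filesystems without sparse-seek support the VFS falls back to treating the whole file as one data segment, so the loop still works; it just reports a single range.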

Also, for the cp case, since it does many read and fiemap calls on the
source file, the following patch probably helps too:

https://lore.kernel.org/linux-btrfs/20220819024408.9714-1-ethanlien@xxxxxxxxxxxx/

It will make the io tree smaller. That should land in 6.1 too.

Thanks for testing and the report.

>
> 45.20%--extent_fiemap
> |
> |--31.02%--lock_extent_bits
> | |
> | --30.78%--set_extent_bit
> | |
> | |--6.93%--insert_state
> | | |
> | | --0.70%--set_state_bits
> | |
> | |--4.25%--alloc_extent_state
> | | |
> | | --3.86%--kmem_cache_alloc
> | |
> | |--2.77%--_raw_spin_lock
> | | |
> | | --1.23%--preempt_count_add
> | |
> | |--2.48%--rb_next
> | |
> | |--1.13%--_raw_spin_unlock
> | | |
> | | --0.55%--preempt_count_sub
> | |
> | --0.92%--set_state_bits
> |
> --13.80%--__clear_extent_bit
> |
> --13.30%--clear_state_bit
> |
> |--3.48%--_raw_spin_unlock_irqrestore
> |
> |--2.45%--merge_state.part.0
> | |
> | --1.57%--rb_next
> |
> |--2.14%--__slab_free
> | |
> | --1.26%--cmpxchg_double_slab.constprop.0.isra.0
> |
> |--0.74%--free_extent_state
> |
> |--0.70%--kmem_cache_free
> |
> |--0.69%--btrfs_clear_delalloc_extent
> |
> --0.52%--rb_next
>
>
>
> Thanks!
> --
> Dominique