Re: [linus:master] [readahead] ab4443fe3c: vm-scalability.throughput -21.4% regression
From: Jan Kara
Date: Thu Mar 07 2024 - 04:24:26 EST
On Mon 04-03-24 13:35:10, Yin, Fengwei wrote:
> Hi Jan,
>
> On 3/4/2024 12:59 PM, Yujie Liu wrote:
> > From the perf profile, we can see that contention on the folio LRU lock
> > becomes more intense. We also did a simple one-file "dd" test. It looks
> > like low-order folios are more likely to be allocated after commit
> > ab4443fe3c (Fengwei will help provide the data soon). Therefore, the
> > average folio size decreases while the total number of folios increases,
> > which leads to taking the LRU lock more often.
>
> I did the following testing:
> I put an xfs image in tmpfs, mounted it to /mnt and created a 12G test
> file (sparse-file), then used one process to read it on an Ice Lake
> machine with 256G of system memory. So we can be sure we are doing a
> sequential file read with no page reclaim triggered.
>
> At the same time, I profiled the distribution of the order parameter
> passed to filemap_alloc_folio() to understand which large folio orders
> are generated for the page cache.
>
> Here is what we got:
>
> - Commit f0b7a0d1d46625db:
> $ dd bs=4k if=/mnt/sparse-file of=/dev/null
> 3145728+0 records in
> 3145728+0 records out
> 12884901888 bytes (13 GB, 12 GiB) copied, 2.52208 s, 5.01 GB/s
>
> filemap_alloc_folio
> page order : count distribution
> 0 : 57 | |
> 1 : 0 | |
> 2 : 20 | |
> 3 : 2 | |
> 4 : 4 | |
> 5 : 98300 |****************************************|
>
> - Commit ab4443fe3ca6:
> $ dd bs=4k if=/mnt/sparse-file of=/dev/null
> 3145728+0 records in
> 3145728+0 records out
> 12884901888 bytes (13 GB, 12 GiB) copied, 2.51469 s, 5.1 GB/s
>
> filemap_alloc_folio
> page order : count distribution
> 0 : 21 | |
> 1 : 0 | |
> 2 : 196615 |****************************************|
> 3 : 98303 |******************* |
> 4 : 98303 |******************* |
>
>
> Even though the file read throughput is almost the same, the
> distribution of orders looks like a regression with ab4443fe3ca6 (more
> smaller-order page cache folios are generated than with the parent
> commit). Thanks.
Thanks for testing! This is an interesting result and certainly unexpected
to me. The readahead code allocates naturally aligned pages, so based on
the distribution of allocations it seems that before commit ab4443fe3ca6
the readahead window was aligned to at least 32 pages (128KB) and we thus
allocated order-5 pages. After the commit, the readahead window somehow
ended up aligned only to 20 modulo 32. To follow natural alignment and fill
the 128KB readahead window we allocated an order-2 page (getting us to
offset 24 modulo 32), then an order-3 page (getting us to offset 0 modulo
32), then an order-4 page (anything larger would no longer fit in the 128KB
readahead window), and finally an order-2 page to finish filling the
readahead window.
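To make that arithmetic easier to follow, here is a minimal userspace
sketch (my own illustration, not the actual readahead code) of the rule
described above: at each step pick the largest folio order that is both
naturally aligned at the current page index and still fits in the remaining
window. The starting offset of 20 modulo 32 and the 32-page window are
taken from the numbers above; everything else is made up for illustration.

#include <stdio.h>

int main(void)
{
        unsigned long index = 20;       /* window start, modulo 32 */
        unsigned long remaining = 32;   /* 128KB window = 32 x 4KB pages */

        while (remaining) {
                unsigned int order = 0;

                /*
                 * Grow the order while the next larger folio would still be
                 * naturally aligned at 'index' and would still fit into the
                 * remaining part of the readahead window.
                 */
                while ((index & ((2UL << order) - 1)) == 0 &&
                       (2UL << order) <= remaining)
                        order++;

                printf("index %2lu mod 32 -> order %u (%lu pages)\n",
                       index % 32, order, 1UL << order);

                index += 1UL << order;
                remaining -= 1UL << order;
        }
        return 0;
}

Compiled and run, this prints orders 2, 3, 4, 2 for indices 20, 24, 0 and
16 modulo 32, i.e. two order-2, one order-3 and one order-4 folio per
32-page window, which matches the order distribution seen with
ab4443fe3ca6.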
Now I'm not 100% sure why the readahead window alignment changed with the
different rounding used when placing the readahead mark - probably it is
some artifact of the readahead window being tiny in the beginning, before
we scale it up (I'll verify by tracing whether everything ends up looking
correct with the current code). So I don't expect this is a problem in
ab4443fe3ca6 as such, but it exposes the issue that the readahead page
insertion code should perhaps strive for better alignment of the readahead
window with the logical file offset, even at the cost of occasionally
performing somewhat shorter readahead. I'll look into this once I dig out
of the huge heap of email after vacation...
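As a purely hypothetical illustration of that alignment-vs-shorter-readahead
trade-off (not a proposed patch, and not how the kernel currently behaves),
one way to re-align a drifted window would be to shorten it so that it ends
on the natural alignment boundary of the preferred order; the values below
(start 20, 32-page window, order-5 preferred size) are just the numbers
from the example above.

#include <stdio.h>

int main(void)
{
        unsigned long start = 20;       /* misaligned window start (pages) */
        unsigned long size  = 32;       /* requested window size (pages) */
        unsigned long align = 32;       /* preferred folio size: order 5 */
        unsigned long end   = start + size;

        /* Trim the window so it ends on an alignment boundary, if possible. */
        unsigned long trimmed = end & ~(align - 1);
        if (trimmed > start)
                end = trimmed;

        printf("window [%lu, %lu): %lu pages, next window %saligned\n",
               start, end, end - start, (end % align) ? "mis" : "");
        return 0;
}

With these numbers the 32-page request is trimmed to a 12-page window
ending at index 32, so the following window starts aligned and could again
be filled with order-5 folios.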
Honza
--
Jan Kara <jack@xxxxxxxx>
SUSE Labs, CR