Re: [PATCH 08/15] mm/filemap: add read support for RWF_UNCACHED

From: Kirill A. Shutemov
Date: Mon Nov 11 2024 - 04:15:39 EST


On Sun, Nov 10, 2024 at 08:28:00AM -0700, Jens Axboe wrote:
> Add RWF_UNCACHED as a read operation flag, which means that any data
> read will be removed from the page cache upon completion. Uses the page
> cache to synchronize, and simply prunes folios that were instantiated
> when the operation completes. While it would be possible to use private
> pages for this, using the page cache as synchronization is handy for a
> variety of reasons:
>
> 1) No special truncate magic is needed
> 2) Async buffered reads need some place to serialize, using the page
> cache is a lot easier than writing extra code for this
> 3) The pruning cost is pretty reasonable
>
> and the code to support this is much simpler as a result.
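
[ For anyone trying this from userspace: it's just a buffered read with
  an extra flag passed to preadv2(). A minimal, untested sketch follows;
  RWF_UNCACHED is introduced by this series, so the value below is a
  placeholder until the uapi header carries it:

	#define _GNU_SOURCE
	#include <fcntl.h>
	#include <stdio.h>
	#include <sys/uio.h>
	#include <unistd.h>

	#ifndef RWF_UNCACHED
	#define RWF_UNCACHED	0x00000040	/* placeholder value */
	#endif

	int main(void)
	{
		static char buf[65536];
		struct iovec iov = { .iov_base = buf, .iov_len = sizeof(buf) };
		int fd = open("datafile", O_RDONLY);

		if (fd < 0)
			return 1;
		/* reads through the page cache; folios instantiated for
		 * this read are pruned again when it completes */
		if (preadv2(fd, &iov, 1, 0, RWF_UNCACHED) < 0)
			perror("preadv2");
		close(fd);
		return 0;
	}

  Unlike O_DIRECT, no alignment restrictions apply to buf or the file
  offset. ]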
>
> You can think of uncached buffered IO as being the much more attractive
> cousin of O_DIRECT - it has none of the restrictions of O_DIRECT. Yes,
> it will copy the data, but unlike regular buffered IO, it doesn't run
> into the unpredictability of the page cache in terms of reclaim. As an
> example, on a test box with 32 drives, reading them with buffered IO
> looks as follows:
>
> Reading bs 65536, uncached 0
> 1s: 145945MB/sec
> 2s: 158067MB/sec
> 3s: 157007MB/sec
> 4s: 148622MB/sec
> 5s: 118824MB/sec
> 6s: 70494MB/sec
> 7s: 41754MB/sec
> 8s: 90811MB/sec
> 9s: 92204MB/sec
> 10s: 95178MB/sec
> 11s: 95488MB/sec
> 12s: 95552MB/sec
> 13s: 96275MB/sec
>
> where it's quite easy to see where the page cache filled up and
> performance went from good to erratic, finally settling at a much
> lower rate. Looking at top while this is ongoing, we see:
>
>     PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
>    7535 root      20   0  267004      0      0 S  3199   0.0   8:40.65 uncached
>    3326 root      20   0       0      0      0 R 100.0   0.0   0:16.40 kswapd4
>    3327 root      20   0       0      0      0 R 100.0   0.0   0:17.22 kswapd5
>    3328 root      20   0       0      0      0 R 100.0   0.0   0:13.29 kswapd6
>    3332 root      20   0       0      0      0 R 100.0   0.0   0:11.11 kswapd10
>    3339 root      20   0       0      0      0 R 100.0   0.0   0:16.25 kswapd17
>    3348 root      20   0       0      0      0 R 100.0   0.0   0:16.40 kswapd26
>    3343 root      20   0       0      0      0 R 100.0   0.0   0:16.30 kswapd21
>    3344 root      20   0       0      0      0 R 100.0   0.0   0:11.92 kswapd22
>    3349 root      20   0       0      0      0 R 100.0   0.0   0:16.28 kswapd27
>    3352 root      20   0       0      0      0 R  99.7   0.0   0:11.89 kswapd30
>    3353 root      20   0       0      0      0 R  96.7   0.0   0:16.04 kswapd31
>    3329 root      20   0       0      0      0 R  96.4   0.0   0:11.41 kswapd7
>    3345 root      20   0       0      0      0 R  96.4   0.0   0:13.40 kswapd23
>    3330 root      20   0       0      0      0 S  91.1   0.0   0:08.28 kswapd8
>    3350 root      20   0       0      0      0 S  86.8   0.0   0:11.13 kswapd28
>    3325 root      20   0       0      0      0 S  76.3   0.0   0:07.43 kswapd3
>    3341 root      20   0       0      0      0 S  74.7   0.0   0:08.85 kswapd19
>    3334 root      20   0       0      0      0 S  71.7   0.0   0:10.04 kswapd12
>    3351 root      20   0       0      0      0 R  60.5   0.0   0:09.59 kswapd29
>    3323 root      20   0       0      0      0 R  57.6   0.0   0:11.50 kswapd1
> [...]
>
> which shows a partial list of the 32 kswapd threads, most of them
> running full tilt and burning ~28 full CPU cores.
>
> If the same test case is run with RWF_UNCACHED set for the buffered read,
> the output looks as follows:
>
> Reading bs 65536, uncached 1
> 1s: 153144MB/sec
> 2s: 156760MB/sec
> 3s: 158110MB/sec
> 4s: 158009MB/sec
> 5s: 158043MB/sec
> 6s: 157638MB/sec
> 7s: 157999MB/sec
> 8s: 158024MB/sec
> 9s: 157764MB/sec
> 10s: 157477MB/sec
> 11s: 157417MB/sec
> 12s: 157455MB/sec
> 13s: 157233MB/sec
> 14s: 156692MB/sec
>
> which is just chugging along at ~155GB/sec of read performance. Looking
> at top, we see:
>
>     PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
>    7961 root      20   0  267004      0      0 S  3180   0.0   5:37.95 uncached
>    8024 axboe     20   0   14292   4096      0 R   1.0   0.0   0:00.13 top
>
> where just the test app is using CPU; no reclaim is taking place
> outside of the main thread. Not only is performance 65% better, it's
> also using half the CPU to do it.
>
> Signed-off-by: Jens Axboe <axboe@xxxxxxxxx>
> ---
>  mm/filemap.c | 18 ++++++++++++++++--
>  mm/swap.c    |  2 ++
>  2 files changed, 18 insertions(+), 2 deletions(-)
>
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 38dc94b761b7..bd698340ef24 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -2474,6 +2474,8 @@ static int filemap_create_folio(struct kiocb *iocb,
>  	folio = filemap_alloc_folio(mapping_gfp_mask(mapping), min_order);
>  	if (!folio)
>  		return -ENOMEM;
> +	if (iocb->ki_flags & IOCB_UNCACHED)
> +		__folio_set_uncached(folio);
>
>  	/*
>  	 * Protect against truncate / hole punch. Grabbing invalidate_lock
> @@ -2519,6 +2521,8 @@ static int filemap_readahead(struct kiocb *iocb, struct file *file,
>
>  	if (iocb->ki_flags & IOCB_NOIO)
>  		return -EAGAIN;
> +	if (iocb->ki_flags & IOCB_UNCACHED)
> +		ractl.uncached = 1;
>  	page_cache_async_ra(&ractl, folio, last_index - folio->index);
>  	return 0;
>  }
> @@ -2548,6 +2552,8 @@ static int filemap_get_pages(struct kiocb *iocb, size_t count,
>  			return -EAGAIN;
>  		if (iocb->ki_flags & IOCB_NOWAIT)
>  			flags = memalloc_noio_save();
> +		if (iocb->ki_flags & IOCB_UNCACHED)
> +			ractl.uncached = 1;
>  		page_cache_sync_ra(&ractl, last_index - index);
>  		if (iocb->ki_flags & IOCB_NOWAIT)
>  			memalloc_noio_restore(flags);
> @@ -2706,8 +2712,16 @@ ssize_t filemap_read(struct kiocb *iocb, struct iov_iter *iter,
>  			}
>  		}
>  put_folios:
> -		for (i = 0; i < folio_batch_count(&fbatch); i++)
> -			folio_put(fbatch.folios[i]);
> +		for (i = 0; i < folio_batch_count(&fbatch); i++) {
> +			struct folio *folio = fbatch.folios[i];
> +
> +			if (folio_test_uncached(folio)) {
> +				folio_lock(folio);
> +				invalidate_complete_folio2(mapping, folio, 0);
> +				folio_unlock(folio);

I am not sure it is safe. What happens if it races with a page fault?

The only current caller of invalidate_complete_folio2() unmaps the folio
explicitly before calling it, and the folio lock prevents re-faulting.

I think we need to give up PG_uncached if we see folio_mapped(), and
maybe also mark the page accessed.
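
A rough, untested sketch of that idea for the put_folios loop:

	if (folio_test_uncached(folio)) {
		folio_lock(folio);
		if (folio_mapped(folio)) {
			/* Raced with a fault; keep the folio cached
			 * and age it normally instead */
			folio_clear_uncached(folio);
			folio_mark_accessed(folio);
		} else {
			invalidate_complete_folio2(mapping, folio, 0);
		}
		folio_unlock(folio);
	}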

> +			}
> +			folio_put(folio);
> +		}
>  		folio_batch_init(&fbatch);
>  	} while (iov_iter_count(iter) && iocb->ki_pos < isize && !error);
>
> diff --git a/mm/swap.c b/mm/swap.c
> index 835bdf324b76..f2457acae383 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -472,6 +472,8 @@ static void folio_inc_refs(struct folio *folio)
>   */
>  void folio_mark_accessed(struct folio *folio)
>  {
> +	if (folio_test_uncached(folio))
> +		return;

	if (folio_test_uncached(folio)) {
		if (folio_mapped(folio))
			folio_clear_uncached(folio);
		else
			return;
	}
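
That way a folio that got faulted in while the read was in flight falls
back to normal LRU aging instead of staying flagged as uncached.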

>  	if (lru_gen_enabled()) {
>  		folio_inc_refs(folio);
>  		return;
> --
> 2.45.2
>

--
Kiryl Shutsemau / Kirill A. Shutemov