Re: [PATCH] mm: limit filemap_fault readahead to VMA boundaries

From: Jan Kara

Date: Wed Apr 22 2026 - 06:15:15 EST

On Tue 21-04-26 17:56:07, Frederick Mayle wrote:
> When a file mapping covers a strict subset of a file, an access to the
> mapping can trigger readahead of file pages outside the mapped region.
> Readahead is meant to prefetch pages likely to be accessed soon, but
> these pages aren't accessible via the same means, so it is fair to say we
> don't have a good indicator they'll be accessed soon. Take an ELF file
> for example: An access to the end of a program's read-only segment isn't
> a sign that nearby file contents will be accessed next (they are likely
> to be mapped discontiguously, or not at all). The pressure from loading
> these pages into the cache can evict more useful pages.
>
> To improve the behavior, make three changes:
>
> * Introduce a new readahead_control option, max_index, as a hard limit
> on the readahead. The existing file_ra_state->size can't be used as a
> limit because it is more of a hint and can be increased by various
> heuristics.
> * Set readahead_control->max_index to the end of the VMA in all of the
> readahead paths that can be triggered from a fault on a file mapping
> (both "sync" and "async" readahead).
> * Limit the read-around range start to the VMA's start.
>
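The three changes listed above boil down to a pair of clamps. A minimal userspace sketch of that arithmetic follows, assuming page-granular indexes. The struct vma_sketch type and the helper names are hypothetical stand-ins for illustration, not kernel API; the real code works on struct vm_area_struct and struct readahead_control as shown in the diff.

```c
#include <assert.h>
#include <limits.h>

/* Hypothetical stand-in for the two VMA fields the patch uses. */
struct vma_sketch {
	unsigned long vm_pgoff;	/* file page index of the first mapped page */
	unsigned long nr_pages;	/* what vma_pages() would return */
};

/* Last file page index covered by the VMA (the proposed max_index). */
static unsigned long vma_last_index(const struct vma_sketch *vma)
{
	assert(vma->nr_pages > 0);
	return vma->vm_pgoff + vma->nr_pages - 1;
}

/* Read-around start: centered on the fault, but not before the VMA start. */
static unsigned long readaround_start(const struct vma_sketch *vma,
				      unsigned long fault_pgoff,
				      unsigned long ra_pages)
{
	unsigned long half = ra_pages / 2;
	unsigned long start = fault_pgoff > half ? fault_pgoff - half : 0;

	return start > vma->vm_pgoff ? start : vma->vm_pgoff;
}

/* Last index readahead may touch: end of file, capped by max_index. */
static unsigned long readahead_end(unsigned long file_pages,
				   unsigned long max_index)
{
	unsigned long end_index = file_pages - 1;

	return end_index > max_index ? max_index : end_index;
}
```

Passing ULONG_MAX as max_index leaves the end-of-file limit in force, which corresponds to the default that DEFINE_READAHEAD sets for paths that are not fault-driven.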
> Note that these changes only affect readahead triggered in the context
> of a fault; they do not affect readahead triggered by read syscalls. If
> a user mixes the two types of accesses, the behavior is expected to be
> the following: if a fault causes readahead and places a PG_readahead
> marker and then a read(2) syscall hits the PG_readahead marker, the
> resulting async readahead *will not* be limited to the VMA end.
> Conversely, if a read(2) syscall places a PG_readahead marker and then a
> fault hits the marker, the async readahead *will* be limited to the VMA
> end.
>
> There is an edge case that the above motivation glosses over: A single
> file mapping might be backed by multiple VMAs. For example, a whole file
> could be mapped RW, then part of the mapping made RO using mprotect.
> This patch would hurt performance of a sequential read of such a
> mapping, the degree depending on how fragmented the VMAs are. A usage
> pattern like that is likely rare and already suffering from sub-optimal
> performance because, e.g., the fragmented VMAs limit the fault-around,
> so each VMA boundary in a sequential read would cause a minor fault.
> Still, this would make it worse. See a previous discussion of this topic
> at [1].
>
> Tested by mapping and reading a small subset of a large file, then using
> the cachestat syscall to verify the number of cached pages didn't exceed
> the mapping size.
>
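A check along the lines described above might look roughly like the following. This is a sketch under assumptions, not the test that was actually run: the helper name, the local struct names and the sparse test file are hypothetical, the fallback syscall number 451 is assumed to match the target architecture, and the cache drop via posix_fadvise() is only best effort.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef __NR_cachestat
#define __NR_cachestat 451	/* assumption: value used on most architectures */
#endif

/* Local mirrors of the UAPI structs cachestat_range and cachestat. */
struct cs_range { uint64_t off, len; };
struct cs_counts {
	uint64_t nr_cache, nr_dirty, nr_writeback, nr_evicted, nr_recently_evicted;
};

/*
 * Create a sparse file of file_pages pages, map map_pages pages of it
 * starting at page map_start, touch every mapped page, and return the
 * number of pages of the whole file found in the page cache.
 * Returns -1 on any failure (including a kernel without cachestat).
 */
static long cached_after_mmap_read(const char *path, size_t file_pages,
				   size_t map_start, size_t map_pages)
{
	size_t psz = (size_t)sysconf(_SC_PAGESIZE);
	struct cs_range range = { 0, 0 };	/* len == 0: query to end of file */
	struct cs_counts cs;
	volatile unsigned char sink = 0;
	unsigned char *p;
	long ret = -1;
	int fd;

	assert(map_pages > 0 && map_start + map_pages <= file_pages);

	fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
	if (fd < 0)
		return -1;
	if (ftruncate(fd, (off_t)(file_pages * psz)) != 0)
		goto out;

	/* Best effort: start from a cold cache for this file. */
	fsync(fd);
	posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);

	p = mmap(NULL, map_pages * psz, PROT_READ, MAP_PRIVATE, fd,
		 (off_t)(map_start * psz));
	if (p == MAP_FAILED)
		goto out;
	for (size_t i = 0; i < map_pages; i++)
		sink += p[i * psz];	/* fault in each mapped page */
	munmap(p, map_pages * psz);

	if (syscall(__NR_cachestat, fd, &range, &cs, 0) == 0)
		ret = (long)cs.nr_cache;
out:
	close(fd);
	unlink(path);
	return ret;
}
```

With the patch applied and a cold cache, the returned count would be expected not to exceed map_pages; without it, readahead past the mapping can push the count higher.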
> In practical scenarios, the effect depends on the specific file and
> usage. Sometimes there is no effect at all, but, for some ELF files in
> Android, we see ~20% fewer pages pulled into the cache.
>
> A comprehensive performance evaluation hasn't been done, but, in
> addition to the anecdotal memory savings mentioned above, a benchmark
> was run with fio 3.38, showing neutral-looking results:
>
> /data/local/tmp/fio --version
>
> fio --name=mmap_test --ioengine=mmap --rw=read --bs=4k \
> --offset=1G --size=1G --filesize=3G --numjobs=1 \
> --filename=testfile.bin
>
> Before: 4366.6 MiB/s (avg of 3459, 4592, 4613, 4697, 4472)
> After: 4444.0 MiB/s (avg of 4633, 4655, 4511, 4571, 3850)
> +1.7%
>
> Same, with --ioengine=mmap --rw=randread
>
> Before: 445.6 MiB/s (avg of 446, 447, 442, 452, 441)
> After: 447.0 MiB/s (avg of 447, 446, 446, 451, 445)
> +0.3%
>
> Same, with --ioengine=psync --rw=read
>
> Before: 3086.6 MiB/s (avg of 3122, 3094, 3066, 3094, 3057)
> After: 3084.6 MiB/s (avg of 3039, 3103, 3103, 3084, 3094)
> -0.06%
>
> Same, with --ioengine=psync --rw=randread
>
> Before: 2226.4 MiB/s (avg of 2256, 2183, 2207, 2265, 2221)
> After: 2231.4 MiB/s (avg of 2236, 2241, 2236, 2193, 2251)
> +0.2%
>
> [1] https://lore.kernel.org/all/ivnv2crd3et76p2nx7oszuqhzzah756oecn5yuykzqfkqzoygw@yvnlkhjjssoz/
>
> Cc: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
> Cc: David Hildenbrand <david@xxxxxxxxxx>
> Cc: Jan Kara <jack@xxxxxxx>
> Cc: Kalesh Singh <kaleshsingh@xxxxxxxxxx>
> Cc: Lorenzo Stoakes <ljs@xxxxxxxxxx>
> Cc: Matthew Wilcox <willy@xxxxxxxxxxxxx>
> Cc: Suren Baghdasaryan <surenb@xxxxxxxxxx>
> Cc: android-mm@xxxxxxxxxx
> Cc: kernel-team@xxxxxxxxxxx
> Signed-off-by: Frederick Mayle <fmayle@xxxxxxxxxx>

Looks good to me. Thanks! Feel free to add:

Reviewed-by: Jan Kara <jack@xxxxxxx>

Honza

> ---
> include/linux/pagemap.h | 2 ++
> mm/filemap.c | 4 ++++
> mm/readahead.c | 5 ++++-
> 3 files changed, 10 insertions(+), 1 deletion(-)
>
> diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
> index ec442af3f886..cc628050bc5e 100644
> --- a/include/linux/pagemap.h
> +++ b/include/linux/pagemap.h
> @@ -1366,6 +1366,7 @@ struct readahead_control {
> bool dropbehind;
> bool _workingset;
> unsigned long _pflags;
> + unsigned long max_index; /* limit readahead to i<=max_index */
> };
>
> #define DEFINE_READAHEAD(ractl, f, r, m, i) \
> @@ -1374,6 +1375,7 @@ struct readahead_control {
> .mapping = m, \
> .ra = r, \
> ._index = i, \
> + .max_index = ULONG_MAX, \
> }
>
> #define VM_READAHEAD_PAGES (SZ_128K / PAGE_SIZE)
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 4e636647100c..d2f6bef12f58 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -3314,6 +3314,8 @@ static struct file *do_sync_mmap_readahead(struct vm_fault *vmf)
> bool force_thp_readahead = false;
> unsigned short mmap_miss;
>
> + ractl.max_index = vmf->vma->vm_pgoff + vma_pages(vmf->vma) - 1;
> +
> /* Use the readahead code, even if readahead is disabled */
> if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) &&
> (vm_flags & VM_HUGEPAGE) && HPAGE_PMD_ORDER <= MAX_PAGECACHE_ORDER)
> @@ -3396,6 +3398,7 @@ static struct file *do_sync_mmap_readahead(struct vm_fault *vmf)
> * mmap read-around
> */
> ra->start = max_t(long, 0, vmf->pgoff - ra->ra_pages / 2);
> + ra->start = max(ra->start, vmf->vma->vm_pgoff);
> ra->size = ra->ra_pages;
> ra->async_size = ra->ra_pages / 4;
> ra->order = 0;
> @@ -3438,6 +3441,7 @@ static struct file *do_async_mmap_readahead(struct vm_fault *vmf,
> }
>
> if (folio_test_readahead(folio)) {
> + ractl.max_index = vmf->vma->vm_pgoff + vma_pages(vmf->vma) - 1;
> fpin = maybe_unlock_mmap_for_io(vmf, fpin);
> page_cache_async_ra(&ractl, folio, ra->ra_pages);
> }
> diff --git a/mm/readahead.c b/mm/readahead.c
> index 7b05082c89ea..95a424b2f3a3 100644
> --- a/mm/readahead.c
> +++ b/mm/readahead.c
> @@ -324,6 +324,8 @@ static void do_page_cache_ra(struct readahead_control *ractl,
> return;
>
> end_index = (isize - 1) >> PAGE_SHIFT;
> + if (end_index > ractl->max_index)
> + end_index = ractl->max_index;
> if (index > end_index)
> return;
> /* Don't read past the page containing the last byte of the file */
> @@ -471,7 +473,8 @@ void page_cache_ra_order(struct readahead_control *ractl,
> pgoff_t start = readahead_index(ractl);
> pgoff_t index = start;
> unsigned int min_order = mapping_min_folio_order(mapping);
> - pgoff_t limit = (i_size_read(mapping->host) - 1) >> PAGE_SHIFT;
> + pgoff_t limit = min_t(pgoff_t, (i_size_read(mapping->host) - 1) >> PAGE_SHIFT,
> + ractl->max_index);
> pgoff_t mark = index + ra->size - ra->async_size;
> unsigned int nofs;
> int err = 0;
>
> base-commit: db2a1695b2b6feb071b47b72e61d0359bf1524bf
> --
> 2.54.0.rc1.555.g9c883467ad-goog
>
--
Jan Kara <jack@xxxxxxxx>
SUSE Labs, CR