Re: [RFC PATCH 1/1] mm/filemap: tighten mmap_miss hit accounting

From: Jan Kara

Date: Mon Apr 27 2026 - 09:25:36 EST

On Mon 27-04-26 10:22:43, fujunjie wrote:
> file->f_ra.mmap_miss is used to stop mmap readahead after repeated
> misses. filemap_fault() increases it when synchronous mmap readahead is
> needed, while filemap_map_pages() reduces it when fault-around finds
> folios already present in the page cache.
>
> The hit side of that accounting is too generous in two cases.
>
> First, fault-around can install PTEs for multiple pages around the
> faulting address. The fault only proves that the faulting address was
> accessed, not that the nearby PTEs will be used by the workload.
> Crediting all of those nearby PTEs as mmap hits can make sparse random
> access look like successful mmap readahead and keep mmap readahead
> enabled for longer than intended.
>
> Second, a fault that misses in the page cache can start synchronous mmap
> readahead, drop the mmap_lock, and return VM_FAULT_RETRY. The retry may
> then find the folio that this same fault pulled into the page cache. If
> filemap_map_pages() credits that retry as a hit, the same miss can
> immediately cancel its own mmap_miss increase.
>
> Only credit one mmap hit when filemap_map_pages() actually maps the
> faulting address. Also skip the credit on FAULT_FLAG_TRIED retries.
> Keep the existing workingset behavior: recently refaulted folios still
> do not reduce mmap_miss.
>
> Current evidence comes from a local KVM/data-disk microbenchmark using
> mmap_miss_probe. In an 8 GiB guest with 2 vCPUs, a 20 GiB file,
> 8192 KiB read_ahead_kb, cold page cache before each run, and 1% of the
> file accessed, the median of 3 runs changed as follows. This is file
> cache capacity pressure from the file being larger than guest memory; no
> separate memory hog was used.
>
> mmap_miss_probe is a small userspace benchmark used only for these
> measurements. It mmap()s a prepared file with MADV_NORMAL and then
> touches one byte at selected base-page offsets; the access order is
> random, sequential, or a fixed page stride. The harness drops caches
> before each run and samples /proc/vmstat around that access loop.
>
> Each case used a fresh temporary qcow2 data disk, seen by the guest as
> /dev/vda, formatted as ext4 and mounted at /mnt/mmap-matrix.
>
> Each before/after entry is "pgpgin GiB / elapsed seconds". "pgpgin GiB"
> is the delta of the guest /proc/vmstat pgpgin counter, converted from
> KiB to GiB; I use it as an approximate block input counter, not as
> resident memory or exact application IO. "Elapsed seconds" is the
> wall-clock runtime of the whole mmap_miss_probe access pass, not
> per-access latency.
>
> workload     before                     after
> random       223.377 GiB/101.293s       1.010 GiB/4.790s
> stride1021   204.214 GiB/97.557s        204.208 GiB/108.086s
> stride2053   409.584 GiB/193.700s       0.970 GiB/3.685s
> stride4099   406.452 GiB/134.241s       0.975 GiB/3.499s
> sequential   0.212 GiB/0.050s           0.212 GiB/0.057s
>
> The same 8 GiB guest with a 4 GiB file, so the file fits in memory,
> showed the same direction for sparse random access without file-cache
> reclaim pressure:
>
> workload     before                     after
> random       3.987 GiB/1.960s           0.980 GiB/1.221s
> stride1021   4.002 GiB/1.838s           4.002 GiB/1.851s
> stride2053   3.991 GiB/1.835s           0.811 GiB/0.985s
> stride4099   4.001 GiB/1.836s           0.819 GiB/1.037s
> sequential   0.056 GiB/0.013s           0.056 GiB/0.018s
>
> This RFC does not claim to solve every sparse pattern. In particular,
> the stride1021 rows above are intentionally included: the 20 GiB run is
> still about 204 GiB of pgpgin.
>
> In the table, strideN means that the benchmark advances by N base pages
> between mmap loads. Thus stride1021 is 1021 * 4 KiB = 4084 KiB. With
> 8192 KiB read_ahead_kb, file->f_ra.ra_pages is 2048 base pages, and
> synchronous mmap read-around uses a 2048-page window centered around the
> fault, i.e. roughly [index - 1024, index + 1023]. A stride1021 access
> therefore lands inside the previous read-around window. About every
> other access can be a real faulting-address page-cache hit, and the
> other half can each read about 8 MiB. For about 52k accesses in the
> 20 GiB/1% run, half of them times 8 MiB is about 205 GiB, which matches
> the observed 204 GiB. This first version keeps the scope intentionally
> limited to mmap_miss hit accounting.
>
> Signed-off-by: fujunjie <fujunjie1@xxxxxx>
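
As a quick sanity check of the stride1021 estimate quoted above, here is a
minimal userspace sketch of the arithmetic. The helper names are
hypothetical and this is not kernel code; it only assumes what the
description states: 4 KiB base pages, a 2048-page read-around window
covering roughly [index - 1024, index + 1023], and every second access
reading one full 8 MiB window.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define KIB 1024ULL
#define MIB (1024ULL * KIB)
#define GIB (1024ULL * MIB)

/* Number of base-page touches when `percent` of the file's pages are accessed. */
static uint64_t touched_pages(uint64_t file_bytes, uint64_t page_bytes,
			      unsigned int percent)
{
	assert(page_bytes != 0);
	return file_bytes / page_bytes * percent / 100;
}

/*
 * Does the next access (index + stride_pages) still land inside a window of
 * ra_pages pages that covers [index - ra_pages/2, index + ra_pages/2 - 1]?
 */
static bool stride_inside_window(uint64_t stride_pages, uint64_t ra_pages)
{
	return stride_pages <= ra_pages / 2 - 1;
}

/* Estimated bytes read if every second access pulls in one full window. */
static uint64_t estimated_read_bytes(uint64_t accesses, uint64_t window_bytes)
{
	return accesses / 2 * window_bytes;
}
```

With a 20 GiB file and 1% of it touched this gives about 52k accesses and
an estimate a little under 205 GiB, which is in line with the roughly
204 GiB reported for stride1021. Under the same window assumption,
stride2053 and stride4099 fall outside the previous window.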

Thanks for the patch! I agree with the changes in the logic. In fact I
proposed something very similar some time ago [1], but it did not fix the
problem reported back then and I never got around to checking whether
other workloads would benefit. So kudos to you for doing that. Some
comments on the implementation:

1) These are two separate logical changes - handling of filemap_map_pages()
and handling of FAULT_FLAG_TRIED. Please create two separate patches for
them.

2) After changing the mmap_miss logic in filemap_map_pages() there's no
need for the odd propagation of the mmap_miss variable to
filemap_map_order0_folio() and filemap_map_folio_range(). Now you are
guaranteed to need to update mmap_miss by at most 1. So I think you
should just drop the mmap_miss argument, check the return value of
filemap_map_order0_folio() / filemap_map_folio_range() in
filemap_map_pages() and, based on it, update file->f_ra.mmap_miss if
appropriate. Much simpler.
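
To illustrate point 2, here is a rough userspace model of the caller-side
accounting. It is a sketch only, not a proposed hunk: every model_* name
and MODEL_* flag is a hypothetical stand-in for the kernel's types and
flags, and it assumes the helper's return value says whether the faulting
address was mapped and that the credit lowers mmap_miss by one, saturating
at zero.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for VM_FAULT_NOPAGE and FAULT_FLAG_TRIED. */
#define MODEL_FAULT_NOPAGE 0x1u
#define MODEL_FLAG_TRIED   0x2u

/* Stand-in for the mmap_miss counter in file->f_ra. */
struct model_ra {
	unsigned int mmap_miss;
};

/* Stand-in for the one folio property the accounting looks at. */
struct model_folio {
	bool workingset;	/* recently refaulted */
};

/*
 * Called once per folio after the map helper returned map_ret.
 * At most one hit is credited, and only if:
 *  - the helper mapped the faulting address (NOPAGE set),
 *  - this is not a retry of a fault that already started readahead,
 *  - the folio is not a recently refaulted (workingset) folio.
 */
static void model_credit_hit(struct model_ra *ra, unsigned int map_ret,
			     unsigned int fault_flags,
			     const struct model_folio *folio)
{
	assert(ra && folio);

	if (!(map_ret & MODEL_FAULT_NOPAGE))
		return;
	if (fault_flags & MODEL_FLAG_TRIED)
		return;
	if (folio->workingset)
		return;
	if (ra->mmap_miss > 0)
		ra->mmap_miss--;
}
```

In the real loop the return value is accumulated across folios, so the
check would have to use the return value for the folio that actually
covers vmf->address, not the accumulated one.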

[1] https://lore.kernel.org/all/20240201173130.frpaqpy7iyzias5j@quack3/

Honza

> diff --git a/mm/filemap.c b/mm/filemap.c
> index 4e636647100c1..463cd19c49f09 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -3757,6 +3757,7 @@ static vm_fault_t filemap_map_folio_range(struct vm_fault *vmf,
> unsigned int count = 0;
> pte_t *old_ptep = vmf->pte;
> unsigned long addr0;
> + bool fault_mapped = false;
>
> /*
> * Map the large folio fully where possible:
> @@ -3780,16 +3781,6 @@ static vm_fault_t filemap_map_folio_range(struct vm_fault *vmf,
> if (PageHWPoison(page + count))
> goto skip;
>
> - /*
> - * If there are too many folios that are recently evicted
> - * in a file, they will probably continue to be evicted.
> - * In such situation, read-ahead is only a waste of IO.
> - * Don't decrease mmap_miss in this scenario to make sure
> - * we can stop read-ahead.
> - */
> - if (!folio_test_workingset(folio))
> - (*mmap_miss)++;
> -
> /*
> * NOTE: If there're PTE markers, we'll leave them to be
> * handled in the specific fault path, and it'll prohibit the
> @@ -3806,8 +3797,10 @@ static vm_fault_t filemap_map_folio_range(struct vm_fault *vmf,
> *rss += count;
> folio_ref_add(folio, count - ref_from_caller);
> ref_from_caller = 0;
> - if (in_range(vmf->address, addr, count * PAGE_SIZE))
> + if (in_range(vmf->address, addr, count * PAGE_SIZE)) {
> ret = VM_FAULT_NOPAGE;
> + fault_mapped = true;
> + }
> }
>
> count++;
> @@ -3822,8 +3815,10 @@ static vm_fault_t filemap_map_folio_range(struct vm_fault *vmf,
> *rss += count;
> folio_ref_add(folio, count - ref_from_caller);
> ref_from_caller = 0;
> - if (in_range(vmf->address, addr, count * PAGE_SIZE))
> + if (in_range(vmf->address, addr, count * PAGE_SIZE)) {
> ret = VM_FAULT_NOPAGE;
> + fault_mapped = true;
> + }
> }
>
> vmf->pte = old_ptep;
> @@ -3831,6 +3826,10 @@ static vm_fault_t filemap_map_folio_range(struct vm_fault *vmf,
> /* Locked folios cannot get truncated. */
> folio_ref_dec(folio);
>
> + if (fault_mapped && !(vmf->flags & FAULT_FLAG_TRIED) &&
> + !folio_test_workingset(folio))
> + (*mmap_miss)++;
> +
> return ret;
> }
>
> @@ -3844,10 +3843,6 @@ static vm_fault_t filemap_map_order0_folio(struct vm_fault *vmf,
> if (PageHWPoison(page))
> goto out;
>
> - /* See comment of filemap_map_folio_range() */
> - if (!folio_test_workingset(folio))
> - (*mmap_miss)++;
> -
> /*
> * NOTE: If there're PTE markers, we'll leave them to be
> * handled in the specific fault path, and it'll prohibit
> @@ -3856,8 +3851,12 @@ static vm_fault_t filemap_map_order0_folio(struct vm_fault *vmf,
> if (!pte_none(ptep_get(vmf->pte)))
> goto out;
>
> - if (vmf->address == addr)
> + if (vmf->address == addr) {
> ret = VM_FAULT_NOPAGE;
> + if (!(vmf->flags & FAULT_FLAG_TRIED) &&
> + !folio_test_workingset(folio))
> + (*mmap_miss)++;
> + }
>
> set_pte_range(vmf, folio, page, 1, addr);
> (*rss)++;
> --
> 2.34.1
>
--
Jan Kara <jack@xxxxxxxx>
SUSE Labs, CR