[RFC PATCH 0/1] mm/filemap: tighten mmap_miss hit accounting

From: fujunjie

Date: Mon Apr 27 2026 - 06:25:02 EST

Hi,

This RFC explores a narrow mmap readahead accounting issue in
filemap_map_pages().

Today, mmap_miss is increased when synchronous mmap readahead is needed,
and decreased when filemap_map_pages() maps folios that are already in
the page cache. The decrease side can over-credit hits in two cases:

  - fault-around installs nearby PTEs even though the fault only proves
    that the faulting address was accessed;
  - after synchronous mmap readahead returns VM_FAULT_RETRY, the retry
    can find the folio brought in by the same miss and immediately
    cancel that miss.

This first RFC keeps the scope intentionally conservative:

  - only credit a hit when filemap_map_pages() maps the actual faulting
    address;
  - do not credit FAULT_FLAG_TRIED retries as mmap hits;
  - keep the existing workingset-folio behavior unchanged;
  - do not change async mmap readahead hit accounting.
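
To make the two rules concrete, here is a toy userspace model of the
counter, not the kernel code. The function names, the fault-around
width of 16 pages, and the FAULT_FLAG_TRIED bit value are assumptions
for illustration only, and workingset handling is not modelled. It
treats each sparse fault as one synchronous miss followed by one
FAULT_FLAG_TRIED retry that finds the folio in the page cache.

```python
FAULT_FLAG_TRIED = 0x20  # illustrative bit only; the real value lives in the kernel headers


def hit_credit(mapped_pages, maps_fault_addr, flags, tightened):
    """Toy model: how many hits one filemap_map_pages() call credits."""
    if not tightened:
        # Accounting described as "today": every mapped page-cache page counts.
        return mapped_pages
    if flags & FAULT_FLAG_TRIED:
        return 0  # retry of a fault that was already counted as a miss
    return 1 if maps_fault_addr else 0


def run_sparse_faults(n_faults, fault_around_pages, tightened):
    """Each sparse fault is one synchronous miss followed by one TRIED retry."""
    mmap_miss = 0
    for _ in range(n_faults):
        mmap_miss += 1  # synchronous mmap readahead was needed
        credit = hit_credit(fault_around_pages, True, FAULT_FLAG_TRIED, tightened)
        mmap_miss = max(0, mmap_miss - credit)  # the counter never goes negative
    return mmap_miss


print("today:    ", run_sparse_faults(200, 16, tightened=False))
print("tightened:", run_sparse_faults(200, 16, tightened=True))
```

In this model the counter stays at zero with today's accounting,
because every retry cancels its own miss, and grows with the number of
sparse faults once retries are no longer credited.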

The current evidence suggests that the change helps sparse random mmap
access and sparse strides that do not geometrically overlap with the
read-around window. The main data set is a local KVM/data-disk
microbenchmark using mmap_miss_probe, with an 8 GiB guest, 2 vCPUs,
8192 KiB read_ahead_kb, cold page cache before each run, and medians
from 3 runs.

mmap_miss_probe is a small userspace benchmark used only for these
measurements. It mmap()s a prepared file with MADV_NORMAL and then
touches one byte at selected base-page offsets; the access order is
random, sequential, or a fixed page stride. The harness drops caches
before each run and samples /proc/vmstat around that access loop.
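
For readers who want to reproduce the access pattern, the following is a
minimal sketch of such a probe, not the actual mmap_miss_probe source.
The function names are hypothetical, and wrapping stride offsets at the
end of the file is my assumption so that the pass stays inside the
mapping.

```python
import mmap
import random

PAGE = mmap.PAGESIZE  # base page size of the running system


def page_offsets(n_pages, count, order, stride=1, seed=0):
    """Return `count` base-page indexes in the requested access order."""
    if order == "sequential":
        return list(range(count))
    if order == "random":
        return random.Random(seed).sample(range(n_pages), count)
    if order == "stride":
        # Assumption: wrap at end of file so every offset is inside the mapping.
        return [(i * stride) % n_pages for i in range(count)]
    raise ValueError(f"unknown order: {order}")


def touch_pages(path, offsets):
    """mmap() the file read-only and load one byte at each page offset."""
    total = 0
    with open(path, "rb") as f, \
            mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        # MADV_NORMAL is only exposed on some platforms and Python versions.
        if hasattr(mm, "madvise") and hasattr(mmap, "MADV_NORMAL"):
            mm.madvise(mmap.MADV_NORMAL)
        for idx in offsets:
            total += mm[idx * PAGE]  # one-byte load; faults the page in if needed
    return total
```

A run would then be something like
touch_pages(path, page_offsets(n_pages, n_pages // 100, "stride", stride=2053)),
with the page cache dropped beforehand.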

Here "pressure" means file-cache capacity pressure from a 20 GiB file in
an 8 GiB guest. It is not an extra memhog workload. The fit-in-memory
case uses a 4 GiB file in the same 8 GiB guest.

Each case used a fresh temporary qcow2 data disk, seen by the guest as
/dev/vda, formatted as ext4 and mounted at /mnt/mmap-matrix.

Each result is "pgpgin GiB / elapsed seconds". "pgpgin GiB" is the
delta of the guest /proc/vmstat pgpgin counter, converted from KiB to
GiB; I use it as an approximate block input counter, not as resident
memory or exact application IO. "Elapsed seconds" is the wall-clock
runtime of the whole mmap_miss_probe access pass, not per-access
latency.
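
The pgpgin conversion can be sketched as below. The helper names are
hypothetical; the only assumptions are that /proc/vmstat is sampled as
text before and after the access loop and that pgpgin is reported in
KiB, as described above.

```python
def read_pgpgin_kib(vmstat_text):
    """Extract the pgpgin counter (in KiB) from /proc/vmstat contents."""
    for line in vmstat_text.splitlines():
        name, _, value = line.partition(" ")
        if name == "pgpgin":
            return int(value)
    raise KeyError("pgpgin not found")


def pgpgin_delta_gib(before_text, after_text):
    """Delta of pgpgin between two samples, converted from KiB to GiB."""
    delta_kib = read_pgpgin_kib(after_text) - read_pgpgin_kib(before_text)
    return delta_kib / (1024 * 1024)
```

With the two samples taken via open("/proc/vmstat").read(), the reported
figure is pgpgin_delta_gib(before, after).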

For the 20 GiB pressure case with 1% of pages accessed:

  workload     before                  after
  random       223.377 GiB/101.293s    1.010 GiB/4.790s
  stride1021   204.214 GiB/97.557s     204.208 GiB/108.086s
  stride2053   409.584 GiB/193.700s    0.970 GiB/3.685s
  stride4099   406.452 GiB/134.241s    0.975 GiB/3.499s
  sequential   0.212 GiB/0.050s        0.212 GiB/0.057s

For the 4 GiB fit-in-memory case in the same 8 GiB guest:

  workload     before                  after
  random       3.987 GiB/1.960s        0.980 GiB/1.221s
  stride1021   4.002 GiB/1.838s        4.002 GiB/1.851s
  stride2053   3.991 GiB/1.835s        0.811 GiB/0.985s
  stride4099   4.001 GiB/1.836s        0.819 GiB/1.037s
  sequential   0.056 GiB/0.013s        0.056 GiB/0.018s

The same 20 GiB pressure setup also has an ablation. P1 is only the
faulting-address hit accounting change. P2-only is only the
FAULT_FLAG_TRIED retry filter. P1+P2 is this RFC. A representative
subset of that ablation is:

  workload     variant    result
  random       baseline   223.377 GiB/101.293s
  random       P1         223.268 GiB/98.481s
  random       P2-only    223.257 GiB/100.091s
  random       P1+P2      1.010 GiB/4.790s
  stride2053   baseline   409.584 GiB/193.700s
  stride2053   P1         409.584 GiB/197.645s
  stride2053   P2-only    15.722 GiB/5.485s
  stride2053   P1+P2      0.970 GiB/3.685s
  sequential   baseline   0.212 GiB/0.050s
  sequential   P1         0.212 GiB/0.046s
  sequential   P2-only    0.212 GiB/0.050s
  sequential   P1+P2      0.212 GiB/0.057s

This supports keeping the RFC scoped to the two accounting changes:
P1 alone stayed effectively at baseline, while P2-only helped large
sparse strides under pressure but left random access at baseline-level
IO. Only the combination brought both down to about 1 GiB.
I also tried variants that changed async mmap readahead and workingset
handling; in this data set they tracked P1+P2 closely, so I left them
out of this RFC.

Current evidence does not establish that this solves every sparse
pattern. The stride1021 rows above are intentionally included: the
20 GiB run still reads about 204 GiB.

In the table, strideN means that the benchmark advances by N base pages
between mmap loads. Thus stride1021 is 1021 * 4 KiB = 4084 KiB. With
8192 KiB read_ahead_kb, file->f_ra.ra_pages is 2048 base pages, and
synchronous mmap read-around uses a 2048-page window centered around the
fault, i.e. roughly [index - 1024, index + 1023]. Because 1021 < 1023, a
stride1021 access that follows a miss lands inside the window that miss
just read. About every other access can therefore be a real
faulting-address page-cache hit, and each of the remaining accesses can
read about 8 MiB. The 20 GiB/1% run makes about 52k accesses; half of
them times 8 MiB is about 205 GiB, which matches the observed 204 GiB.
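
The geometry can be checked with a short idealized model. It is only a
sketch under stated assumptions: page offsets grow monotonically,
nothing is evicted, read-around stays enabled for every miss, and each
miss reads the full window. The function names are hypothetical.

```python
PAGE_KIB = 4            # base page size assumed in the text
READ_AHEAD_KIB = 8192   # read_ahead_kb used for the measurements
RA_PAGES = READ_AHEAD_KIB // PAGE_KIB  # 2048 base pages


def count_window_misses(stride, n_accesses, ra_pages=RA_PAGES):
    """Count accesses that fall outside every earlier read-around window.

    Idealized: offsets only grow, nothing is evicted, and each miss reads
    the full window [index - ra_pages/2, index + ra_pages/2 - 1].
    """
    cached_upto = -1  # highest page index read so far
    misses = 0
    for i in range(n_accesses):
        index = i * stride
        if index > cached_upto:
            misses += 1
            cached_upto = index + ra_pages // 2 - 1
    return misses


def estimated_read_gib(misses, read_ahead_kib=READ_AHEAD_KIB):
    """Estimated input if every miss reads one full read-around window."""
    return misses * read_ahead_kib / (1024 * 1024)


file_pages = 20 * 1024 * 1024 // PAGE_KIB   # 20 GiB file in 4 KiB pages
n_accesses = file_pages // 100              # 1% of pages, about 52k

for stride in (1021, 2053, 4099):
    misses = count_window_misses(stride, n_accesses)
    print(stride, misses, round(estimated_read_gib(misses), 1))
```

Under these assumptions stride1021 misses on every other access (26214
misses, about 204.8 GiB), while stride2053 and stride4099 miss on every
access (about 409.6 GiB), in line with the "before" column of the 20 GiB
table.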

Feedback on the accounting boundary and on suitable test coverage would
be useful.

I will be travelling next week, so I may be slow to reply.

Best regards.

fujunjie

fujunjie (1):
  mm/filemap: tighten mmap_miss hit accounting

 mm/filemap.c | 33 ++++++++++++++++-----------------
 1 file changed, 16 insertions(+), 17 deletions(-)

base-commit: 1b55f8358e35a67bf3969339ea7b86988af92f66
--
2.34.1