Re: [RFC] mm/mincore: present-PTE scan cost after pte_batch_hint() batching

From: Barry Song

Date: Tue Jun 09 2026 - 06:03:17 EST

On Tue, Jun 9, 2026 at 5:01 PM David Hildenbrand (Arm) <david@xxxxxxxxxx> wrote:
>
> On 6/9/26 09:26, Chengfeng Lin wrote:
> > Hi,
>
> Hi,
>
> >
> > I found a source-calibrated synthetic mincore() signal in the resident
> > base-page PTE path.
>
> sorry, I'm confused. Did you mean to say "I found a performance regression" ?

I guess so.

>
> > I do not currently have an easy arm64/mTHP validation
> > setup, so before trying to arrange that more expensive validation I would like
> > to ask whether the candidate fix shape below looks reasonable.
> >
> > To keep the scope clear, I am not presenting this as a production application
> > regression report or as a generic mincore() regression. It is a controlled
> > reproducer for a real userspace-visible syscall path, with the page-table shape
> > kept intentionally simple:
> >
> >   mmap() private anonymous memory
> >   madvise(MADV_NOHUGEPAGE)
> >   fault in all pages
> >   repeatedly call mincore() over a resident 64 MiB range
>
> Okay, so I assume a mincore() regression. On arm64?
>
> >
> > The practical hook is that mincore() is the userspace-visible residency query
> > for an address range. The resident anonymous no-THP range is intended to
> > isolate the base-page present-PTE scan and avoid file cache, swap, THP, marker,
> > and unmapped-range effects. I would read the result as source-path evidence for
> > the hot path below, not as evidence that every mincore() caller or a specific
> > application workload regressed.
>
> This reads very obscure and cryptic. Was that written by, or translated by an LLM?
>
> The way it's phrased makes it a bit hard to digest.

I really suggest we first write in plain, imperfect English and then
use an LLM to refine it.

A direct translation can sometimes make it very hard for readers to
understand.

I think I share the same native language as Chengfeng, but I still
find this email difficult to read.

>
> >
> > The intended hot path is:
> >
> >   mincore()
> >     -> walk_page_range()
> >       -> mincore_pte_range()
> >
> > The main metric is mincore_ns_per_1k_pages, lower is better. It is the
> > wall-clock time spent in the mincore() scan, normalized by the number of pages
> > covered by the range and reported as nanoseconds per 1000 pages scanned.
> >
> > As a release-level starting point, the matched-PREEMPT 1/2/4 CPU bridge below
> > uses original release kernels, QEMU direct boot, 9 repetitions, coverage
> > disabled, and the same CONFIG_ADVISE_SYSCALLS setup:

QEMU sometimes produces highly distorted performance numbers.
It would be better to re-test on physical CPUs.

> >
> > scenario: no_thp_pte_scan_64m
> > metric: mincore_ns_per_1k_pages, lower is better
> >
> > CPU   v6.12.77    v6.18.19    v6.19.9     v7.0.9
> > 1     12827.667   15677.444   16482.667   16726.333
> > 2     13628.444   16102.333   18256.889   17270.333
> > 4     13798.222   16739.333   18892.111   17068.222
>
> Okay, so we see two steps of "degradation". I assume this code is so performance
> sensitive that even compiler changes might easily affect it. Because all we do
> is scan page tables for present entries.
>
> The mincore optimization went into v6.16.
>
> >
> > This shows cumulative cost relative to v6.12 in the primary 1/2/4 CPU matrix.
> > I also reran the 8CPU/16CPU release-level bridge on the same scenario. These
> > rows show the same general direction, but the shared lab was busy during this
> > rerun and the high-CPU rows have higher CV, so I include them as extended
> > context only:
> >
> > CPU/mem     v6.12.77    v6.18.19    v6.19.9     v7.0.9
> > 8/16 GiB    17251.889   23335.556   21863.556   21664.778
> > 16/32 GiB   16697.333   21428.333   21629.778   21628.333
>
> I don't think measuring concurrency here really makes a lot of sense.
>
> Especially, as it's becoming a rather weird, unrealistic micro-benchmark that way.

I'm not sure multithreading is involved here, since the reproducer
is single-threaded:

https://github.com/lcf0399/linux-mm-regression-evidence-2026-05/blob/main/mincore-present-pte-scan/reproducer/mincore_present_pte_scan.c
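
For readers who do not want to open the link, the four steps quoted earlier (mmap, MADV_NOHUGEPAGE, fault in, repeated mincore()) can be sketched in a single-threaded helper like the one below. This is only an illustration and is not copied from the linked reproducer: the function name mincore_ns_per_1k_pages() and its parameters are hypothetical, and the metric is computed the way the quoted text describes it (wall-clock time in the mincore() calls, normalized to nanoseconds per 1000 pages scanned).

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <time.h>
#include <unistd.h>

/*
 * Hypothetical helper (not taken from the linked reproducer): map `len`
 * bytes of private anonymous memory, opt out of THP, fault in every page,
 * then time `reps` mincore() calls over the whole range.
 * Returns nanoseconds per 1000 pages scanned, or -1.0 on any failure.
 */
static double mincore_ns_per_1k_pages(size_t len, int reps)
{
	long page_size = sysconf(_SC_PAGESIZE);
	struct timespec t0, t1;
	unsigned char *buf, *vec;
	size_t page, npages, i;
	double ns, result = -1.0;
	int r;

	if (page_size <= 0 || len == 0 || reps <= 0)
		return -1.0;
	page = (size_t)page_size;
	npages = (len + page - 1) / page;

	buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (buf == MAP_FAILED)
		return -1.0;
	vec = malloc(npages);
	if (!vec)
		goto out_unmap;

	/* Best effort: this can fail on kernels built without THP. */
	(void)madvise(buf, len, MADV_NOHUGEPAGE);

	/* Write one byte per page so that every PTE becomes present. */
	for (i = 0; i < len; i += page)
		buf[i] = 1;

	if (clock_gettime(CLOCK_MONOTONIC, &t0))
		goto out_free;
	for (r = 0; r < reps; r++)
		if (mincore(buf, len, vec))
			goto out_free;
	if (clock_gettime(CLOCK_MONOTONIC, &t1))
		goto out_free;

	/* Sanity check: every page should be reported as resident. */
	for (i = 0; i < npages; i++)
		if (!(vec[i] & 1))
			goto out_free;

	ns = (double)(t1.tv_sec - t0.tv_sec) * 1e9 +
	     (double)(t1.tv_nsec - t0.tv_nsec);
	result = ns * 1000.0 / ((double)npages * (double)reps);
out_free:
	free(vec);
out_unmap:
	munmap(buf, len);
	return result;
}
```

Calling mincore_ns_per_1k_pages(64UL << 20, 100) would cover a 64 MiB range as in the quoted scenario; the repetition count of 100 is an arbitrary choice for this sketch. Nothing in this loop creates threads, so the CPU count of the guest only changes the environment around the measurement, not the work being measured.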

>
> >
> > The 16CPU rerun had two QEMU returncode-139 failures in the original 36-run
> > matrix; I filled the missing v6.12/v6.18 samples with a clean two-run
> > supplement. I therefore use the high-CPU rows as context for the release
> > bridge, not as part of the primary matrix.
> >
> > Follow-up release-ladder and A/B testing narrowed the main step to the
> > v6.15 -> v6.16 window. The strongest suspect is:
> >
> > 4df65651f7075 ("mm: mincore: use pte_batch_hint() to batch process large folios")
> >
> > That patch improved the mTHP/large-folio case, but in this base-page resident
> > PTE scan I see a sizeable cost. The original commit message mentioned that
> > base pages did not show an obvious regression, so this may simply be a
> > different x86/base-page corner than the original arm64/mTHP test.
>
> Okay, so it is on x86 then?
>
> On x86, pte_batch_hint() is hard-coded at 1, so the expectation is that the loop
> and everything should get completely optimized out.
>
> >
> > For the v6.16 introduction-window A/B, all rows below are lab QEMU direct boot,
> > 9 repetitions, coverage disabled, same PREEMPT and CONFIG_ADVISE_SYSCALLS
> > setup. The 1/2/4 CPU rows are the primary matrix; the 8CPU/16GiB and
> > 16CPU/32GiB rows are the high-CPU follow-up:

I don’t really understand what the “primary matrix” is, or what
“high-CPU” means.

> >
> > scenario: no_thp_pte_scan_64m
> > metric: mincore_ns_per_1k_pages, lower is better
> >
> > CPU/mem     v6.15       v6.16       v6.16 batch<=1 fastpath   v6.16 nobatch
> > 1           12946.889   17117.667   14560.556                 13843.222
> > 2           15053.111   18214.667   15714.778                 14270.556
> > 4           14942.000   18338.222   14397.889                 14719.667
> > 8/16 GiB    15046.444   17540.222   13696.333                 13200.000
> > 16/32 GiB   14674.111   18928.889   13949.000                 15351.111
> >
> > The high-CPU matrix completed 72/72 with all_cpu_match=true,
> > any_noapic=false, all_autorun_exit0=true, and all_semantic_ok=true. One v6.15
> > 16CPU timing sample in the main matrix was an obvious outlier, so the 16/32 GiB
> > v6.15 value above uses a clean v6.15-only 9-repeat supplement.
> >
> > I also ran ftrace attribution on the same path as mechanism evidence, not as
> > clean timing. In that run, v6.16 original had a higher mincore_pte_range
> > average than v6.15, v6.16-nobatch, and the batch<=1 fastpath:
> >
> > kernel                    mincore_pte_range avg_us
> > v6.15-mainline-preempt    6.040
> > v6.16-mainline-preempt    7.899
> > v6.16-mainline-nobatch    6.031
> > v6.16-mainline-fastpath   6.103
> >
> > The smaller batch<=1 fastpath helped, but later v6.18 testing suggested the
> > remaining cost was more about the hot present-PTE branch layout. The candidate
> > shape I tested is to check pte_present() first, while keeping pte_batch_hint()
> > for batch > 1:
> >
> >     if (pte_present(pte)) {
> >         batch = pte_batch_hint(ptep, pte);
> >         if (batch > 1)
> >             fill vec[0..step-1];
> >         else
> >             *vec = 1;
> >     } else if (pte_none(pte) || pte_is_marker(pte)) {
> >         __mincore_unmapped_range(...);
> >     } else {
> >         mincore_swap(...);
> >     }
> >
> > On x86, pte_batch_hint() defaults to 1, so this mainly measures the
> > resident-PTE hot path layout. On arm64 the batch > 1 path should still be
> > preserved, but I have not validated mTHP/contiguous-PTE performance yet.

Is the performance improvement mainly because the tested PTEs are always
present, so that the patch avoids some conditional branches?
https://github.com/lcf0399/linux-mm-regression-evidence-2026-05/blob/main/mincore-present-pte-scan/patches/mincore-present-first-fastpath-rfc.patch

In any case, it looks like the gain does not come from fixing an actual
issue in the existing logic.
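
To make that point concrete, here is a small userspace model of the two branch orderings. It is a sketch under stated assumptions, not kernel code: the enum, toy_batch_hint(), and both scan functions are hypothetical names; the hint is fixed at 1, as David describes for x86; unmapped, marker, and swap entries are all simplified to "not resident"; and the "none first" ordering is my reading of what the candidate patch changes.

```c
#include <stddef.h>
#include <string.h>

/* Toy stand-ins for kernel PTE states; every name here is hypothetical. */
enum toy_pte { TOY_NONE, TOY_MARKER, TOY_PRESENT, TOY_SWAP };

/* Models an architecture without a batching hint: always 1. */
static inline unsigned int toy_batch_hint(enum toy_pte pte)
{
	(void)pte;
	return 1;
}

/* Number of vec entries to fill for a present entry, clamped to the range. */
static size_t toy_present_step(enum toy_pte pte, size_t remaining)
{
	unsigned int batch = toy_batch_hint(pte);

	if (batch > 1)
		return batch < remaining ? batch : remaining;
	return 1;
}

/* Ordering assumed for the existing code: none/marker is tested first. */
static void scan_none_first(const enum toy_pte *ptes, size_t n,
			    unsigned char *vec)
{
	size_t i = 0, step;

	while (i < n) {
		step = 1;
		if (ptes[i] == TOY_NONE || ptes[i] == TOY_MARKER) {
			vec[i] = 0;	/* simplification: nothing cached */
		} else if (ptes[i] == TOY_PRESENT) {
			step = toy_present_step(ptes[i], n - i);
			memset(&vec[i], 1, step);
		} else {
			vec[i] = 0;	/* simplification: swapped out */
		}
		i += step;
	}
}

/* Candidate ordering from the quoted pseudocode: present is tested first. */
static void scan_present_first(const enum toy_pte *ptes, size_t n,
			       unsigned char *vec)
{
	size_t i = 0, step;

	while (i < n) {
		step = 1;
		if (ptes[i] == TOY_PRESENT) {
			step = toy_present_step(ptes[i], n - i);
			memset(&vec[i], 1, step);
		} else if (ptes[i] == TOY_NONE || ptes[i] == TOY_MARKER) {
			vec[i] = 0;	/* simplification: nothing cached */
		} else {
			vec[i] = 0;	/* simplification: swapped out */
		}
		i += step;
	}
}
```

Both functions write the same vec for any input, because the three cases are mutually exclusive. The only difference is how many comparisons an all-present range executes before reaching the store, which is why any gain would be a code-layout effect rather than a logic fix.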

> >
> > The v6.18 confirmation A/B. The 1/2/4 CPU rows are the primary matrix; the
> > 8CPU/16GiB and 16CPU/32GiB rows are the high-CPU follow-up. All rows use the
> > same no-THP scenario, 9 repetitions, and coverage disabled:
>
>
> Which compiler are you using?
>
> The expectation is that the whole code would get optimized on x86 such that the
> behavior is just like before.

Yes.
The difference is likely due to QEMU behavior and some compiler-related effects.

Best Regards
Barry