Re: Re: [RFC] mm/mincore: present-PTE scan cost after pte_batch_hint() batching
From: Chengfeng Lin
Date: Tue Jun 09 2026 - 10:19:10 EST
Hi David,
Thanks, and sorry for the confusing wording; my first mail was too abstract.
Stated plainly: I observed a performance difference in a narrow x86/QEMU
synthetic mincore() case, and after your comment I checked whether it is
really a codegen issue. The benchmark covers only one specific case:

    private anonymous memory
    MADV_NOHUGEPAGE
    faulted/resident base pages
    repeated mincore() over the range

so the measured path should mostly be the present-PTE scan in
mincore_pte_range(). I agree that the 8/16 CPU rows are not very useful for
this path; please treat them as extra context only. The useful data is the
single-threaded / low-CPU v6.15 -> v6.16 A/B and the patched variants.
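
For readers who want to see the shape of the test, here is a minimal userspace
sketch of the scenario and of the mincore_ns_per_1k_pages normalization. It is
not the lab harness itself: the helper names ns_per_1k_pages() and
time_mincore_scan() are made up for illustration, and the range size and
repetition count are left to the caller.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <time.h>
#include <unistd.h>

/* Normalize total scan time to nanoseconds per 1000 pages scanned. */
static double ns_per_1k_pages(double total_ns, size_t npages, int reps)
{
	return total_ns / ((double)npages * (double)reps) * 1000.0;
}

/*
 * Hypothetical harness: time `reps` mincore() calls over `len` bytes of
 * faulted-in private anonymous memory. Returns ns per 1000 pages, or
 * -1.0 on failure.
 */
static double time_mincore_scan(size_t len, int reps)
{
	size_t page = (size_t)sysconf(_SC_PAGESIZE);
	size_t npages = len / page;
	struct timespec t0, t1;
	unsigned char *vec;
	char *buf;
	double ns = -1.0;

	assert(reps > 0 && npages > 0);

	buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (buf == MAP_FAILED)
		return -1.0;
	vec = malloc(npages);
	if (!vec)
		goto out_unmap;

	/* Best effort: this fails with EINVAL on kernels built without THP. */
	(void)madvise(buf, len, MADV_NOHUGEPAGE);

	/* Touch every page so the whole range is resident. */
	for (size_t off = 0; off < len; off += page)
		buf[off] = 1;

	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (int r = 0; r < reps; r++) {
		if (mincore(buf, len, vec) != 0)
			goto out_free;
	}
	clock_gettime(CLOCK_MONOTONIC, &t1);

	ns = ns_per_1k_pages((t1.tv_sec - t0.tv_sec) * 1e9 +
			     (t1.tv_nsec - t0.tv_nsec), npages, reps);
out_free:
	free(vec);
out_unmap:
	munmap(buf, len);
	return ns;
}
```

Calling time_mincore_scan(64UL << 20, reps) would correspond to the 64 MiB
range from my first mail; the sketch does no CPU pinning or warm-up, so its
numbers are not comparable to the lab tables.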
The compiler used for the lab kernels was:

    gcc (Ubuntu 13.3.0-6ubuntu2~24.04.1) 13.3.0
    GNU ld (GNU Binutils for Ubuntu) 2.42

Your point about x86 pte_batch_hint() is exactly the right thing to check.
Since pte_batch_hint() returns 1 on x86, I agree that the expectation would be
for the compiler to optimize the batching logic back down to something very
close to the old base-page path.
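
To make that expectation concrete, here is a userspace mock, only loosely
modeled on the present-PTE branch and not the kernel code. mock_pte_t,
MOCK_PTE_PRESENT, mock_pte_batch_hint(), scan_batched() and scan_unbatched()
are all hypothetical names. With the hint fixed at 1, the batch > 1 branch is
dead code and both loops fill the same vector, so a compiler is free to reduce
the batched loop to the unbatched one.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef unsigned long mock_pte_t;	/* hypothetical stand-in for pte_t */
#define MOCK_PTE_PRESENT 0x1UL

/* Mock of a hint that is a compile-time constant 1 (no contiguous PTEs). */
static inline unsigned int mock_pte_batch_hint(const mock_pte_t *ptep,
					       mock_pte_t pte)
{
	(void)ptep;
	(void)pte;
	return 1;
}

/* Batched shape: step over `batch` entries when the hint is > 1. */
static void scan_batched(const mock_pte_t *ptes, size_t n, unsigned char *vec)
{
	size_t i = 0;

	assert(ptes && vec);
	while (i < n) {
		size_t step = 1;

		if (ptes[i] & MOCK_PTE_PRESENT) {
			size_t batch = mock_pte_batch_hint(&ptes[i], ptes[i]);

			if (batch > 1) {	/* dead when the hint is 1 */
				size_t max_nr = n - i;

				step = batch < max_nr ? batch : max_nr;
			}
			memset(&vec[i], 1, step);
		} else {
			vec[i] = 0;
		}
		i += step;
	}
}

/* Old base-page shape: one entry per iteration, no batching logic. */
static void scan_unbatched(const mock_pte_t *ptes, size_t n,
			   unsigned char *vec)
{
	assert(ptes && vec);
	for (size_t i = 0; i < n; i++)
		vec[i] = (ptes[i] & MOCK_PTE_PRESENT) ? 1 : 0;
}
```

The mock only shows that the two shapes are semantically equivalent when the
hint is 1; whether a given compiler actually emits the same code for both is
exactly what the nm/objdump comparison below checks.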
I checked the generated mincore_pte_range() code with the same GCC/config setup.
The function sizes from nm are:

    v6.15 original:                0x1fb
    v6.16 original:                0x245
    v6.16 batch<=1 fastpath:       0x1ec
    v6.16 with batching removed:   0x1ec

So, with GCC 13.3, the v6.16 original build does not look optimized back to the
old x86 base-page shape. The v6.16 batch<=1 fastpath and the v6.16 nobatch
variant produce the same mincore_pte_range() objdump output in my build.
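
In decimal, the GCC sizes are 507, 581, 492 and 492 bytes, so v6.16 original is
74 bytes larger than v6.15 original and the two patched variants are 15 bytes
smaller. The small sketch below just redoes that arithmetic; size_delta() and
growth_pct() are illustrative helper names, and the only inputs are the nm
values listed above.

```c
#include <assert.h>
#include <stdio.h>

/* Signed size difference in bytes between a build and a baseline. */
static int size_delta(unsigned int size, unsigned int base)
{
	return (int)size - (int)base;
}

/* Relative size change versus the baseline, in percent. */
static double growth_pct(unsigned int size, unsigned int base)
{
	assert(base > 0);
	return 100.0 * (double)size_delta(size, base) / (double)base;
}

/* Print each GCC 13.3 build's size relative to v6.15 original. */
static void print_gcc_deltas(void)
{
	static const struct {
		const char *build;
		unsigned int bytes;	/* mincore_pte_range() size from nm */
	} sizes[] = {
		{ "v6.15 original",          0x1fb },
		{ "v6.16 original",          0x245 },
		{ "v6.16 batch<=1 fastpath", 0x1ec },
		{ "v6.16 nobatch",           0x1ec },
	};
	const unsigned int base = sizes[0].bytes;

	for (size_t i = 0; i < sizeof(sizes) / sizeof(sizes[0]); i++)
		printf("%-26s %4u bytes  %+4d (%+.1f%%)\n", sizes[i].build,
		       sizes[i].bytes, size_delta(sizes[i].bytes, base),
		       growth_pct(sizes[i].bytes, base));
}
```

Function size alone says nothing about speed; I use it only as a quick signal
that the generated code differs before looking at the objdump output.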
I also checked Clang 18.1.3 as a cross-check. With Clang, v6.15 original,
v6.16 original, v6.16 batch<=1 fastpath and v6.16 nobatch all produce the same
mincore_pte_range() size, 0x1f9, and the objdump output is byte-identical.
So your expectation holds with Clang 18.1.3, but not with the GCC 13.3 build I
used for the original lab runs. This does not prove a compiler bug, but it does
mean my original report should be narrowed: it is not a generic x86 mincore()
regression claim. The timing signal I reported came from the GCC-built QEMU lab
kernels, where v6.16 original gets a different mincore_pte_range() shape than
the other checked variants.
I put the compact codegen summary and the relevant nm/objdump snippets here:
https://github.com/lcf0399/linux-mm-regression-evidence-2026-05/tree/a5e3312deafd97321aa99a32772180989949fa59/mincore-present-pte-scan/codegen
Thanks,
Chengfeng
> -----Original Message-----
> From: "David Hildenbrand (Arm)" <david@xxxxxxxxxx>
> Sent: 2026-06-09 17:01:51 (Tuesday)
> To: "Chengfeng Lin" <chengfenglin@xxxxxxxxxxxxxx>, "Andrew Morton" <akpm@xxxxxxxxxxxxxxxxxxxx>
> Cc: "Liam R. Howlett" <liam@xxxxxxxxxxxxx>, "Lorenzo Stoakes" <ljs@xxxxxxxxxx>, "Vlastimil Babka" <vbabka@xxxxxxxxxx>, "Jann Horn" <jannh@xxxxxxxxxx>, "Pedro Falcato" <pfalcato@xxxxxxx>, linux-mm@xxxxxxxxx, linux-kernel@xxxxxxxxxxxxxxx, "Baolin Wang" <baolin.wang@xxxxxxxxxxxxxxxxx>, "Barry Song" <baohua@xxxxxxxxxx>, "Dev Jain" <dev.jain@xxxxxxx>, "Ryan Roberts" <ryan.roberts@xxxxxxx>, "Zi Yan" <ziy@xxxxxxxxxx>
> Subject: Re: [RFC] mm/mincore: present-PTE scan cost after pte_batch_hint() batching
>
> On 6/9/26 09:26, Chengfeng Lin wrote:
> > Hi,
>
> Hi,
>
> >
> > I found a source-calibrated synthetic mincore() signal in the resident
> > base-page PTE path.
>
> sorry, I'm confused. Did you mean to say "I found a performance regression" ?
>
> > I do not currently have an easy arm64/mTHP validation
> > setup, so before trying to arrange that more expensive validation I would like
> > to ask whether the candidate fix shape below looks reasonable.
> >
> > To keep the scope clear, I am not presenting this as a production application
> > regression report or as a generic mincore() regression. It is a controlled
> > reproducer for a real userspace-visible syscall path, with the page-table shape
> > kept intentionally simple:
> >
> > mmap() private anonymous memory
> > madvise(MADV_NOHUGEPAGE)
> > fault in all pages
> > repeatedly call mincore() over a resident 64 MiB range
>
> Okay, so I assume a mincore() regression. On arm64?
>
> >
> > The practical hook is that mincore() is the userspace-visible residency query
> > for an address range. The resident anonymous no-THP range is intended to
> > isolate the base-page present-PTE scan and avoid file cache, swap, THP, marker,
> > and unmapped-range effects. I would read the result as source-path evidence for
> > the hot path below, not as evidence that every mincore() caller or a specific
> > application workload regressed.
>
> This reads very obscure and cryptic. Was that written by, or translated by an LLM?
>
> The way it's phrased makes it a bit hard to digest.
>
> >
> > The intended hot path is:
> >
> > mincore()
> > -> walk_page_range()
> > -> mincore_pte_range()
> >
> > The main metric is mincore_ns_per_1k_pages, lower is better. It is the
> > wall-clock time spent in the mincore() scan, normalized by the number of pages
> > covered by the range and reported as nanoseconds per 1000 pages scanned.
> >
> > As a release-level starting point, the matched-PREEMPT 1/2/4 CPU bridge below
> > uses original release kernels, QEMU direct boot, 9 repetitions, coverage
> > disabled, and the same CONFIG_ADVISE_SYSCALLS setup:
> >
> > scenario: no_thp_pte_scan_64m
> > metric: mincore_ns_per_1k_pages, lower is better
> >
> >   CPU   v6.12.77    v6.18.19    v6.19.9     v7.0.9
> >   1     12827.667   15677.444   16482.667   16726.333
> >   2     13628.444   16102.333   18256.889   17270.333
> >   4     13798.222   16739.333   18892.111   17068.222
>
> Okay, so we see two steps of "degradation". I assume this code is so performance
> sensitive that even compiler changes might easily affect it. Because all we do
> is scan page tables for present entries.
>
> The mincore optimization went into v6.16.
>
> >
> > This shows cumulative cost relative to v6.12 in the primary 1/2/4 CPU matrix.
> > I also reran the 8CPU/16CPU release-level bridge on the same scenario. These
> > rows show the same general direction, but the shared lab was busy during this
> > rerun and the high-CPU rows have higher CV, so I include them as extended
> > context only:
> >
> >   CPU/mem     v6.12.77    v6.18.19    v6.19.9     v7.0.9
> >   8/16 GiB    17251.889   23335.556   21863.556   21664.778
> >   16/32 GiB   16697.333   21428.333   21629.778   21628.333
>
> I don't think measuring concurrency here really makes a lot of sense.
>
> Especially, as it's becoming a rather weird, unrealistic micro-benchmark that way.
>
> >
> > The 16CPU rerun had two QEMU returncode-139 failures in the original 36-run
> > matrix; I filled the missing v6.12/v6.18 samples with a clean two-run
> > supplement. I therefore use the high-CPU rows as context for the release
> > bridge, not as part of the primary matrix.
> >
> > Follow-up release-ladder and A/B testing narrowed the main step to the
> > v6.15 -> v6.16 window. The strongest suspect is:
> >
> > 4df65651f7075 ("mm: mincore: use pte_batch_hint() to batch process large folios")
> >
> > That patch improved the mTHP/large-folio case, but in this base-page resident
> > PTE scan I see a sizeable cost. The original commit message mentioned that
> > base pages did not show an obvious regression, so this may simply be a
> > different x86/base-page corner than the original arm64/mTHP test.
>
> Okay, so it is on x86 then?
>
> On x86, pte_batch_hint() is hard-coded at 1, so the expectation is that the loop
> and everything should get completely optimized out.
>
> >
> > For the v6.16 introduction-window A/B, all rows below are lab QEMU direct boot,
> > 9 repetitions, coverage disabled, same PREEMPT and CONFIG_ADVISE_SYSCALLS
> > setup. The 1/2/4 CPU rows are the primary matrix; the 8CPU/16GiB and
> > 16CPU/32GiB rows are the high-CPU follow-up:
> >
> > scenario: no_thp_pte_scan_64m
> > metric: mincore_ns_per_1k_pages, lower is better
> >
> >   CPU/mem     v6.15       v6.16       v6.16 batch<=1 fastpath   v6.16 nobatch
> >   1           12946.889   17117.667   14560.556                 13843.222
> >   2           15053.111   18214.667   15714.778                 14270.556
> >   4           14942.000   18338.222   14397.889                 14719.667
> >   8/16 GiB    15046.444   17540.222   13696.333                 13200.000
> >   16/32 GiB   14674.111   18928.889   13949.000                 15351.111
> >
> > The high-CPU matrix completed 72/72 with all_cpu_match=true,
> > any_noapic=false, all_autorun_exit0=true, and all_semantic_ok=true. One v6.15
> > 16CPU timing sample in the main matrix was an obvious outlier, so the 16/32 GiB
> > v6.15 value above uses a clean v6.15-only 9-repeat supplement.
> >
> > I also ran ftrace attribution on the same path as mechanism evidence, not as
> > clean timing. In that run, v6.16 original had a higher mincore_pte_range
> > average than v6.15, v6.16-nobatch, and the batch<=1 fastpath:
> >
> >   kernel                    mincore_pte_range avg_us
> >   v6.15-mainline-preempt    6.040
> >   v6.16-mainline-preempt    7.899
> >   v6.16-mainline-nobatch    6.031
> >   v6.16-mainline-fastpath   6.103
> >
> > The smaller batch<=1 fastpath helped, but later v6.18 testing suggested the
> > remaining cost was more about the hot present-PTE branch layout. The candidate
> > shape I tested is to check pte_present() first, while keeping pte_batch_hint()
> > for batch > 1:
> >
> >     if (pte_present(pte)) {
> >         batch = pte_batch_hint(ptep, pte);
> >         if (batch > 1)
> >             fill vec[0..step-1];
> >         else
> >             *vec = 1;
> >     } else if (pte_none(pte) || pte_is_marker(pte)) {
> >         __mincore_unmapped_range(...);
> >     } else {
> >         mincore_swap(...);
> >     }
> >
> > On x86, pte_batch_hint() defaults to 1, so this mainly measures the
> > resident-PTE hot path layout. On arm64 the batch > 1 path should still be
> > preserved, but I have not validated mTHP/contiguous-PTE performance yet.
> >
> > The v6.18 confirmation A/B. The 1/2/4 CPU rows are the primary matrix; the
> > 8CPU/16GiB and 16CPU/32GiB rows are the high-CPU follow-up. All rows use the
> > same no-THP scenario, 9 repetitions, and coverage disabled:
>
>
> Which compiler are you using?
>
> The expectation is that the whole code would get optimized on x86 such that the
> behavior is just like before.
>
>
> --
> Cheers,
>
> David