[RFC] mm/mincore: present-PTE scan cost after pte_batch_hint() batching

From: Chengfeng Lin

Date: Tue Jun 09 2026 - 03:33:27 EST

Hi,

I found a source-calibrated synthetic mincore() signal in the resident
base-page PTE path. I do not currently have an easy arm64/mTHP validation
setup, so before trying to arrange that more expensive validation I would like
to ask whether the candidate fix shape below looks reasonable.

To keep the scope clear, I am not presenting this as a production application
regression report or as a generic mincore() regression. It is a controlled
reproducer for a real userspace-visible syscall path, with the page-table shape
kept intentionally simple:

    mmap() private anonymous memory
    madvise(MADV_NOHUGEPAGE)
    fault in all pages
    repeatedly call mincore() over a resident 64 MiB range
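
A minimal userspace sketch of the four steps above, for readers who want to
try the shape without the bundle. It is not the reproducer from the bundle;
the function name and its parameters are hypothetical, and a madvise()
failure is ignored on the assumption that a kernel without THP support has
nothing to disable. The result is reported in the same unit as the metric used
in this mail, nanoseconds per 1000 pages scanned:

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <time.h>
#include <unistd.h>

/*
 * Hypothetical helper: map npages of private anonymous memory, fault them
 * in, call mincore() iters times, and store ns per 1000 pages in *out.
 * Returns 0 on success, -1 on any failure or if a page is not resident.
 */
int mincore_scan_ns_per_1k_pages(size_t npages, int iters, double *out)
{
	size_t page = (size_t)sysconf(_SC_PAGESIZE);
	size_t len = npages * page;
	struct timespec t0, t1;
	unsigned char *buf, *vec;
	double ns;
	int ret = -1;

	if (npages == 0 || iters <= 0 || !out)
		return -1;

	buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (buf == MAP_FAILED)
		return -1;
	vec = malloc(npages);
	if (!vec)
		goto out_unmap;

	/* Advisory only: fails with EINVAL when THP is not built in. */
	(void)madvise(buf, len, MADV_NOHUGEPAGE);

	/* Touch one byte per page so every PTE is present. */
	for (size_t i = 0; i < npages; i++)
		buf[i * page] = 1;

	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (int i = 0; i < iters; i++)
		if (mincore(buf, len, vec) != 0)
			goto out_free;
	clock_gettime(CLOCK_MONOTONIC, &t1);

	/* Semantic check: bit 0 of each vec entry marks a resident page. */
	for (size_t i = 0; i < npages; i++)
		if (!(vec[i] & 1))
			goto out_free;

	ns = (double)(t1.tv_sec - t0.tv_sec) * 1e9 +
	     (double)(t1.tv_nsec - t0.tv_nsec);
	*out = ns / ((double)npages * (double)iters) * 1000.0;
	ret = 0;

out_free:
	free(vec);
out_unmap:
	munmap(buf, len);
	return ret;
}
```

With 4 KiB pages, the 64 MiB range corresponds to npages = 16384. The
residency check can fail under memory pressure if pages are swapped out
between the fault-in and the scan, so a quiet machine is assumed.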

The practical hook is that mincore() is the userspace-visible residency query
for an address range. The resident anonymous no-THP range is intended to
isolate the base-page present-PTE scan and avoid file cache, swap, THP, marker,
and unmapped-range effects. I would read the result as source-path evidence for
the hot path below, not as evidence that every mincore() caller or a specific
application workload regressed.

The intended hot path is:

    mincore()
      -> walk_page_range()
        -> mincore_pte_range()

The main metric is mincore_ns_per_1k_pages, lower is better. It is the
wall-clock time spent in the mincore() scan, normalized by the number of pages
covered by the range and reported as nanoseconds per 1000 pages scanned.

As a release-level starting point, the matched-PREEMPT 1/2/4 CPU bridge below
uses original release kernels, QEMU direct boot, 9 repetitions, coverage
disabled, and the same CONFIG_ADVISE_SYSCALLS setup:

    scenario: no_thp_pte_scan_64m
    metric: mincore_ns_per_1k_pages, lower is better

    CPU   v6.12.77   v6.18.19    v6.19.9     v7.0.9
    1    12827.667  15677.444  16482.667  16726.333
    2    13628.444  16102.333  18256.889  17270.333
    4    13798.222  16739.333  18892.111  17068.222

This shows cumulative cost relative to v6.12 in the primary 1/2/4 CPU matrix.
I also reran the 8CPU/16CPU release-level bridge on the same scenario. These
rows show the same general direction, but the shared lab was busy during this
rerun and the high-CPU rows have higher CV, so I include them as extended
context only:

    CPU/mem     v6.12.77   v6.18.19    v6.19.9     v7.0.9
    8/16 GiB   17251.889  23335.556  21863.556  21664.778
    16/32 GiB  16697.333  21428.333  21629.778  21628.333

The 16CPU rerun had two QEMU returncode-139 failures in the original 36-run
matrix; I filled the missing v6.12/v6.18 samples with a clean two-run
supplement. I therefore use the high-CPU rows as context for the release
bridge, not as part of the primary matrix.

Follow-up release-ladder and A/B testing narrowed the main step to the
v6.15 -> v6.16 window. The strongest suspect is:

4df65651f7075 ("mm: mincore: use pte_batch_hint() to batch process large folios")

That patch improved the mTHP/large-folio case, but in this base-page resident
PTE scan v6.16 is roughly 17% to 32% slower than v6.15 in the A/B below. The
original commit message mentioned that base pages did not show an obvious
regression, so this may simply be a different x86/base-page corner than the
original arm64/mTHP test.

For the v6.16 introduction-window A/B, all rows below are lab QEMU direct boot,
9 repetitions, coverage disabled, same PREEMPT and CONFIG_ADVISE_SYSCALLS
setup. The 1/2/4 CPU rows are the primary matrix; the 8CPU/16GiB and
16CPU/32GiB rows are the high-CPU follow-up:

    scenario: no_thp_pte_scan_64m
    metric: mincore_ns_per_1k_pages, lower is better

    CPU/mem        v6.15      v6.16  v6.16 batch<=1 fastpath  v6.16 nobatch
    1          12946.889  17117.667                14560.556      13843.222
    2          15053.111  18214.667                15714.778      14270.556
    4          14942.000  18338.222                14397.889      14719.667
    8/16 GiB   15046.444  17540.222                13696.333      13200.000
    16/32 GiB  14674.111  18928.889                13949.000      15351.111

The high-CPU matrix completed 72/72 with all_cpu_match=true,
any_noapic=false, all_autorun_exit0=true, and all_semantic_ok=true. One v6.15
16CPU timing sample in the main matrix was an obvious outlier, so the 16/32 GiB
v6.15 value above uses a clean v6.15-only 9-repeat supplement.

I also ran ftrace attribution on the same path as mechanism evidence, not as
clean timing. In that run, v6.16 original had a higher mincore_pte_range
average than v6.15, v6.16-nobatch, and the batch<=1 fastpath:

    kernel                    mincore_pte_range avg_us
    v6.15-mainline-preempt                       6.040
    v6.16-mainline-preempt                       7.899
    v6.16-mainline-nobatch                       6.031
    v6.16-mainline-fastpath                      6.103

The smaller batch<=1 fastpath helped, but later v6.18 testing suggested the
remaining cost was more about the hot present-PTE branch layout. The candidate
shape I tested is to check pte_present() first, while keeping pte_batch_hint()
for batch > 1:

    if (pte_present(pte)) {
            batch = pte_batch_hint(ptep, pte);
            if (batch > 1)
                    /* step = batch, clamped to the pages left in the range */
                    fill vec[0..step-1];
            else
                    *vec = 1;
    } else if (pte_none(pte) || pte_is_marker(pte)) {
            __mincore_unmapped_range(...);
    } else {
            mincore_swap(...);
    }

x86 uses the generic pte_batch_hint(), which always returns 1, so this mainly
measures the resident-PTE hot path layout. On arm64 the batch > 1 path should
still be preserved, but I have not validated mTHP/contiguous-PTE performance
yet.

For the v6.18 confirmation A/B, the 1/2/4 CPU rows are again the primary
matrix and the 8CPU/16GiB and 16CPU/32GiB rows are the high-CPU follow-up. All
rows use the same no-THP scenario, 9 repetitions, and coverage disabled:

    scenario: no_thp_pte_scan_64m
    metric: mincore_ns_per_1k_pages, lower is better

    CPU/mem        v6.15      v6.18  v6.18 present-first  mean improvement
    1          13373.222  16473.000            11055.222            32.89%
    2          13454.444  16424.444            11467.556            30.18%
    4          13651.778  16772.333            11470.444            31.61%
    8/16 GiB           -  16008.778            10941.444            31.65%
    16/32 GiB          -  17549.556            11725.111            33.19%

And the v7.0.9 A/B, with the same row layout:

    CPU/mem       v7.0.9  v7.0.9 present-first  mean improvement
    1          16328.778             10061.778            38.38%
    2          17600.000             11856.444            32.63%
    4          17819.000             11961.556            32.87%
    8/16 GiB   17379.778             10999.889            36.71%
    16/32 GiB  17917.778             11555.889            35.51%
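
For anyone re-deriving the "mean improvement" column: the published values
are consistent with the relative reduction of the present-first mean against
the unpatched mean of the same release (v6.18 or v7.0.9), not against v6.15.
A small sketch of that arithmetic follows. The helper name is hypothetical,
and whether the bundle computes it from means or per-run is an assumption
here:

```c
/*
 * Hypothetical helper: percentage reduction of the patched mean relative
 * to the unpatched baseline mean. Positive means the patched kernel is
 * faster.
 */
double improvement_pct(double baseline, double patched)
{
	return (baseline - patched) / baseline * 100.0;
}
```

For the 1 CPU v6.18 row, improvement_pct(16473.000, 11055.222) evaluates to
about 32.89, matching the table.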

The high-CPU rows completed 72/72 with all_cpu_match=true,
any_noapic=false, all_autorun_exit0=true, and all_semantic_ok=true. I still
treat them as extended x86 validation because the remaining preservation
question is arm64/mTHP/contiguous-PTE behavior.

I also ran an all-scenario semantic smoke on v6.18 original vs present-first.
Both THP and no-THP scenarios completed with all_semantic_ok=true. That smoke
only checks that the THP/no-THP state shape still behaves as expected on x86;
it is not a substitute for arm64/mTHP preservation testing.

For the branch-ordering safety question, my reading is that pte_is_marker()
goes through softleaf_from_pte(), which first returns an empty leaf for
pte_present() or pte_none(). So a real marker is a non-present, non-none leaf,
and checking pte_present() first should not hide the marker path. I would
still appreciate review from people more familiar with the arch PTE encodings.
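
The ordering argument can be illustrated with a toy userspace model. This is
only a sketch: the bit layout is invented and does not correspond to any
arch PTE encoding, all names are hypothetical, and the "original" ordering
(none-or-marker check first) reflects my reading of the existing code. The
one property it borrows from the description above is that the marker test
reports false for present or none entries:

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy encoding, NOT a real arch layout. */
typedef uint64_t toy_pte_t;
#define TOY_PRESENT (1ull << 0)
#define TOY_MARKER  (1ull << 1)	/* only meaningful when not present */

enum toy_path { PATH_PRESENT, PATH_UNMAPPED, PATH_SWAP };

static bool toy_pte_none(toy_pte_t pte)
{
	return pte == 0;
}

static bool toy_pte_present(toy_pte_t pte)
{
	return (pte & TOY_PRESENT) != 0;
}

/* Present or none entries are never reported as markers. */
static bool toy_pte_is_marker(toy_pte_t pte)
{
	if (toy_pte_present(pte) || toy_pte_none(pte))
		return false;
	return (pte & TOY_MARKER) != 0;
}

/* Assumed existing ordering: none/marker first, then present, else swap. */
enum toy_path classify_original(toy_pte_t pte)
{
	if (toy_pte_none(pte) || toy_pte_is_marker(pte))
		return PATH_UNMAPPED;
	if (toy_pte_present(pte))
		return PATH_PRESENT;
	return PATH_SWAP;
}

/* Candidate ordering: present first. */
enum toy_path classify_present_first(toy_pte_t pte)
{
	if (toy_pte_present(pte))
		return PATH_PRESENT;
	if (toy_pte_none(pte) || toy_pte_is_marker(pte))
		return PATH_UNMAPPED;
	return PATH_SWAP;
}

/* Returns 1 if both orderings pick the same path for every toy value. */
int orderings_agree(void)
{
	for (toy_pte_t pte = 0; pte < 16; pte++)
		if (classify_original(pte) != classify_present_first(pte))
			return 0;
	return 1;
}
```

The two orderings agree in this model because the three predicates partition
the toy values. It says nothing about whether a real arch encoding could make
pte_present() true for an entry the marker path needs to see, which is the
part I would like reviewed.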

I prepared a compact evidence/reproducer bundle here:

https://github.com/lcf0399/linux-mm-regression-evidence-2026-05/tree/206a39d/mincore-present-pte-scan

It includes:

- a standalone C reproducer for the no-THP mincore scan
- the workload target/profile used by the local experiment framework
- the local test patch shape, clearly marked not for direct submission
- compact lab CSV summaries for the v6.16 intro-window A/B, including the
  high-CPU follow-up
- compact lab CSV summaries for the v6.18, v7.0 and high-CPU present-first
  A/B runs
- matched-PREEMPT release-level bridge summaries for the 1/2/4 CPU matrix
  and the separate 8CPU/16CPU context rows

I am intentionally not asking regzbot to track this at this stage. It is a
source-calibrated synthetic signal with a strong x86 lab result across the
primary 1/2/4 CPU matrix and the 8/16CPU present-first A/B follow-up, but it
still needs arm64/mTHP validation and a proper patch before it should be
treated as an upstream-ready fix.

Does this present-first shape look like the right direction to validate further,
or would you prefer a different approach such as a smaller local fastpath around
pte_batch_hint() returning 1?

Thanks,
Chengfeng