Re: [RFC] mm/mincore: present-PTE scan cost after pte_batch_hint() batching

From: Baolin Wang

Date: Wed Jun 10 2026 - 04:52:44 EST

On 6/10/26 3:20 PM, Chengfeng Lin wrote:

Hi Pedro, David,

Thanks. I tried two newer GCC builds as well, using the same base config and
the same mm/mincore.o build target.

The GCC 14.2 sizes were:

v6.15 original: 0x1e8
v6.16 original: 0x229
v6.16 batch<=1 fastpath: 0x1d9
v6.16 with batching removed: 0x1d9

I also tried GCC 15.2 from the Ubuntu 25.10 (Questing) packages, extracted
locally rather than installed system-wide:

v6.15 original: 0x1e0
v6.16 original: 0x221
v6.16 batch<=1 fastpath: 0x1d1
v6.16 with batching removed: 0x1d1

So GCC 14.2 and GCC 15.2 match the GCC 13.3 direction: v6.16 original still
does not collapse to the old x86 base-page shape, while the batch<=1 fastpath
and the nobatch variant produce byte-identical `mincore_pte_range()` objdump
output. Clang 18.1.3 behaves differently: all four checked variants produce
byte-identical `mincore_pte_range()` objdump output there.

I also looked at the generated-code difference. It does not look like an
obvious extra inlining decision. With GCC 13.3, GCC 14.2 and GCC 15.2, the
original, fastpath, and nobatch builds all have the same external
call/relocation targets:

__pte_offset_map_lock
__pmd_trans_huge_lock
memset
__cond_resched
filemap_get_incore_folio
__folio_put
swapper_spaces

For GCC 15.2, I also saved focused optimization-dump snippets for
mincore_pte_range(). They show the same direction before final assembly: the
original build has more GCC intermediate code than the fastpath/nobatch builds
(321 optimized-dump lines and 993 RTL-expand lines, compared with 299 and
921).

The visible difference is the PTE-loop layout. GCC keeps the v6.16 original
`step`-based batching shape in the generated code, so the present-PTE hot path
has an extra split/jump through a common advance block. The nobatch and
batch<=1 fastpath builds produce a more compact, identical
`mincore_pte_range()` objdump.
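
A rough userspace model of the two loop shapes may help make the layout point concrete. This is only a sketch, not the kernel code: `toy_pte_t`, `toy_batch_hint()`, `scan_step()` and `scan_nobatch()` are hypothetical names, and the hint is assumed to always return 1, as the generic `pte_batch_hint()` does on x86.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy stand-in for a PTE: nonzero means "present". Not the kernel's pte_t. */
typedef unsigned long toy_pte_t;

/* Hypothetical analogue of the generic batch hint: always 1 entry. */
static inline unsigned int toy_batch_hint(const toy_pte_t *ptep, toy_pte_t pte)
{
	(void)ptep;
	(void)pte;
	return 1;
}

/* Step-based shape: every path funnels into one variable-stride advance. */
static void scan_step(const toy_pte_t *ptes, size_t nr, unsigned char *vec)
{
	size_t i = 0;

	while (i < nr) {
		size_t step = 1;
		toy_pte_t pte = ptes[i];

		if (pte) {
			unsigned int batch = toy_batch_hint(&ptes[i], pte);

			if (batch > 1) {
				size_t max_nr = nr - i;

				step = batch < max_nr ? batch : max_nr;
			}
			memset(&vec[i], 1, step);
		} else {
			vec[i] = 0;
		}
		i += step;	/* common advance block shared by all paths */
	}
}

/* No-batch shape: fixed stride of one entry per iteration. */
static void scan_nobatch(const toy_pte_t *ptes, size_t nr, unsigned char *vec)
{
	for (size_t i = 0; i < nr; i++)
		vec[i] = ptes[i] ? 1 : 0;
}

/* Sanity check: with a hint of 1, both shapes must fill vec identically. */
static void check_scans_agree(void)
{
	toy_pte_t ptes[6] = {1, 0, 1, 1, 0, 1};
	unsigned char a[6], b[6];

	scan_step(ptes, 6, a);
	scan_nobatch(ptes, 6, b);
	assert(memcmp(a, b, sizeof(a)) == 0);
}
```

Both functions compute the same result when the hint is 1. The open question in this thread is only whether the compiler folds the first shape into the second.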

After Pedro's suggestion, I also tried forcing the default helper to
`__always_inline`:

static __always_inline unsigned int pte_batch_hint(pte_t *ptep, pte_t pte)

That did not change the result in this setup. For GCC 13.3, GCC 14.2 and
GCC 15.2, the always-inline variant has the same size and normalized
`mincore_pte_range()` objdump as v6.16 original, not the batch<=1 fastpath /
nobatch shape.

I put the updated GCC14/GCC15 nm/objdump files, focused dump snippets, and a
short side-by-side block note here:

https://github.com/lcf0399/linux-mm-regression-evidence-2026-05/tree/444d66373b7be32f7a06b52b3d9ced7c2c53264f/mincore-present-pte-scan/codegen

One caveat: I have not rerun the timing part on bare metal. The timing signal
I reported is still from the QEMU direct-boot lab, so I do not want to
overstate that part. The codegen comparison above is static `mm/mincore.o`
evidence and is not QEMU-dependent.

So the narrower result is that newer GCC did not make the codegen difference
disappear in this check.

If there is another specific compiler/codegen detail that would be useful to
check, I can look at that.

I quickly ran your test case on my x86 Xeon machine (I haven't reviewed your
test code yet) and didn't observe any obvious regression. I ran it 3 times and
took the average; the data also shows some run-to-run noise:

test case: ./mincore_present_pte_scan no_thp_pte_scan_64m 1
metric: mincore_ns_per_1k_pages, lower is better

base    revert 4df65651f7075
2675    2670
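
For reference, a metric of this kind can be approximated as below. This is a minimal sketch and not the actual `mincore_present_pte_scan` program, which I have not reviewed. The helper name is hypothetical, and the setup (anonymous private mapping, THP disabled via `MADV_NOHUGEPAGE`, all pages touched first) is only my guess from the test-case name.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <time.h>
#include <unistd.h>

/*
 * Hypothetical helper: time one mincore() call over `bytes` of present,
 * base-page anonymous memory and return nanoseconds per 1000 pages.
 * Returns -1.0 on any setup or syscall failure.
 */
static double mincore_ns_per_1k_pages(size_t bytes)
{
	long page = sysconf(_SC_PAGESIZE);
	struct timespec t0, t1;
	unsigned char *vec;
	size_t nr_pages;
	double ns;
	void *buf;
	int rc;

	if (page <= 0 || bytes < (size_t)page)
		return -1.0;
	nr_pages = bytes / (size_t)page;
	bytes = nr_pages * (size_t)page;

	buf = mmap(NULL, bytes, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (buf == MAP_FAILED)
		return -1.0;
#ifdef MADV_NOHUGEPAGE
	/* Best effort: ask for base pages only; failure is ignored. */
	(void)madvise(buf, bytes, MADV_NOHUGEPAGE);
#endif
	memset(buf, 1, bytes);	/* fault in every page so the PTEs are present */

	vec = malloc(nr_pages);
	if (!vec) {
		munmap(buf, bytes);
		return -1.0;
	}

	clock_gettime(CLOCK_MONOTONIC, &t0);
	rc = mincore(buf, bytes, vec);
	clock_gettime(CLOCK_MONOTONIC, &t1);

	free(vec);
	munmap(buf, bytes);
	if (rc != 0)
		return -1.0;

	ns = (double)(t1.tv_sec - t0.tv_sec) * 1e9 +
	     (double)(t1.tv_nsec - t0.tv_nsec);
	return ns * 1000.0 / (double)nr_pages;
}
```

A single call like this is very sensitive to timer resolution and scheduling, so averaging several runs (as done for the numbers above) is the minimum needed to separate a small codegen effect from noise.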

I still suspect it's an issue with your QEMU test environment. I'd suggest
retesting on bare metal.