Re: Re: [RFC] mm/mincore: present-PTE scan cost after pte_batch_hint() batching
From: Chengfeng Lin
Date: Wed Jun 10 2026 - 03:21:12 EST
Hi Pedro, David,
Thanks. I tried two newer GCC builds as well, using the same base config and
the same mm/mincore.o build target.
The GCC 14.2 sizes were:
    v6.15 original:              0x1e8
    v6.16 original:              0x229
    v6.16 batch<=1 fastpath:     0x1d9
    v6.16 with batching removed: 0x1d9
I also tried GCC 15.2 from the Ubuntu 25.10 (Questing) packages, extracted
locally rather than installed system-wide:
    v6.15 original:              0x1e0
    v6.16 original:              0x221
    v6.16 batch<=1 fastpath:     0x1d1
    v6.16 with batching removed: 0x1d1
So GCC 14.2 and GCC 15.2 match the GCC 13.3 direction: with both compilers,
v6.16 original is 0x41 bytes larger than v6.15 original and still does not
collapse to the old x86 base-page shape, while the batch<=1 fastpath and the
nobatch variant produce byte-identical `mincore_pte_range()` objdump output.
Clang 18.1.3 behaves differently: all four checked variants produce
byte-identical `mincore_pte_range()` objdump output there.
I also looked at the generated-code difference. It does not look like an
obvious extra inlining decision. With GCC 13.3, GCC 14.2 and GCC 15.2, the
original, fastpath, and nobatch builds all have the same external
call/relocation targets:
    __pte_offset_map_lock
    __pmd_trans_huge_lock
    memset
    __cond_resched
    filemap_get_incore_folio
    __folio_put
    swapper_spaces
For GCC 15.2, I also saved focused optimization-dump snippets for
mincore_pte_range(). They show the same direction before final assembly: the
original build has more GCC intermediate code than the fastpath/nobatch builds
(321 optimized-dump lines and 993 RTL-expand lines, compared with 299 and
921).
The visible difference is the PTE-loop layout. GCC keeps the v6.16 original
`step`-based batching shape in the generated code, so the present-PTE hot path
has an extra split/jump through a common advance block. The nobatch and
batch<=1 fastpath builds produce a more compact, identical
`mincore_pte_range()` objdump.
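To make the two loop shapes concrete without opening the objdump files, here
is a rough user-space model. It is only a sketch under stated assumptions, not
the kernel source: model_batch_hint(), scan_step_based() and scan_base_page()
are hypothetical names, a byte array stands in for the PTEs, and only the
present/not-present cases are modelled.

```c
#include <stddef.h>

/* Hypothetical stand-in for the generic pte_batch_hint(): always 1, as on x86. */
static inline unsigned int model_batch_hint(const unsigned char *pte)
{
	(void)pte;
	return 1;
}

/*
 * Step-based shape: every iteration funnels through a common
 * "advance by step" point, even though step is always 1 here.
 */
static size_t scan_step_based(const unsigned char *ptes, size_t nr,
			      unsigned char *vec)
{
	size_t i = 0, resident = 0;

	while (i < nr) {
		size_t step = 1;

		if (ptes[i]) {	/* models a present PTE */
			unsigned int batch = model_batch_hint(&ptes[i]);

			if (batch > 1) {	/* dead code when the hint is constant 1 */
				size_t max_nr = nr - i;

				step = batch < max_nr ? batch : max_nr;
			}
			for (size_t j = 0; j < step; j++)
				vec[i + j] = 1;
			resident += step;
		} else {
			vec[i] = 0;
		}
		i += step;	/* common advance block */
	}
	return resident;
}

/* Base-page shape: one entry per iteration, no step variable. */
static size_t scan_base_page(const unsigned char *ptes, size_t nr,
			     unsigned char *vec)
{
	size_t resident = 0;

	for (size_t i = 0; i < nr; i++) {
		vec[i] = ptes[i] ? 1 : 0;
		resident += vec[i];
	}
	return resident;
}
```

With a constant hint of 1 both functions fill `vec` identically, so the only
open question is whether the compiler folds the first shape into the second.
The model says nothing about what GCC or Clang emit for the real function.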
After Pedro's suggestion, I also tried forcing the default helper to
`__always_inline`:
    static __always_inline unsigned int pte_batch_hint(pte_t *ptep, pte_t pte)
That did not change the result in this setup. For GCC 13.3, GCC 14.2 and
GCC 15.2, the always-inline variant has the same size and normalized
`mincore_pte_range()` objdump as v6.16 original, not the batch<=1 fastpath /
nobatch shape.
I put the updated GCC14/GCC15 nm/objdump files, focused dump snippets, and a
short side-by-side block note here:
https://github.com/lcf0399/linux-mm-regression-evidence-2026-05/tree/444d66373b7be32f7a06b52b3d9ced7c2c53264f/mincore-present-pte-scan/codegen
One caveat: I have not rerun the timing part on bare metal. The timing signal
I reported is still from the QEMU direct-boot lab, so I do not want to
overstate that part. The codegen comparison above is static `mm/mincore.o`
evidence and is not QEMU-dependent.
So the narrower result is that newer GCC did not make the codegen difference
disappear in this check.
If there is another specific compiler/codegen detail that would be useful to
check, I can look at that.
Thanks,
Chengfeng
> -----Original Message-----
> From: "Pedro Falcato" <pfalcato@xxxxxxx>
> Sent: 2026-06-10 05:12:33 (Wednesday)
> To: "David Hildenbrand (Arm)" <david@xxxxxxxxxx>
> Cc: "Chengfeng Lin" <chengfenglin@xxxxxxxxxxxxxx>, "Andrew Morton" <akpm@xxxxxxxxxxxxxxxxxxxx>, "Liam R. Howlett" <liam@xxxxxxxxxxxxx>, "Lorenzo Stoakes" <ljs@xxxxxxxxxx>, "Vlastimil Babka" <vbabka@xxxxxxxxxx>, "Jann Horn" <jannh@xxxxxxxxxx>, linux-mm@xxxxxxxxx, linux-kernel@xxxxxxxxxxxxxxx, "Baolin Wang" <baolin.wang@xxxxxxxxxxxxxxxxx>, "Barry Song" <baohua@xxxxxxxxxx>, "Dev Jain" <dev.jain@xxxxxxx>, "Ryan Roberts" <ryan.roberts@xxxxxxx>, "Zi Yan" <ziy@xxxxxxxxxx>
> Subject: Re: [RFC] mm/mincore: present-PTE scan cost after pte_batch_hint() batching
>
> On Tue, Jun 09, 2026 at 04:27:50PM +0200, David Hildenbrand (Arm) wrote:
> > On 6/9/26 16:12, Chengfeng Lin wrote:
> > > Hi David,
> >
> > Hi,
> >
> > > Thanks, and sorry for the confusing wording. The plain statement is: I
> > > observed a performance difference in a narrow x86/QEMU synthetic mincore()
> > > case, and after your comment I checked whether this is really a codegen issue.
> > >
> > > The wording in my first mail was too abstract. What I was trying to say is
> > > only that the benchmark focuses on one specific case:
> > >
> > >     private anonymous memory
> > >     MADV_NOHUGEPAGE
> > >     faulted/resident base pages
> > >     repeated mincore() over the range
> > >
> > > so the measured path should mostly be the present-PTE scan in
> > > mincore_pte_range(). I agree that the 8/16 CPU rows are not very useful for
> > > this path; please treat them as extra context only. The useful data is the
> > > single-threaded / low-CPU v6.15 -> v6.16 A/B and the patched variants.
> > >
> > > The compiler used for the lab kernels was:
> > >
> > > gcc (Ubuntu 13.3.0-6ubuntu2~24.04.1) 13.3.0
> >
> > Okay, GCC 13 was released 3 years ago.
> >
> > > GNU ld (GNU Binutils for Ubuntu) 2.42
> > >
> > > Your point about x86 pte_batch_hint() is exactly the right thing to check.
> > > Since pte_batch_hint() returns 1 on x86, I agree that the expectation would be
> > > for the compiler to optimize the batching logic back down to something very
> > > close to the old base-page path.
> > >
> > > I checked the generated mincore_pte_range() code with the same GCC/config setup.
> > > The function sizes from nm are:
> > >
> > >     v6.15 original:              0x1fb
> > >     v6.16 original:              0x245
> > >     v6.16 batch<=1 fastpath:     0x1ec
> > >     v6.16 with batching removed: 0x1ec
> > >
> > > So, with GCC 13.3, the v6.16 original build does not look optimized back to the
> > > old x86 base-page shape. The v6.16 batch<=1 fastpath and the v6.16 nobatch
> > > variant produce the same mincore_pte_range() objdump output in my build.
> > >
> > > I also checked Clang 18.1.3 as a cross-check. With Clang, v6.15 original,
> > > v6.16 original, v6.16 batch<=1 fastpath and v6.16 nobatch all produce the same
> > > mincore_pte_range() size, 0x1f9, and the objdump output is byte-identical.
> > >
> > > So your expectation does hold with Clang, but not with the GCC 13.3 build I used
> > > for the original lab runs. This does not prove a compiler bug, and it means my
> > > original report should be narrowed: it is not a generic x86 mincore()
> > > regression claim. In this check, GCC 13.3 generates a different
> > > mincore_pte_range() shape for v6.16 original, while Clang 18.1.3 generates
> > > byte-identical output for all checked variants. The timing signal I reported
> > > came from the GCC-built QEMU lab kernels.
> >
> > It's probably a good idea to
> >
> > 1) Try with newer GCC
> >
> > 2) Take a look at the actual difference in the generated code
> >
> > Is it some inlining decisions? E.g., if the function is larger, other code is
> > likely to get inlined?
> >
> > The function is not particularly large, so it's a bit unexpected.
> >
>
> FWIW, I don't see anything costly in the linked commit. The compiler /should/
> be able to inline and constant fold everything (for near byte-identical output).
> Though I vaguely recall that GCC uses pseudo-metrics (probably not LOC?) when
> figuring out what to inline, not necessarily something that maps to machine
> instructions directly.
>
> So I am curious as to whether this is a compiler issue as well. Or if this
> particular build just happens to hit the issue due to some random limit.
>
> This could be worth a test, however:
>
> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> index cdd68ed3ae1a..0e0ac7138d8c 100644
> --- a/include/linux/pgtable.h
> +++ b/include/linux/pgtable.h
> @@ -373,7 +373,7 @@ static inline void lazy_mmu_mode_resume(void) {}
>   *
>   * May be overridden by the architecture, else pte_batch_hint is always 1.
>   */
> -static inline unsigned int pte_batch_hint(pte_t *ptep, pte_t pte)
> +static __always_inline unsigned int pte_batch_hint(pte_t *ptep, pte_t pte)
>  {
>  	return 1;
>  }
>
>
> --
> Pedro