Re: [RFC] mm/mincore: present-PTE scan cost after pte_batch_hint() batching

From: Pedro Falcato

Date: Tue Jun 09 2026 - 17:14:35 EST


On Tue, Jun 09, 2026 at 04:27:50PM +0200, David Hildenbrand (Arm) wrote:
> On 6/9/26 16:12, Chengfeng Lin wrote:
> > Hi David,
>
> Hi,
>
> > Thanks, and sorry for the confusing wording. The plain statement is: I
> > observed a performance difference in a narrow x86/QEMU synthetic mincore()
> > case, and after your comment I checked whether this is really a codegen issue.
> >
> > The wording in my first mail was too abstract. What I was trying to say is
> > only that the benchmark focuses on one specific case:
> >
> > private anonymous memory
> > MADV_NOHUGEPAGE
> > faulted/resident base pages
> > repeated mincore() over the range
> >
> > so the measured path should mostly be the present-PTE scan in
> > mincore_pte_range(). I agree that the 8/16 CPU rows are not very useful for
> > this path; please treat them as extra context only. The useful data is the
> > single-threaded / low-CPU v6.15 -> v6.16 A/B and the patched variants.
> >
> > The compiler used for the lab kernels was:
> >
> > gcc (Ubuntu 13.3.0-6ubuntu2~24.04.1) 13.3.0
>
> Okay, GCC 13 was released 3 years ago.
>
> > GNU ld (GNU Binutils for Ubuntu) 2.42
> >
> > Your point about x86 pte_batch_hint() is exactly the right thing to check.
> > Since pte_batch_hint() returns 1 on x86, I agree that the expectation would be
> > for the compiler to optimize the batching logic back down to something very
> > close to the old base-page path.
> >
> > I checked the generated mincore_pte_range() code with the same GCC/config setup.
> > The function sizes from nm are:
> >
> > v6.15 original: 0x1fb
> > v6.16 original: 0x245
> > v6.16 batch<=1 fastpath: 0x1ec
> > v6.16 with batching removed: 0x1ec
> >
> > So, with GCC 13.3, the v6.16 original build does not look optimized back to the
> > old x86 base-page shape. The v6.16 batch<=1 fastpath and the v6.16 nobatch
> > variant produce the same mincore_pte_range() objdump output in my build.
> >
> > I also checked Clang 18.1.3 as a cross-check. With Clang, v6.15 original,
> > v6.16 original, v6.16 batch<=1 fastpath and v6.16 nobatch all produce the same
> > mincore_pte_range() size, 0x1f9, and the objdump output is byte-identical.
> >
> > So your expectation does hold with Clang, but not with the GCC 13.3 build I used
> > for the original lab runs. This does not prove a compiler bug, and it means my
> > original report should be narrowed: it is not a generic x86 mincore()
> > regression claim. In this check, GCC 13.3 generates a different
> > mincore_pte_range() shape for v6.16 original, while Clang 18.1.3 generates
> > byte-identical output for all checked variants. The timing signal I reported
> > came from the GCC-built QEMU lab kernels.
>
> It's probably a good idea to
>
> 1) Try with newer GCC
>
> 2) Take a look at the actual difference in the generated code
>
> Is it some inlining decisions? E.g., if the function is larger, other code is
> likely to get inlined?
>
> The function is not particularly large, so it's a bit unexpected.
>

FWIW, I don't see anything costly in the linked commit. The compiler /should/
be able to inline and constant-fold everything, giving near byte-identical
output. Though I vaguely recall that GCC's inlining decisions are based on
pseudo-metrics (probably not LOC?) that do not necessarily map directly to
machine instructions.

So I am also curious whether this is a compiler issue, or whether this
particular build just happens to trip over some arbitrary heuristic limit.

This could be worth a test, however:

diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index cdd68ed3ae1a..0e0ac7138d8c 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -373,7 +373,7 @@ static inline void lazy_mmu_mode_resume(void) {}
  *
  * May be overridden by the architecture, else pte_batch_hint is always 1.
  */
-static inline unsigned int pte_batch_hint(pte_t *ptep, pte_t pte)
+static __always_inline unsigned int pte_batch_hint(pte_t *ptep, pte_t pte)
 {
 	return 1;
 }
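
To make the expectation concrete, here is a rough user-space sketch (not the
kernel code). All names in it (batch_hint_model, scan_per_page, scan_batched,
check_scans_agree) are hypothetical, and the "PTEs" are just bytes. It models a
hint that always returns 1 and is forced inline, so the batch > 1 branch is
dead code and the batched scan should be foldable down to the per-page scan:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical stand-in for a generic batch hint that always returns 1.
 * The always_inline attribute mirrors what __always_inline asks for.
 */
static inline __attribute__((always_inline))
unsigned int batch_hint_model(const unsigned char *ptep)
{
	(void)ptep;
	return 1;
}

/* Per-page scan: one entry per iteration, no batching logic at all. */
void scan_per_page(const unsigned char *ptes, unsigned char *vec, size_t nr)
{
	for (size_t i = 0; i < nr; i++)
		vec[i] = ptes[i] ? 1 : 0;
}

/*
 * Batched scan: asks the hint how many entries can be handled at once.
 * With a hint that is constant 1, step never exceeds 1, so this does the
 * same work as scan_per_page().
 */
void scan_batched(const unsigned char *ptes, unsigned char *vec, size_t nr)
{
	size_t i = 0;

	while (i < nr) {
		size_t step = 1;

		if (ptes[i]) {
			unsigned int batch = batch_hint_model(&ptes[i]);

			if (batch > 1) {
				size_t max_nr = nr - i;

				step = batch < max_nr ? batch : max_nr;
			}
			for (size_t j = 0; j < step; j++)
				vec[i + j] = 1;
		} else {
			vec[i] = 0;
		}
		i += step;
	}
}

/* Sanity check: both scans must produce the same result vector. */
void check_scans_agree(void)
{
	const unsigned char ptes[8] = {1, 0, 1, 1, 0, 0, 1, 1};
	unsigned char a[8], b[8];

	scan_per_page(ptes, a, sizeof(a));
	scan_batched(ptes, b, sizeof(b));
	assert(memcmp(a, b, sizeof(a)) == 0);
}
```

Building this with something like "gcc -O2 -c" and comparing the two scan
functions with "nm -S" or objdump, with and without the always_inline
attribute, would be one way to see whether a given compiler folds the batch
path away. It is only a toy model, so it says nothing definite about what
happens to mincore_pte_range() in the real build.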


--
Pedro