Re: [RFC] mm/mincore: present-PTE scan cost after pte_batch_hint() batching

From: David Hildenbrand (Arm)

Date: Tue Jun 09 2026 - 10:42:52 EST

On 6/9/26 16:12, Chengfeng Lin wrote:
> Hi David,

Hi,

> Thanks, and sorry for the confusing wording. The plain statement is: I
> observed a performance difference in a narrow x86/QEMU synthetic mincore()
> case, and after your comment I checked whether this is really a codegen issue.
>
> The wording in my first mail was too abstract. What I was trying to say is
> only that the benchmark focuses on one specific case:
>
> private anonymous memory
> MADV_NOHUGEPAGE
> faulted/resident base pages
> repeated mincore() over the range
>
> so the measured path should mostly be the present-PTE scan in
> mincore_pte_range(). I agree that the 8/16 CPU rows are not very useful for
> this path; please treat them as extra context only. The useful data is the
> single-threaded / low-CPU v6.15 -> v6.16 A/B and the patched variants.
>
> The compiler used for the lab kernels was:
>
> gcc (Ubuntu 13.3.0-6ubuntu2~24.04.1) 13.3.0

Okay, GCC 13 was released 3 years ago.

> GNU ld (GNU Binutils for Ubuntu) 2.42
>
> Your point about x86 pte_batch_hint() is exactly the right thing to check.
> Since pte_batch_hint() returns 1 on x86, I agree that the expectation would be
> for the compiler to optimize the batching logic back down to something very
> close to the old base-page path.
>
> I checked the generated mincore_pte_range() code with the same GCC/config setup.
> The function sizes from nm are:
>
> v6.15 original: 0x1fb
> v6.16 original: 0x245
> v6.16 batch<=1 fastpath: 0x1ec
> v6.16 with batching removed: 0x1ec
>
> So, with GCC 13.3, the v6.16 original build does not look optimized back to the
> old x86 base-page shape. The v6.16 batch<=1 fastpath and the v6.16 nobatch
> variant produce the same mincore_pte_range() objdump output in my build.
>
> I also checked Clang 18.1.3 as a cross-check. With Clang, v6.15 original,
> v6.16 original, v6.16 batch<=1 fastpath and v6.16 nobatch all produce the same
> mincore_pte_range() size, 0x1f9, and the objdump output is byte-identical.
>
> So your expectation does hold with Clang, but not with the GCC 13.3 build I used
> for the original lab runs. This does not prove a compiler bug, and it means my
> original report should be narrowed: it is not a generic x86 mincore()
> regression claim. In this check, GCC 13.3 generates a different
> mincore_pte_range() shape for v6.16 original, while Clang 18.1.3 generates
> byte-identical output for all checked variants. The timing signal I reported
> came from the GCC-built QEMU lab kernels.
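
For anyone who wants to reproduce the case described above, a minimal
userspace sketch in C could look like the following. This is not the
benchmark from the original report: scan_resident() and its parameters are
hypothetical names made up for illustration, and timing is left out
entirely. It only maps private anonymous memory, applies MADV_NOHUGEPAGE,
faults in every base page and then calls mincore() repeatedly over the range.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical helper (not from the original report): map `npages` of
 * private anonymous memory, opt out of THP, fault every page in, then call
 * mincore() `iters` times over the whole range.
 *
 * Returns the number of pages reported resident by the last call, or -1 on
 * error.
 */
static long scan_resident(size_t npages, int iters)
{
	long psz = sysconf(_SC_PAGESIZE);
	unsigned char *vec;
	long resident = -1;
	size_t len;
	char *buf;

	assert(psz > 0);
	len = npages * (size_t)psz;

	buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (buf == MAP_FAILED)
		return -1;

	/* One status byte per page; zeroed so iters == 0 is well defined. */
	vec = calloc(npages, 1);
	if (!vec)
		goto out_unmap;

	/* Best effort: this can fail on kernels built without THP support. */
	(void)madvise(buf, len, MADV_NOHUGEPAGE);

	/* Write to each page so that it is backed by a present PTE. */
	for (size_t i = 0; i < npages; i++)
		buf[i * (size_t)psz] = 1;

	/* The loop a benchmark would time: repeated scans of the same range. */
	for (int it = 0; it < iters; it++) {
		if (mincore(buf, len, vec) != 0)
			goto out_free;
	}

	/* Bit 0 of each vector byte is the residency flag. */
	resident = 0;
	for (size_t i = 0; i < npages; i++)
		resident += vec[i] & 1;

out_free:
	free(vec);
out_unmap:
	munmap(buf, len);
	return resident;
}
```

Calling scan_resident(64, 4) from a small main() and wrapping the mincore()
loop in clock_gettime() would be one way to turn this into a timing test;
the page count and iteration count here are arbitrary.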

It's probably a good idea to

1) Try with a newer GCC

2) Take a look at the actual difference in the generated code

Is it down to inlining decisions? E.g., is the function larger because other
code got inlined into it?

The function is not particularly large, so it's a bit unexpected.
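
To illustrate why I would expect the two shapes to collapse into one, here
is a simplified userspace model in C. It is only a sketch, not the kernel
code: fake_pte_t, fake_batch_hint(), scan_single() and scan_batched() are
made-up names, and the "present" check is reduced to testing bit 0. With a
hint that is a compile-time constant 1, the batched loop does the same work
as the per-entry loop, so an optimizing compiler has everything it needs to
fold the clamp and the inner loop away.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for a PTE; bit 0 plays the role of the present bit. */
typedef unsigned long fake_pte_t;

/* Models a hint with no contiguity information: always a batch of 1. */
static inline unsigned int fake_batch_hint(const fake_pte_t *ptep,
					   fake_pte_t pte)
{
	(void)ptep;
	(void)pte;
	return 1;
}

/* Per-entry scan: one vector byte per entry, no batching at all. */
static void scan_single(const fake_pte_t *ptes, size_t n, unsigned char *vec)
{
	for (size_t i = 0; i < n; i++)
		vec[i] = ptes[i] & 1;
}

/* Batched scan: consume up to "hint" present entries per iteration. */
static void scan_batched(const fake_pte_t *ptes, size_t n, unsigned char *vec)
{
	size_t i = 0;

	while (i < n) {
		fake_pte_t pte = ptes[i];
		size_t step = 1;

		if (pte & 1) {
			size_t max_nr = n - i;

			/* Clamp the hint so we never run past the range. */
			step = fake_batch_hint(&ptes[i], pte);
			if (step > max_nr)
				step = max_nr;
			for (size_t j = 0; j < step; j++)
				vec[i + j] = 1;
		} else {
			vec[i] = 0;
		}
		i += step;
	}
}
```

Both functions fill the vector identically for any input as long as the hint
is 1. Comparing the disassembly of the two at -O2 is a quick way to see
whether a given compiler version folds the batched form back down.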

--
Cheers,

David