Re: Re: [RFC] mm/mincore: present-PTE scan cost after pte_batch_hint() batching

From: Chengfeng Lin

Date: Wed Jun 10 2026 - 05:36:13 EST

Hi Baolin,

Thanks a lot for testing it on bare metal.

Your result is useful. Since your Xeon results show almost no difference
between base and the revert, I agree that the timing signal I saw should not be
treated as a real mincore() regression. The static GCC codegen difference is
still there, but it does not appear to translate into a measurable bare-metal
regression in your test.

I will not pursue this further as a regression report unless I can reproduce it
on bare metal.

Thanks,
Chengfeng

> -----Original Message-----
> From: "Baolin Wang" <baolin.wang@xxxxxxxxxxxxxxxxx>
> Sent: 2026-06-10 16:45:05 (Wednesday)
> To: "Chengfeng Lin" <chengfenglin@xxxxxxxxxxxxxx>, "Pedro Falcato" <pfalcato@xxxxxxx>
> Cc: "David Hildenbrand (Arm)" <david@xxxxxxxxxx>, "Andrew Morton" <akpm@xxxxxxxxxxxxxxxxxxxx>, "Liam R. Howlett" <liam@xxxxxxxxxxxxx>, "Lorenzo Stoakes" <ljs@xxxxxxxxxx>, "Vlastimil Babka" <vbabka@xxxxxxxxxx>, "Jann Horn" <jannh@xxxxxxxxxx>, linux-mm@xxxxxxxxx, linux-kernel@xxxxxxxxxxxxxxx, "Barry Song" <baohua@xxxxxxxxxx>, "Dev Jain" <dev.jain@xxxxxxx>, "Ryan Roberts" <ryan.roberts@xxxxxxx>, "Zi Yan" <ziy@xxxxxxxxxx>
> Subject: Re: [RFC] mm/mincore: present-PTE scan cost after pte_batch_hint() batching
>
>
>
> On 6/10/26 3:20 PM, Chengfeng Lin wrote:
> > Hi Pedro, David,
> >
> > Thanks. I tried two newer GCC builds as well, using the same base config and
> > the same mm/mincore.o build target.
> >
> > The GCC 14.2 sizes were:
> >
> > v6.15 original: 0x1e8
> > v6.16 original: 0x229
> > v6.16 batch<=1 fastpath: 0x1d9
> > v6.16 with batching removed: 0x1d9
> >
> > I also tried GCC 15.2 from the Ubuntu 25.10 (Questing) packages, extracted
> > locally rather than installed system-wide:
> >
> > v6.15 original: 0x1e0
> > v6.16 original: 0x221
> > v6.16 batch<=1 fastpath: 0x1d1
> > v6.16 with batching removed: 0x1d1
> >
> > So GCC 14.2 and GCC 15.2 match the GCC 13.3 direction: v6.16 original still
> > does not collapse to the old x86 base-page shape, while the batch<=1 fastpath
> > and the nobatch variant produce byte-identical `mincore_pte_range()` objdump
> > output. Clang 18.1.3 behaves differently: all four checked variants produce
> > byte-identical `mincore_pte_range()` objdump output there.
> >
> > I also looked at the generated-code difference. It does not look like an
> > obvious extra inlining decision. With GCC 13.3, GCC 14.2 and GCC 15.2, the
> > original, fastpath, and nobatch builds all have the same external
> > call/relocation targets:
> >
> > __pte_offset_map_lock
> > __pmd_trans_huge_lock
> > memset
> > __cond_resched
> > filemap_get_incore_folio
> > __folio_put
> > swapper_spaces
> >
> > For GCC 15.2, I also saved focused optimization-dump snippets for
> > mincore_pte_range(). They show the same direction before final assembly: the
> > original build has more GCC intermediate code than the fastpath/nobatch builds
> > (321 optimized-dump lines and 993 RTL-expand lines, compared with 299 and
> > 921).
> >
> > The visible difference is the PTE-loop layout. GCC keeps the v6.16 original
> > `step`-based batching shape in the generated code, so the present-PTE hot path
> > has an extra split/jump through a common advance block. The nobatch and
> > batch<=1 fastpath builds produce a more compact, identical
> > `mincore_pte_range()` objdump.
> >
> > After Pedro's suggestion, I also tried forcing the default helper to
> > `__always_inline`:
> >
> > static __always_inline unsigned int pte_batch_hint(pte_t *ptep, pte_t pte)
> >
> > That did not change the result in this setup. For GCC 13.3, GCC 14.2 and
> > GCC 15.2, the always-inline variant has the same size and normalized
> > `mincore_pte_range()` objdump as v6.16 original, not the batch<=1 fastpath /
> > nobatch shape.
> >
> > I put the updated GCC14/GCC15 nm/objdump files, focused dump snippets, and a
> > short side-by-side block note here:
> >
> > https://github.com/lcf0399/linux-mm-regression-evidence-2026-05/tree/444d66373b7be32f7a06b52b3d9ced7c2c53264f/mincore-present-pte-scan/codegen
> >
> > One caveat: I have not rerun the timing part on bare metal. The timing signal
> > I reported is still from the QEMU direct-boot lab, so I do not want to
> > overstate that part. The codegen comparison above is static `mm/mincore.o`
> > evidence and is not QEMU-dependent.
> >
> > So the narrower result is that newer GCC did not make the codegen difference
> > disappear in this check.
> >
> > If there is another specific compiler/codegen detail that would be useful to
> > check, I can look at that.
>
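
For anyone following the loop-layout point above, here is a minimal userspace
sketch of the two shapes being compared. It is a simplified model under stated
assumptions, not the kernel source: `model_pte_t`, `model_batch_hint()`,
`scan_step_based()` and `scan_nobatch()` are hypothetical names, a "PTE" is
just a zero/nonzero word, and the helper always returns 1 to mimic the generic
default `pte_batch_hint()`.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a PTE: nonzero means "present". */
typedef unsigned long model_pte_t;

/* Models the generic default helper, which reports a batch of one entry. */
static inline unsigned int model_batch_hint(const model_pte_t *ptep,
					    model_pte_t pte)
{
	(void)ptep;
	(void)pte;
	return 1;
}

/* Step-based shape: every branch funnels into one common advance block. */
static size_t scan_step_based(const model_pte_t *ptes, size_t nr,
			      unsigned char *vec)
{
	size_t present = 0, i = 0;

	assert(ptes && vec);
	while (i < nr) {
		size_t step = 1;

		if (ptes[i]) {
			unsigned int batch = model_batch_hint(&ptes[i], ptes[i]);

			if (batch > 1) {
				size_t max_nr = nr - i;

				step = batch < max_nr ? batch : max_nr;
			}
			for (size_t j = 0; j < step; j++)
				vec[i + j] = 1;
			present += step;
		} else {
			vec[i] = 0;
		}
		i += step;	/* common advance block */
	}
	return present;
}

/* No-batch shape: one entry per iteration, no step bookkeeping. */
static size_t scan_nobatch(const model_pte_t *ptes, size_t nr,
			   unsigned char *vec)
{
	size_t present = 0;

	assert(ptes && vec);
	for (size_t i = 0; i < nr; i++) {
		vec[i] = ptes[i] ? 1 : 0;
		present += vec[i];
	}
	return present;
}
```

With a hint that is always 1, the two functions are semantically equivalent,
so any difference in the generated code comes down to whether the compiler
folds the `step` bookkeeping away.
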
> I quickly ran your test case on my x86 Xeon machine (I haven't reviewed
> your test code yet) and didn't observe any obvious regression. Each value
> below is the average of 3 runs, and the data shows some noisy
> fluctuations:
>
> test cases: ./mincore_present_pte_scan no_thp_pte_scan_64m 1
> metric: mincore_ns_per_1k_pages, lower is better
>
>   base    revert 4df65651f7075
>   2675    2670
>
> I still suspect it's an issue with your QEMU test environment. I'd
> suggest retesting on bare metal.
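
For reference, a metric of the `mincore_ns_per_1k_pages` kind can be
approximated with a small userspace sketch like the one below. This is not the
`mincore_present_pte_scan` test program; the helper name
`time_mincore_ns_per_1k_pages()` is hypothetical, and it assumes Linux with an
anonymous private mapping that is fully touched before a single `mincore()`
call is timed.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <time.h>
#include <unistd.h>

/* Monotonic clock reading in nanoseconds. */
static uint64_t now_ns(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/*
 * Hypothetical helper: time one mincore() call over `len` bytes of touched
 * anonymous memory and return nanoseconds per 1000 pages, or -1.0 on error.
 * The number of pages reported resident is stored in *resident.
 */
static double time_mincore_ns_per_1k_pages(size_t len, size_t *resident)
{
	long page = sysconf(_SC_PAGESIZE);
	size_t nr_pages, i;
	unsigned char *vec;
	uint64_t t0, t1;
	void *buf;
	int rc;

	assert(page > 0 && len > 0 && len % (size_t)page == 0);
	nr_pages = len / (size_t)page;

	buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (buf == MAP_FAILED)
		return -1.0;
	vec = malloc(nr_pages);
	if (!vec) {
		munmap(buf, len);
		return -1.0;
	}
#ifdef MADV_NOHUGEPAGE
	/* Best effort: ask for base pages so the PTE-level path is exercised. */
	madvise(buf, len, MADV_NOHUGEPAGE);
#endif
	memset(buf, 1, len);	/* fault in every page */

	t0 = now_ns();
	rc = mincore(buf, len, vec);
	t1 = now_ns();

	*resident = 0;
	for (i = 0; rc == 0 && i < nr_pages; i++)
		*resident += vec[i] & 1;

	free(vec);
	munmap(buf, len);
	return rc ? -1.0 : (double)(t1 - t0) * 1000.0 / (double)nr_pages;
}
```

A single call like this is noisy; averaging several runs on an otherwise idle
bare-metal machine would be needed before reading anything into a difference
of a few units.
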