Re: [PATCH v4 00/13] x86/mm: Add multi-page clearing

From: Raghavendra K T
Date: Fri Jul 04 2025 - 04:29:40 EST



On 6/16/2025 10:52 AM, Ankur Arora wrote:
> This series adds multi-page clearing for hugepages, improving on the
> current page-at-a-time approach in two ways:
>
>  - amortizes the per-page setup cost over a larger extent
>  - when using string instructions, exposes the real region size to the
>    processor. A processor could use that as a hint to optimize based
>    on the full extent size. AMD Zen uarchs, as an example, elide
>    allocation of cachelines for regions larger than L3-size.
>
> Demand faulting a 64GB region shows good performance improvements:
>
>  $ perf bench mem map -p $page-size -f demand -s 64GB -l 5
>
>                  mm/folio_zero_user    x86/folio_zero_user     change
>                   (GB/s +- %stdev)      (GB/s +- %stdev)
>
>  pg-sz=2MB        11.82 +- 0.67%        16.48 +- 0.30%         + 39.4%
>  pg-sz=1GB        17.51 +- 1.19%        40.03 +- 7.26% [#]     +129.9%
>
>  [#] Only with preempt=full|lazy because cooperatively preempted models
>  need regular invocations of cond_resched(). This limits the extent
>  sizes that can be cleared as a unit.
>
> Raghavendra also tested on AMD Genoa and that shows similar
> improvements [1].
[...]
Sorry for coming back late on this. It was nice to have it integrated
into perf bench mem (easy to test :)).

I see a similar (almost the same) improvement again with the rebased
kernel and patchset.
Tested only with preempt=lazy and boost=1.

base    = 6.16-rc4 + patches 1-9 of this series
patched = 6.16-rc4 + all patches

SUT: Genoa+ AMD EPYC 9B24

$ perf bench mem map -p $page-size -f populate -s 64GB -l 10
                 base                  patched               change
pg-sz=2MB        12.731939 GB/sec      26.304263 GB/sec      +106.6%
pg-sz=1GB        26.232423 GB/sec      61.174836 GB/sec      +133.2%

For 4KB page size there is a slight improvement (mostly noise).
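Just to spell out how the change column above is derived (plain arithmetic on the reported throughputs, nothing assumed beyond the numbers in the table):

```c
#include <assert.h>
#include <math.h>

/* Percentage change between base and patched throughput,
 * as in the "change" column: (patched / base - 1) * 100. */
static double pct_change(double base, double patched)
{
	return (patched / base - 1.0) * 100.0;
}

/* pct_change(12.731939, 26.304263) ~= 106.6  (pg-sz=2MB)
 * pct_change(26.232423, 61.174836) ~= 133.2  (pg-sz=1GB) */
```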

Thanks and Regards
- Raghu