This series adds multi-page clearing for hugepages, improving on the
current page-at-a-time approach in two ways:

  - amortizes the per-page setup cost over a larger extent
  - when using string instructions, exposes the real region size to the
    processor. A processor could use that as a hint to optimize based
    on the full extent size. AMD Zen uarchs, as an example, elide
    allocation of cachelines for regions larger than L3-size.
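
As a rough userspace illustration of the two points above (not the code
in this series), the sketch below contrasts clearing one base page per
call with clearing the whole extent in a single call. The names
clear_one_page(), clear_page_at_a_time() and clear_extent() are
hypothetical, memset() stands in for the kernel's clearing primitives,
and a 4KB base page is assumed.

```c
#include <stddef.h>
#include <string.h>

#define PAGE_SIZE 4096UL	/* assumed base page size */

/* Hypothetical stand-in for a per-page clearing primitive. */
static void clear_one_page(void *page)
{
	memset(page, 0, PAGE_SIZE);
}

/*
 * Page-at-a-time: one call per base page, so any per-call setup cost
 * is paid npages times and each call only ever sees PAGE_SIZE bytes.
 * Returns the number of clearing calls issued.
 */
static size_t clear_page_at_a_time(void *addr, size_t npages)
{
	size_t calls = 0;

	for (size_t i = 0; i < npages; i++) {
		clear_one_page((char *)addr + i * PAGE_SIZE);
		calls++;
	}
	return calls;
}

/*
 * Multi-page: a single call covering the whole extent, so the setup
 * cost is paid once and the full length is visible to the primitive.
 */
static size_t clear_extent(void *addr, size_t npages)
{
	memset(addr, 0, npages * PAGE_SIZE);
	return 1;
}
```

For an 8-page buffer the first variant issues 8 clearing calls and the
second issues 1; both leave the buffer zeroed.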

Demand faulting a 64GB region shows good performance improvements:

 $ perf bench mem map -p $page-size -f demand -s 64GB -l 5

                 mm/folio_zero_user    x86/folio_zero_user       change
                  (GB/s +- %stdev)      (GB/s +- %stdev)

   pg-sz=2MB       11.82 +- 0.67%        16.48 +- 0.30%         + 39.4%
   pg-sz=1GB       17.51 +- 1.19%        40.03 +- 7.26% [#]     +129.9%

 [#] Only with preempt=full|lazy because cooperatively preempted models
     need regular invocations of cond_resched(). This limits the extent
     sizes that can be cleared as a unit.
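
To make the footnote concrete, here is a minimal userspace sketch (not
the code in this series) of how the preemption model could bound the
unit that is cleared at once. The MAX_UNIT_PAGES cap, the
resched_point() counter standing in for cond_resched(), and the
clear_pages_chunked() helper are all assumptions made for illustration.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define PAGE_SIZE 4096UL	/* assumed base page size */

/* Hypothetical cap on the unit cleared between reschedule points. */
#define MAX_UNIT_PAGES 2048UL	/* 8MB with 4KB pages; illustrative only */

static size_t resched_points;

/* Stand-in for cond_resched(): only counts how often it is reached. */
static void resched_point(void)
{
	resched_points++;
}

/*
 * When the task can be preempted at any point (the preempt=full|lazy
 * case), clear the whole extent as one unit. Otherwise split it into
 * bounded units with a reschedule point after each one.
 */
static void clear_pages_chunked(void *addr, size_t npages, bool preemptible)
{
	size_t unit = preemptible ? npages : MAX_UNIT_PAGES;
	char *p = addr;

	while (npages) {
		size_t n = npages < unit ? npages : unit;

		memset(p, 0, n * PAGE_SIZE);
		p += n * PAGE_SIZE;
		npages -= n;

		if (!preemptible)
			resched_point();
	}
}
```

With these assumed values, a 4096-page extent is cleared as two units
with two reschedule points in the cooperative case, and as a single
unit with none in the preemptible case.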

Raghavendra also tested on AMD Genoa, which shows similar
improvements [1].