Re: [PATCH v3 0/4] mm/folio_zero_user: add multi-page clearing

From: Ankur Arora
Date: Mon Apr 14 2025 - 15:20:22 EST



Ingo Molnar <mingo@xxxxxxxxxx> writes:

> * Ankur Arora <ankur.a.arora@xxxxxxxxxx> wrote:
>
>> We also see performance improvement for cases where this optimization is
>> unavailable (pg-sz=2MB on AMD, and pg-sz=2MB|1GB on Intel) because
>> REP; STOS is typically microcoded which can now be amortized over
>> larger regions and the hint allows the hardware prefetcher to do a
>> better job.
>>
>> Milan (EPYC 7J13, boost=0, preempt=full|lazy):
>>
>>                mm/folio_zero_user    x86/folio_zero_user     change
>>                (GB/s +- stddev)      (GB/s +- stddev)
>>
>>   pg-sz=1GB    16.51 +- 0.54%        42.80 +- 3.48%        + 159.2%
>>   pg-sz=2MB    11.89 +- 0.78%        16.12 +- 0.12%        +  35.5%
>>
>> Icelakex (Platinum 8358, no_turbo=1, preempt=full|lazy):
>>
>>                mm/folio_zero_user    x86/folio_zero_user     change
>>                (GB/s +- stddev)      (GB/s +- stddev)
>>
>>   pg-sz=1GB     8.01 +- 0.24%        11.26 +- 0.48%        + 40.57%
>>   pg-sz=2MB     7.95 +- 0.30%        10.90 +- 0.26%        + 37.10%
>
> How was this measured? Could you integrate this measurement as a new
> tools/perf/bench/ subcommand so that people can try it on different
> systems, etc.? There's already a 'perf bench mem' subcommand space
> where this feature could be added to.

This was measured with a trivial standalone mmap workload, similar to
what qemu does when creating a VM. Really, any hugetlb mmap() would do.
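
For reference, a minimal sketch of that kind of workload (not the actual
test program). It maps an anonymous region, touches one byte per page so
that the kernel has to fault in and zero every page, and reports the
rate. The name fault_in_gbps() and its parameters are hypothetical, and
it assumes Linux with huge pages already reserved when MAP_HUGETLB is
passed:

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>
#include <time.h>

/*
 * Hypothetical helper, for illustration only.
 *
 * Map `len` bytes anonymously, write one byte every `stride` bytes so
 * that each page is faulted in (and zeroed by the kernel), and return
 * the fault-in rate in GB/s.
 *
 * `extra_flags` is 0 for base pages or MAP_HUGETLB for hugetlb pages;
 * `stride` should be the page size in use. Returns a negative value on
 * bad arguments or if mmap() fails (e.g. no huge pages reserved).
 */
double fault_in_gbps(size_t len, size_t stride, int extra_flags)
{
	struct timespec t0, t1;
	volatile char *p;
	void *map;
	double secs;

	if (len == 0 || stride == 0)
		return -1.0;

	map = mmap(NULL, len, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS | extra_flags, -1, 0);
	if (map == MAP_FAILED)
		return -1.0;
	p = map;

	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (size_t off = 0; off < len; off += stride)
		p[off] = 1;	/* first touch triggers the page fault */
	clock_gettime(CLOCK_MONOTONIC, &t1);

	munmap(map, len);

	secs = (double)(t1.tv_sec - t0.tv_sec) +
	       (double)(t1.tv_nsec - t0.tv_nsec) / 1e9;
	if (secs <= 0.0)
		return 0.0;	/* timer too coarse to measure */

	return ((double)len / 1e9) / secs;
}
```

A caller might use fault_in_gbps(1UL << 30, 2UL << 20, MAP_HUGETLB) for
2MB pages; a negative return means the mapping could not be created.
Selecting 1GB pages would additionally need MAP_HUGE_1GB in the flags.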

The x86-64-stosq routine in perf bench mem memset
(arch/x86/lib/memset_64.S::__memset) should have the same performance
characteristics, but it allocates its buffer with malloc().

For this workload we want to control the allocation path as well. Let me
see if it makes sense to extend perf bench mem memset to optionally
allocate via mmap(MAP_HUGETLB), or to add a new workload under
perf bench mem that does that.

Thanks for the review!

--
ankur