Re: [PATCH v9 4/7] x86/mm: Simplify clear_page_*
From: Ankur Arora
Date: Thu Nov 27 2025 - 00:30:46 EST
Mateusz Guzik <mjguzik@xxxxxxxxx> writes:
> On Fri, Nov 21, 2025 at 9:24 PM Ankur Arora <ankur.a.arora@xxxxxxxxxx> wrote:
>> + * Switch between three implementations of page clearing based on CPU
>> + * capabilities:
>> + *
>> + * - __clear_pages_unrolled(): the oldest, slowest and universally
>> + * supported method. Zeroes via 8-byte MOV instructions unrolled 8x
>> + * to write a 64-byte cacheline in each loop iteration.
>> + *
>> + * - "REP; STOSQ": really old CPUs had crummy REP implementations.
>> + * Vendor CPU setup code sets 'REP_GOOD' on CPUs where REP can be
>> + * trusted. The instruction writes 8 bytes per REP iteration but
>> + * CPUs can internally batch these together and do larger writes.
>> + *
>> + * - "REP; STOSB": CPUs that enumerate 'ERMS' have an improved STOS
>> + * implementation that is less picky about alignment and where
>> + * STOSB (1-byte at a time) is actually faster than STOSQ (8-bytes
>> + * at a time.)
>> + *
>
> I think this is somewhat odd commentary in this context.
>
> Note about "crummy REP implementations" should be in description of
> __clear_pages_unrolled as it justifies its existence (I think the
> routine would be best whacked btw, but I'm not going to argue about it
> in this thread).
> Description of STOSQ notes the CPU can do more than 8 bytes at a time,
> while the description of STOSB makes no such clarification.
> At the same time the note about being less picky about alignment has no
> significance in the context of page clearing as the buffers are, well,
> page aligned.
Good point. I'll rework the comment a little bit to align things better
(maybe reusing some of what you suggest below).
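
For reference, the preference order being discussed can be sketched in
plain userspace C. This is only an illustration, not the kernel code:
pick_clear_variant(), clear_page_unrolled() and the enum names are
hypothetical, and the 64-byte cacheline size is an assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names for the three variants described in the comment. */
enum clear_variant {
	CLEAR_UNROLLED,		/* 8-byte MOVs, unrolled 8x */
	CLEAR_REP_STOSQ,	/* REP_GOOD set, ERMS not enumerated */
	CLEAR_REP_STOSB,	/* ERMS enumerated */
};

/*
 * Hypothetical helper: ERMS wins when present, then REP_GOOD,
 * otherwise fall back to the unrolled loop.
 */
static enum clear_variant pick_clear_variant(bool rep_good, bool erms)
{
	if (erms)
		return CLEAR_REP_STOSB;
	if (rep_good)
		return CLEAR_REP_STOSQ;
	return CLEAR_UNROLLED;
}

/*
 * Portable stand-in for the unrolled variant: eight 8-byte stores,
 * i.e. one 64-byte cacheline (assumed size) per loop iteration.
 */
static void clear_page_unrolled(void *page, size_t len)
{
	uint64_t *p = page;

	assert(len % 64 == 0);
	for (size_t i = 0; i < len / 64; i++, p += 8) {
		p[0] = 0; p[1] = 0; p[2] = 0; p[3] = 0;
		p[4] = 0; p[5] = 0; p[6] = 0; p[7] = 0;
	}
}
```

A guest where the hypervisor filters out ERMS would correspond to
pick_clear_variant(true, false), i.e. the STOSQ path.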
> There is a fucky real-world problem with ERMS worth noting: there are
> hypervisor setups out there which *hide* the bit by default (no
> really, see Proxmox for example -- you get a bare bones pre-ERMS
> cpuid)
>
> With all this in mind, modulo poor grammar on my end, I would suggest
> something like this:
>
> <quote>
> There are 3 variants implemented:
> - REP; STOSB: used if the CPU supports "Enhanced REP MOVSB/STOSB" (aka
> ERMS), which is true for the majority of microarchitectures today
> - REP; STOSQ: fallback if the ERMS bit is not present
> - __clear_pages_unrolled: code for CPUs which are determined to have
> poor REP support, only concerns long obsolete uarchs.
>
> Warnings: some hypervisors are configured to expose a very limited set
> of capabilities in the guest, filtering out ERMS even if present. As
> such the STOSQ variant is still in active use on some setups even when
> hardware does not need it.
> </quote>
The last bit is useful context though maybe some of it fits better in
the commit message.
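
One way to see which case a given guest falls into is to look for the
"erms" and "rep_good" tokens on the flags line of /proc/cpuinfo. A
minimal sketch of the token check, assuming the caller has already read
that line into a string (cpuinfo_has_flag() is a hypothetical helper,
not an existing API):

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/*
 * Return true if 'flag' appears as a whole whitespace-delimited token
 * in 'flags', e.g. a "flags : fpu vme ..." line from /proc/cpuinfo.
 */
static bool cpuinfo_has_flag(const char *flags, const char *flag)
{
	size_t n;
	const char *p;

	assert(flags != NULL && flag != NULL);
	n = strlen(flag);
	if (n == 0)
		return false;

	for (p = flags; (p = strstr(p, flag)) != NULL; p++) {
		/* Reject substring hits such as "erms" inside a longer token. */
		bool starts = p == flags || p[-1] == ' ' ||
			      p[-1] == '\t' || p[-1] == ':';
		bool ends = p[n] == '\0' || p[n] == ' ' ||
			    p[n] == '\t' || p[n] == '\n';

		if (starts && ends)
			return true;
	}
	return false;
}
```

A guest showing rep_good without erms on hardware known to support ERMS
would be the filtered-CPUID setup described above.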
Thanks
ankur