Re: [PATCH v3 4/4] x86/folio_zero_user: multi-page clearing

From: Ankur Arora
Date: Mon Apr 14 2025 - 17:22:32 EST



Ingo Molnar <mingo@xxxxxxxxxx> writes:

> * Ankur Arora <ankur.a.arora@xxxxxxxxxx> wrote:
>
>> clear_pages_rep(), clear_pages_erms() use string instructions to zero
>> memory. When operating on more than a single page, we can use these
>> more effectively by explicitly advertising the region-size to the
>> processor, which can use that as a hint to optimize the clearing
>> (ex. by eliding cacheline allocation.)
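
[ To make the mechanism concrete: instead of issuing one rep stosq per
  page, the whole extent is handed to a single string instruction so
  the CPU sees the full length up front. A minimal sketch, not the
  actual patch code (the real clear_pages() selects between the
  clear_pages_rep()/clear_pages_erms() alternatives):

  /*
   * Sketch only: zero @npages pages starting at kernel address @addr.
   * rep stosq counts in quadwords, hence the division by 8.
   */
  static inline void clear_pages_sketch(void *addr, unsigned long npages)
  {
          unsigned long qwords = npages * PAGE_SIZE / 8;

          asm volatile("rep stosq"
                       : "+D" (addr), "+c" (qwords)
                       : "a" (0UL)
                       : "memory");
  }
]
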
>
>> +#ifndef CONFIG_HIGHMEM
>> +/*
>> + * folio_zero_user_preemptible(): multi-page clearing variant of folio_zero_user().
>> + *
>> + * Taking inspiration from the common code variant, we split the zeroing in
>> + * three parts: left of the fault, right of the fault, and up to 5 pages
>> + * in the immediate neighbourhood of the target page.
>> + *
>> + * Cleared in that order to keep cache lines of the target region hot.
>> + *
>> + * For gigantic pages, there is no expectation of cache locality so just do a
>> + * straight zero.
>> + */
>> +void folio_zero_user_preemptible(struct folio *folio, unsigned long addr_hint)
>> +{
>> +        unsigned long base_addr = ALIGN_DOWN(addr_hint, folio_size(folio));
>> +        const long fault_idx = (addr_hint - base_addr) / PAGE_SIZE;
>> +        const struct range pg = DEFINE_RANGE(0, folio_nr_pages(folio) - 1);
>> +        int width = 2; /* pages cleared last on either side */
>> +        struct range r[3];
>> +        int i;
>> +
>> +        if (folio_nr_pages(folio) > MAX_ORDER_NR_PAGES) {
>> +                clear_pages(page_address(folio_page(folio, 0)), folio_nr_pages(folio));
>
>> +                clear_pages(page_address(folio_page(folio, r[i].start)), len);
>
> So the _user postfix naming is super confusing here and elsewhere in
> this series.

The problem is that the _user suffix comes from the MM interface
naming, which has to accommodate architectures where clearing a
user-facing page involves more than just zeroing its kernel mapping
(the chain being folio_zero_user() -> clear_user_highpage() ->
clear_user_page(), with x86 mapping the last one straight to
clear_page()).

> clear_page(), and by extension the clear_pages() interface you extended
> it to, fundamentally only works on kernel addresses:

Agreed.

> /*
>  * Zero a page.
>  * %rdi - page
>  */
> SYM_TYPED_FUNC_START(clear_page_rep)
>         movl $4096/8,%ecx
>         xorl %eax,%eax
>         rep stosq
>         RET
>
> Note the absolute lack of fault & exception handling.

Yeah. And, as you imply, that is safe because folio_zero_user() (and
this path) is only called after the range has been validated.

> But folio_zero_user*() uses the kernel-space variants of page clearing
> AFAICT (contrary to the naming):
>
> void folio_zero_user(struct folio *folio, unsigned long addr_hint)
> {
>         unsigned int nr_pages = folio_nr_pages(folio);
>
>         if (unlikely(nr_pages > MAX_ORDER_NR_PAGES))
>                 clear_gigantic_page(folio, addr_hint, nr_pages);
>         else
>                 process_huge_page(addr_hint, nr_pages, clear_subpage, folio);
> }
>
>
> static void clear_gigantic_page(struct folio *folio, unsigned long addr_hint,
>                                 unsigned int nr_pages)
> {
>         unsigned long addr = ALIGN_DOWN(addr_hint, folio_size(folio));
>         int i;
>
>         might_sleep();
>         for (i = 0; i < nr_pages; i++) {
>                 cond_resched();
>                 clear_user_highpage(folio_page(folio, i), addr + i * PAGE_SIZE);
>         }
> }
>
> Which on x86 is simply mapped into a kernel-memory interface:
>
> static inline void clear_user_page(void *page, unsigned long vaddr,
>                                    struct page *pg)
> {
>         clear_page(page);
> }
>
> So at minimum this is a misnomer and a confusing mixture of user/kernel
> interface names on an epic scale that TBH should be cleaned up first
> before extended...

I think a comment to head off this confusion is definitely warranted.
As for the mixture of names, I'm not sure how to avoid it. For
instance, see arch/arc/mm/cache.c::clear_user_page():

void clear_user_page(void *to, unsigned long u_vaddr, struct page *page)
{
        struct folio *folio = page_folio(page);
        clear_page(to);
        clear_bit(PG_dc_clean, &folio->flags);
}

arch/arm also does a bunch of stuff that made my head hurt, but the arc
version is clearly different enough from a plain clear_page().

>> +out:
>> +        /* Explicitly invoke cond_resched() to handle any live patching necessary. */
>> +        cond_resched();
>
> What again?

Yeah, I can see how this looks out of place :). The idea was that even
though we don't need explicit invocations of cond_resched() for
rescheduling (this path is only taken when preemptible), we still need
one because cond_resched() is also overloaded to provide a transition
point for live patching.

Anyway, this comment can go away based on your suggestion elsewhere
(extensions for the cooperative preemption models).
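
[ For the cooperative preemption models that would look something like
  the following -- just a sketch, with an arbitrary chunk size and a
  hypothetical helper name:

  /*
   * Sketch only: clear a large extent in fixed-size chunks with an
   * explicit rescheduling point between chunks.
   */
  #define CLEAR_PAGES_CHUNK 32 /* arbitrary */

  static void clear_pages_resched(struct folio *folio, unsigned int nr_pages)
  {
          unsigned int i, n;

          for (i = 0; i < nr_pages; i += n) {
                  n = min_t(unsigned int, nr_pages - i, CLEAR_PAGES_CHUNK);
                  clear_pages(page_address(folio_page(folio, i)), n);
                  cond_resched();
          }
  }
]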

Thanks for the detailed review.

--
ankur