Re: [PATCH v3] mm/page_alloc: replace kernel_init_pages() with batch page clearing

From: Salunke, Hrushikesh

Date: Thu Apr 23 2026 - 01:09:21 EST

On 22-04-2026 23:55, David Hildenbrand (Arm) wrote:
> On 4/22/26 12:26, Hrushikesh Salunke wrote:
>> When init_on_alloc is enabled, kernel_init_pages() clears every page
>> one at a time via clear_highpage_kasan_tagged(), which incurs per-page
>> kmap_local_page()/kunmap_local() overhead and prevents the architecture
>> clearing primitive from operating on contiguous ranges.
>>
>> Introduce clear_highpages_kasan_tagged() in highmem.h, a batch
>> clearing helper that calls clear_pages() for the full contiguous range
>> on !HIGHMEM systems, bypassing the per-page kmap overhead and allowing
>> a single invocation of the arch clearing primitive across the entire
>> allocation. The HIGHMEM path falls back to per-page clearing since
>> those pages require kmap.
>>
>> Replace kernel_init_pages() with direct calls to the new helper, as it
>> becomes a trivial wrapper.
>>
>> Allocating 8192 x 2MB HugeTLB pages (16GB) with init_on_alloc=1:
>>
>> Before: 0.445s
>> After: 0.166s (-62.7%, 2.68x faster)
>>
>> Kernel time (sys) reduction per workload with init_on_alloc=1:
>>
>> Workload            Before       After        Change
>> Graph500 64C128T    30m 41.8s    15m 14.8s    -50.3%
>> Graph500 16C32T     15m 56.7s     9m 43.7s    -39.0%
>> Pagerank 32T         1m 58.5s     1m 12.8s    -38.5%
>> Pagerank 128T        2m 36.3s     1m 40.4s    -35.7%
> We do have some elaborate handling in clear_contig_highpages() to chunk it up
> (and to call cond_resched()). But that function can get called with much bigger
> ranges.
>
> I'm not concerned about the cond_resched() -- we wouldn't do one here before --
> but I'm wondering whether we could end up triggering a HW instruction that is
> uninterruptible and takes a rather long time.
>
> But clear_contig_highpages() breaks it into 32MiB chunks, and only x86 supports
> it so far. So we won't exceed that with the maximum buddy order of 4MiB on x86.
>
> Acked-by: David Hildenbrand (Arm) <david@xxxxxxxxxx>
>
> --
> Cheers,
>
> David

Right, on x86 the maximum buddy order (4MiB) keeps a single clearing call
well below the 32MiB chunk size, so we stay within safe limits.

Also, rep stosb/stosq, which x86 currently uses for clearing, is
interruptible: the CPU can take interrupts between iterations and
resume where it left off. So even for larger ranges it would not be a
single uninterruptible operation. Other architectures clear with a
per-page loop, so the same applies there.
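
To make the call-count argument concrete, here is a minimal userspace
sketch, not the kernel code. All model_* names are hypothetical, memset()
stands in for the arch clearing primitive, and the 4KiB page size is an
assumption. The 32MiB chunk size is the figure David mentions above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define MODEL_PAGE_SIZE   4096UL                  /* assumed page size */
#define MODEL_CHUNK_BYTES (32UL * 1024 * 1024)    /* 32MiB chunk from the thread */

/* Number of times the stand-in clearing primitive has been invoked. */
static size_t model_clear_calls;

/* Stand-in for the arch clearing primitive (hypothetical). */
static void model_clear(void *addr, size_t len)
{
	memset(addr, 0, len);
	model_clear_calls++;
}

/* Per-page clearing: one primitive call for every page. */
static void model_clear_per_page(unsigned char *base, size_t npages)
{
	for (size_t i = 0; i < npages; i++)
		model_clear(base + i * MODEL_PAGE_SIZE, MODEL_PAGE_SIZE);
}

/* Batched clearing: one primitive call per chunk of at most MODEL_CHUNK_BYTES. */
static void model_clear_batched(unsigned char *base, size_t npages)
{
	size_t remaining = npages * MODEL_PAGE_SIZE;

	while (remaining) {
		size_t len = remaining < MODEL_CHUNK_BYTES ? remaining
							   : MODEL_CHUNK_BYTES;

		model_clear(base, len);
		base += len;
		remaining -= len;
	}
}

/* Returns 1 if every byte in the buffer is zero, 0 otherwise. */
static int model_is_zeroed(const unsigned char *buf, size_t len)
{
	for (size_t i = 0; i < len; i++)
		if (buf[i])
			return 0;
	return 1;
}

/*
 * Clear the same npages range both ways and report how many primitive
 * calls each path made. Returns 0 on success, -1 on allocation failure
 * or if either path left a non-zero byte behind.
 */
static int model_compare(size_t npages, size_t *per_page_calls,
			 size_t *batched_calls)
{
	size_t len = npages * MODEL_PAGE_SIZE;
	unsigned char *buf = malloc(len);
	int ok;

	if (!buf)
		return -1;

	memset(buf, 0xaa, len);
	model_clear_calls = 0;
	model_clear_per_page(buf, npages);
	*per_page_calls = model_clear_calls;
	ok = model_is_zeroed(buf, len);

	memset(buf, 0xaa, len);
	model_clear_calls = 0;
	model_clear_batched(buf, npages);
	*batched_calls = model_clear_calls;
	ok = ok && model_is_zeroed(buf, len);

	free(buf);
	return ok ? 0 : -1;
}
```

In this model, model_compare(1024, ...) covers a 4MiB range: the per-page
path makes 1024 primitive calls, while the batched path makes one, since
4MiB fits inside a single 32MiB chunk.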

Thanks for the Ack!

Regards,
Hrushikesh