Re: [PATCH] mm/page_alloc: use batch page clearing in kernel_init_pages()

From: Salunke, Hrushikesh

Date: Thu Apr 09 2026 - 05:01:40 EST

On 08-04-2026 21:02, Andrew Morton wrote:

> On Wed, 8 Apr 2026 16:14:03 +0530 "Salunke, Hrushikesh" <hsalunke@xxxxxxx> wrote:
>
>> kernel_init_pages() runs inside the allocator (post_alloc_hook and
>> __free_pages_prepare), so it inherits whatever context the caller is in.
>> Testing with CONFIG_DEBUG_ATOMIC_SLEEP=y and CONFIG_PROVE_LOCKING=y, I
>> hit this during exit_group() -> exit_mmap() -> __zap_vma_range, where a
>> page allocation happens while the PTE lock and RCU read lock are held,
>> making the cond_resched() in the clearing loop illegal:
>>
>> [ 1997.353228] BUG: sleeping function called from invalid context at mm/page_alloc.c:1235
>> [ 1997.353433] in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 19725, name: bash
>> [ 1997.353572] preempt_count: 1, expected: 0
>> [ 1997.353706] RCU nest depth: 1, expected: 0
>> [ 1997.353837] 3 locks held by bash/19725:
>> [ 1997.353839] #0: ff38cd415971e540 (&mm->mmap_lock){++++}-{4:4}, at: exit_mmap+0x6e/0x430
>> [ 1997.353850] #1: ffffffffb03d6f60 (rcu_read_lock){....}-{1:3}, at: __pte_offset_map+0x2c/0x220
>> [ 1997.353855] #2: ff38cd410deb4618 (ptlock_ptr(ptdesc)#2){+.+.}-{3:3}, at: pte_offset_map_lock+0x92/0x170
>> [ 1997.353868] Call Trace:
>> [ 1997.353870] <TASK>
>> [ 1997.353873] dump_stack_lvl+0x91/0xb0
>> [ 1997.353877] __might_resched+0x15f/0x290
>> [ 1997.353882] kernel_init_pages+0x4b/0xa0
>> [ 1997.353886] get_page_from_freelist+0x406/0x1e60
>> [ 1997.353895] __alloc_frozen_pages_noprof+0x1d8/0x1730
>> [ 1997.353912] alloc_pages_mpol+0xa4/0x190
>> [ 1997.353917] alloc_pages_noprof+0x59/0xd0
>> [ 1997.353919] get_free_pages_noprof+0x11/0x40
>> [ 1997.353921] __tlb_remove_folio_pages_size.isra.0+0x7f/0xe0
>> [ 1997.353923] __zap_vma_range+0x1bbd/0x1f40
>> [ 1997.353931] unmap_vmas+0xd9/0x1d0
>> [ 1997.353934] exit_mmap+0x10a/0x430
>> [ 1997.353943] __mmput+0x3d/0x130
>> [ 1997.353947] do_exit+0x2a7/0xae0
> tlb_next_batch() is (fortunately) using GFP_NOWAIT. Perhaps you can
> alter your patch to not call the cond_resched() if caller is attempting
> an atomic allocation.

Thanks Vlastimil, David, Andrew, and Raghu for the reviews.

After looking into this more, I think adding cond_resched() here was
overkill. I agree that dropping cond_resched() and
PROCESS_PAGES_NON_PREEMPT_BATCH entirely and just calling clear_pages()
is the right approach. There's no case where cond_resched() in
kernel_init_pages() is both necessary and safe:

- It's unsafe in atomic context, as the BUG shows (tlb_next_batch()
  allocates under the PTE lock + RCU read lock via GFP_NOWAIT).
- It's unnecessary for common allocations (order-0, mTHP, 2 MiB), which
  clear in well under 1 ms.
- For 1 GiB hugepages, kernel_init_pages() only runs during the
  initial admin-triggered allocation. When processes later fault on
  those pages, clearing goes through folio_zero_user() ->
  clear_contig_highpages(), not kernel_init_pages().
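
As a rough back-of-the-envelope check on the sizes above, here is a small userspace sketch. The 10 GiB/s clearing bandwidth is an assumed, illustrative figure (not a measurement from this thread), and sketch_clear_time_ms() is a hypothetical helper, not a kernel function:

```c
/*
 * Estimated time in milliseconds to clear `bytes` of memory.
 * gib_per_sec is an ASSUMED clearing bandwidth, not a measured value.
 */
static double sketch_clear_time_ms(unsigned long long bytes, double gib_per_sec)
{
	const double gib = 1024.0 * 1024.0 * 1024.0;

	return (double)bytes / (gib_per_sec * gib) * 1000.0;
}
```

With the assumed 10 GiB/s, sketch_clear_time_ms(2ULL << 20, 10.0) comes out at roughly 0.2 ms and sketch_clear_time_ms(1ULL << 30, 10.0) at roughly 100 ms. The absolute numbers shift with the real bandwidth, but the ratio between the two cases does not.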

So rather than guarding cond_resched() with GFP flags (as Andrew
suggested), I'll remove it entirely in v2. That keeps things simple and
preserves the scheduling characteristics of the original code, while
still getting the performance benefit of batch clearing.
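
To illustrate the shape I have in mind, here is a simplified userspace model, not the actual patch. It leaves out the KASAN and highmem handling, the 4 KiB page size is an assumption, and all sketch_* names are hypothetical stand-ins (sketch_clear_pages() models clear_pages() with memset()):

```c
#include <stddef.h>
#include <string.h>

/* Assumed page size for this userspace model. */
#define SKETCH_PAGE_SIZE 4096u

/* Original shape: clear one page per loop iteration. */
static void sketch_init_pages_per_page(unsigned char *addr, size_t npages)
{
	size_t i;

	for (i = 0; i < npages; i++)
		memset(addr + i * SKETCH_PAGE_SIZE, 0, SKETCH_PAGE_SIZE);
}

/* Hypothetical stand-in for clear_pages(): clear a contiguous range at once. */
static void sketch_clear_pages(unsigned char *addr, size_t npages)
{
	memset(addr, 0, npages * SKETCH_PAGE_SIZE);
}

/* v2 shape: one batched clear and no cond_resched(), so nothing can sleep. */
static void sketch_init_pages_batched(unsigned char *addr, size_t npages)
{
	sketch_clear_pages(addr, npages);
}

/* Return 1 if every byte in [addr, addr + len) is zero, else 0. */
static int sketch_all_zero(const unsigned char *addr, size_t len)
{
	size_t i;

	for (i = 0; i < len; i++)
		if (addr[i] != 0)
			return 0;
	return 1;
}
```

Both variants leave the range fully zeroed. The only difference is that the batched one hands the whole contiguous range to a single clearing call, and neither contains a scheduling point.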

Regarding the 512 MiB arm64 case that David mentioned: is the stall from
clearing that much memory without cond_resched() under PREEMPT_NONE
acceptable, or should it be handled differently?

I can introduce clear_highpages_kasan_tagged() / clear_highpages()
helpers, or keep v2 minimal with the logic inline in
kernel_init_pages(). Any preference?

I'll test v2 across preempt=none,voluntary,full,auto with
init_on_alloc=1 and CONFIG_DEBUG_ATOMIC_SLEEP=y before sending.

Regards,
Hrushikesh