Re: [RFC v2 00/21] mm: thp: lazy PTE page table allocation at PMD split

From: Usama Arif

Date: Fri Feb 27 2026 - 06:14:19 EST

On 26/02/2026 21:01, Nico Pache wrote:
> On Thu, Feb 26, 2026 at 4:33 AM Usama Arif <usama.arif@xxxxxxxxx> wrote:
>>
>> When the kernel creates a PMD-level THP mapping for anonymous pages, it
>> pre-allocates a PTE page table via pgtable_trans_huge_deposit(). This
>> page table sits unused in a deposit list for the lifetime of the THP
>> mapping, only to be withdrawn when the PMD is split or zapped. Every
>> anonymous THP therefore wastes 4KB of memory unconditionally. On large
>> servers where hundreds of gigabytes of memory are mapped as THPs, this
>> adds up: roughly 200MB wasted per 100GB of THP memory. This memory
>> could otherwise satisfy other allocations, including the very PTE page
>> table allocations needed when splits eventually occur.
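
To make the arithmetic above concrete, here is a rough userspace sketch.
It assumes a 2MB PMD-sized THP and a 4KB PTE page table (as on x86-64
with 4KB base pages) and binary units. The macro and function names are
made up for illustration and are not kernel identifiers.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed sizes: x86-64 with 4KB base pages. Hypothetical names. */
#define MODEL_PTE_TABLE_BYTES (4ULL << 10)   /* one deposited PTE page table */
#define MODEL_PMD_BYTES       (2ULL << 20)   /* memory covered by one PMD-level THP */

/* Bytes held in pre-deposited PTE page tables for a given amount of THP memory. */
static uint64_t deposit_overhead_bytes(uint64_t thp_mapped_bytes)
{
	/* The model only handles whole THPs. */
	assert(thp_mapped_bytes % MODEL_PMD_BYTES == 0);

	uint64_t nr_thps = thp_mapped_bytes / MODEL_PMD_BYTES;

	return nr_thps * MODEL_PTE_TABLE_BYTES;
}
```

With these assumptions, 100GB of THP memory is 51200 THPs, and 51200
deposited 4KB tables come to 200MB.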
>>
>> This series removes the pre-deposit and allocates the PTE page table
>> lazily — only when a PMD split actually happens. Since a large number
>> of THPs are never split (they are zapped wholesale when processes exit or
>> munmap the full range), the allocation is avoided entirely in the common
>> case.
>>
>> The pre-deposit pattern exists because split_huge_pmd was designed as an
>> operation that must never fail: if the kernel decides to split, it needs
>> a PTE page table, so one is deposited in advance. But "must never fail"
>> is an unnecessarily strong requirement. A PMD split is typically triggered
>> by a partial operation on a sub-PMD range — partial munmap, partial
>> mprotect, partial mremap and so on. Most of these operations already
>> have well-defined error handling for allocation failures (e.g., -ENOMEM,
>> VM_FAULT_OOM). Allowing split to fail and propagating the error through
>> these existing paths is the natural thing to do. Furthermore, a split can
>> fail only if an order-0 allocation for a page table fails, which is
>> extremely unlikely.
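
The control flow described above can be sketched in a small userspace
model. This is not code from the series: every struct and function name
below is hypothetical, and calloc() stands in for the order-0 page table
allocation. It only shows the shape of "allocate at split time, leave the
mapping untouched on failure, and let the caller return -ENOMEM".

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical model of one PMD entry; not a kernel data structure. */
struct pmd_model {
	bool is_huge;      /* true while mapped as a single PMD-level THP */
	void *pte_table;   /* allocated lazily, only when a split happens */
};

/* Test hook: make the next allocation fail, to simulate memory pressure. */
static bool fail_next_alloc;

static void *alloc_pte_table_model(void)
{
	if (fail_next_alloc) {
		fail_next_alloc = false;
		return NULL;
	}
	return calloc(512, sizeof(unsigned long));
}

/* Split can now fail; on failure the PMD is left exactly as it was. */
static int split_pmd_model(struct pmd_model *pmd)
{
	void *table;

	assert(pmd != NULL);
	if (!pmd->is_huge)
		return 0;               /* already split, nothing to do */

	table = alloc_pte_table_model();
	if (!table)
		return -ENOMEM;         /* no state was changed */

	pmd->pte_table = table;
	pmd->is_huge = false;
	return 0;
}

/* A partial operation (think partial munmap) propagates the split error. */
static int partial_unmap_model(struct pmd_model *pmd)
{
	int ret = split_pmd_model(pmd);

	if (ret)
		return ret;
	/* The sub-PMD range would be unmapped here. */
	return 0;
}
```

The caller sees an ordinary -ENOMEM and can retry later; the THP mapping
is still intact after a failed attempt.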
>>
>> Designing functions like split_huge_pmd as operations that cannot fail
>> has a subtle but real cost to code quality. It forces a pre-allocation
>> pattern: every THP creation path must deposit a page table, and every
>> split or zap path must withdraw one, creating a hidden coupling between
>> widely separated code paths.
>>
>> This also serves as a code cleanup. On every architecture except powerpc
>> with hash MMU, the deposit/withdraw machinery becomes dead code. The
>> series removes the generic implementations in pgtable-generic.c and the
>> s390/sparc overrides, replacing them with no-op stubs guarded by
>> arch_needs_pgtable_deposit(), which evaluates to false at compile time
>> on all non-powerpc architectures.
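
As an illustration of the guard pattern described above, here is a
minimal userspace sketch. The model_* names and the
MODEL_ARCH_NEEDS_DEPOSIT macro are invented for this example and do not
reflect the actual kernel definitions; the point is only that a guard
which is a compile-time constant false lets the compiler discard the
deposit path.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-arch switch; an arch that needs deposits would set it to 1. */
#ifndef MODEL_ARCH_NEEDS_DEPOSIT
#define MODEL_ARCH_NEEDS_DEPOSIT 0
#endif

/* Stand-in for the guard: constant, so callers' branches fold at compile time. */
static inline bool model_arch_needs_pgtable_deposit(void)
{
	return MODEL_ARCH_NEEDS_DEPOSIT;
}

static int model_deposits_made;   /* counts how many tables were deposited */

static void model_deposit(void *table)
{
	(void)table;
	model_deposits_made++;
}

/* THP creation path: deposits only where the arch requires it. */
static void model_map_thp(void *table)
{
	if (model_arch_needs_pgtable_deposit())
		model_deposit(table);   /* dead code when the guard is 0 */
}
```

With the default of 0, calling model_map_thp() never reaches
model_deposit(), which mirrors the intent of the no-op stubs on
architectures that do not need a deposited table.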
>
> Hi Usama,
>
> Thanks for tackling this, it seems like an interesting problem. I'm
> trying to get more into reviewing, so bear with me; I may have some
> stupid comments or questions. Where I can really help out is with
> testing. I will build this for all RH-supported architectures and run
> some automated test suites and performance metrics. I'll report back
> if I spot anything.
>
> Cheers!
> -- Nico
>

Thanks for building this and for looking into reviewing it. All comments
and questions are welcome! I had only tested on x86, and I had a look
at the link you shared, so it's great to know that powerpc and s390 are fine.