Re: [RFC v2 00/21] mm: thp: lazy PTE page table allocation at PMD split

From: Nico Pache

Date: Fri Feb 27 2026 - 19:07:35 EST

On Fri, Feb 27, 2026 at 4:14 AM Usama Arif <usama.arif@xxxxxxxxx> wrote:
>
>
>
> On 26/02/2026 21:01, Nico Pache wrote:
> > On Thu, Feb 26, 2026 at 4:33 AM Usama Arif <usama.arif@xxxxxxxxx> wrote:
> >>
> >> When the kernel creates a PMD-level THP mapping for anonymous pages, it
> >> pre-allocates a PTE page table via pgtable_trans_huge_deposit(). This
> >> page table sits unused in a deposit list for the lifetime of the THP
> >> mapping, only to be withdrawn when the PMD is split or zapped. Every
> >> anonymous THP therefore wastes 4KB of memory unconditionally. On large
> >> servers where hundreds of gigabytes of memory are mapped as THPs, this
> >> adds up: roughly 200MB wasted per 100GB of THP memory. This memory
> >> could otherwise satisfy other allocations, including the very PTE page
> >> table allocations needed when splits eventually occur.
> >>
> >> This series removes the pre-deposit and allocates the PTE page table
> >> lazily — only when a PMD split actually happens. Since a large number
> >> of THPs are never split (they are zapped wholesale when processes exit or
> >> munmap the full range), the allocation is avoided entirely in the common
> >> case.
> >>
> >> The pre-deposit pattern exists because split_huge_pmd was designed as an
> >> operation that must never fail: if the kernel decides to split, it needs
> >> a PTE page table, so one is deposited in advance. But "must never fail"
> >> is an unnecessarily strong requirement. A PMD split is typically triggered
> >> by a partial operation on a sub-PMD range — partial munmap, partial
> >> mprotect, partial mremap and so on.
> >> Most of these operations already have well-defined error handling for
> >> allocation failures (e.g., -ENOMEM, VM_FAULT_OOM). Allowing split to
> >> fail and propagating the error through these existing paths is the natural
> >> thing to do. Furthermore, split failing requires an order-0 allocation for
> >> a page table to fail, which is extremely unlikely.
> >>
> >> Designing functions like split_huge_pmd as operations that cannot fail
> >> has a subtle but real cost to code quality. It forces a pre-allocation
> >> pattern - every THP creation path must deposit a page table, and every
> >> split or zap path must withdraw one, creating a hidden coupling between
> >> widely separated code paths.
> >>
> >> This also serves as a code cleanup. On every architecture except powerpc
> >> with hash MMU, the deposit/withdraw machinery becomes dead code. The
> >> series removes the generic implementations in pgtable-generic.c and the
> >> s390/sparc overrides, replacing them with no-op stubs guarded by
> >> arch_needs_pgtable_deposit(), which evaluates to false at compile time
> >> on all non-powerpc architectures.
> >
> > Hi Usama,
> >
> > Thanks for tackling this, it seems like an interesting problem. I'm
> > trying to get more into reviewing, so bear with me; I may have some
> > stupid comments or questions. Where I can really help out is with
> > testing. I will build this for all RH-supported architectures and run
> > some automated test suites and performance metrics. I'll report back
> > if I spot anything.
> >
> > Cheers!
> > -- Nico
> >
>
> Thanks for the build and looking into reviewing this. All comments
> and questions are welcome! I had only tested on x86, and I had a look
> at the link you shared so it's great to know that PowerPC and s390 are fine.

Good news: as you noted, all the builds succeeded, and the sanity tests
don't show any signs of an immediate issue across the architectures.
I'll move on to debug kernels next, and then performance testing. I will
try to start reviewing the actual code changes in depth next week :)

Cheers,
-- Nico
