Re: [v3 00/24] mm: thp: lazy PTE page table allocation at PMD split time

From: Usama Arif

Date: Wed Apr 08 2026 - 11:06:46 EST




On 06/04/2026 00:34, Hugh Dickins wrote:
> On Thu, 26 Mar 2026, Usama Arif wrote:
>
>> When the kernel creates a PMD-level THP mapping for anonymous pages, it
>> pre-allocates a PTE page table via pgtable_trans_huge_deposit(). This
>> page table sits unused in a deposit list for the lifetime of the THP
>> mapping, only to be withdrawn when the PMD is split or zapped. Every
>> anonymous THP therefore wastes 4KB of memory unconditionally. On large
>> servers where hundreds of gigabytes of memory are mapped as THPs, this
>> adds up: roughly 200MB wasted per 100GB of THP memory. This memory
>> could otherwise satisfy other allocations, including the very PTE page
>> table allocations needed when splits eventually occur.
>>
>> This series removes the pre-deposit and allocates the PTE page table
>> lazily — only when a PMD split actually happens. Since a large number
>> of THPs are never split (they are zapped wholesale when processes exit or
>> munmap the full range), the allocation is avoided entirely in the common
>> case.
>>
>> The pre-deposit pattern exists because split_huge_pmd was designed as an
>> operation that must never fail: if the kernel decides to split, it needs
>> a PTE page table, so one is deposited in advance. But "must never fail"
>> is an unnecessarily strong requirement. A PMD split is typically triggered
>> by a partial operation on a sub-PMD range — partial munmap, partial
>> mprotect, COW on a pinned folio, GUP with FOLL_SPLIT_PMD, and similar.
>> All of these operations already have well-defined error handling for
>> allocation failures (e.g., -ENOMEM, VM_FAULT_OOM). Allowing split to
>> fail and propagating the error through these existing paths is the natural
>> thing to do. Furthermore, if the system cannot satisfy a single order-0
>> allocation for a page table, it is under extreme memory pressure and
>> failing the operation is the correct response.
>>
>> Designing functions like split_huge_pmd as operations that cannot fail
>> has a subtle but real cost to code quality. It forces a pre-allocation
>> pattern: every THP creation path must deposit a page table, and every
>> split or zap path must withdraw one, creating a hidden coupling between
>> widely separated code paths.
>>
>> This also serves as a code cleanup. On every architecture except powerpc
>> with hash MMU, the deposit/withdraw machinery becomes dead code. The
>> series removes the generic implementations in pgtable-generic.c and the
>> s390/sparc overrides, replacing them with no-op stubs guarded by
>> arch_needs_pgtable_deposit(), which evaluates to false at compile time
>> on all non-powerpc architectures.
>
> I see no mention of the big problem,
> which has stopped us all from trying this before.
>
> Reclaim: the split_folio_to_list() in shrink_folio_list().
>
> Imagine a process which has forked a thousand times, containing
> anon THPs, which should now be swapped out and reclaimed.
>
> To swap out one of those THPs, it will have to allocate a
> thousand page tables, all with PF_MEMALLOC set (to give some
> access to reserves, while preventing recursion into reclaim).
>
> Elsewhere, we go to great lengths (e.g. mempools) to give
> guaranteed access to the memory needed when freeing memory.
> In the case of an anon THP, the guaranteed pool has been the
> deposited page table. Now what?
>
> And the worst is that when the 501st attempt to allocate a page
> table fails, it has allocated and is using 500 pages from reserve,
> without reaching the point of freeing any memory at all.
>
> Maybe watermark boosting (I barely know whereof I speak) can help
> a bit nowadays. Has anything else changed to solve the problem?
>
> What would help a lot would be the implementation of swap entries
> at the PMD level. Whether that would help enough, I'm sceptical:
> I do think it's foolish to depend upon the availability of huge
> contiguous swap extents, whatever the recent improvements there;
> but it would at least be an arguable justification.
>
Thanks for pointing this out. I should have thought of this, as I
have been thinking about fork a lot, both for 1G THP and for this series.

I am working on trying to make PMD-level swap entries work. I hope
to have an RFC soon.
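To put rough numbers on both sides of the trade-off discussed above, here is a back-of-the-envelope sketch. It assumes x86-64 defaults (2MB PMD-level THPs, 4KB PTE page tables); the 1000-fork figure is Hugh's hypothetical scenario, not a measurement:

```python
# Rough arithmetic for the deposit-overhead vs. reclaim trade-off.
# Assumes x86-64 defaults: 2MB PMD-level THPs, 4KB PTE page tables.
THP_SIZE = 2 * 1024 * 1024   # one PMD-level THP
PT_SIZE = 4 * 1024           # one deposited PTE page table

# Cover-letter figure: one page table deposited per anon THP mapping.
thp_bytes = 100 * 1024 ** 3          # 100GB mapped as THPs
n_thps = thp_bytes // THP_SIZE       # 51200 PMD mappings
overhead = n_thps * PT_SIZE          # deposited page tables
print(overhead // 1024 ** 2, "MB deposited per 100GB of THP")

# Hugh's reclaim scenario: a THP shared by 1000 forked mms must be
# split in every mm before it can be swapped out, so without the
# deposits, reclaim may allocate up to one page table per mm under
# PF_MEMALLOC before it frees any memory at all.
n_forks = 1000
worst_case = n_forks * PT_SIZE       # reserve pages consumed up front
print(worst_case // 1024, "KB allocated from reserves, worst case")
```

The first figure matches the "roughly 200MB per 100GB" in the cover letter; the second shows why a mid-sequence allocation failure (Hugh's 501st table) is painful: the reserve has been drawn down proportionally with nothing yet reclaimed.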