Re: [PATCH 00/13] mm: PMD-level swap entries for anonymous THPs
From: Usama Arif
Date: Wed Apr 29 2026 - 05:51:52 EST
On 28/04/2026 20:54, David Hildenbrand (Arm) wrote:
> On 4/27/26 12:01, Usama Arif wrote:
>> When reclaim swaps out a PMD-mapped anonymous THP today, the PMD is
>> split into 512 PTE-level swap entries via TTU_SPLIT_HUGE_PMD before
>> unmap.
>>
>> This series introduces a PMD-level swap entry. The huge mapping is
>> preserved across the swap round-trip, and do_huge_pmd_swap_page()
>> resolves the entire 2 MB region in a single fault on swap-in;
>> no khugepaged involvement is needed. swap_map metadata is identical
>> either way (512 single-slot counts), so the PTE split buys nothing
>> on the swap side; it is purely a page-table representation change.
>>
>> This work was brought about after Hugh reported that one of the
>> major blockers for having lazy page table deposit is the lack of
>> PMD swap entries [1]. However, this series has benefits of its
>> own:
>> - The huge mapping is restored on swap-in. Today even when the
>> folio is still in swap cache as a single 2 MB folio, the swap-in
>> path installs 512 PTE mappings -- the PMD mapping is gone, the
>> freshly-materialised PTE table sticks around, and only
>> khugepaged can later collapse the range back into a THP.
>> do_huge_pmd_swap_page() reinstalls the PMD mapping directly in
>> one fault, with no khugepaged involvement.
>
> Ack, that's nice.
>
>> - Memory saved per swapped-out THP *once lazy page table deposit is
>> merged* [2]. With lazy page table deposit [2], splitting a PMD into
>> 512 PTE swap entries forces allocation of a 4 KB PTE table page.
>> The new path leaves the pgtable hierarchy at PMD level and avoids
>> that allocation entirely.
>> This will save memory when swapping, which is likely when there is
>> memory pressure and exactly when allocations are most likely to
>> fail.
>
> Also ack.
>
>> - Walkers (zap, mprotect, smaps, pagemap, soft-dirty, uffd-wp)
>> visit one PMD entry instead of 512 PTEs, reducing traversal
>> time and lock-hold windows.
>
> Right.
>
>>
>> The swap entry value is identical to 512 PTE swap entries (same
>> type, same starting offset), so swap_map refcounting is unchanged.
>> Only the page-table representation differs; the swap slot allocator,
>> swap I/O, and swap cache are untouched. The new path falls back to
>> the existing PTE-split path whenever a PMD-order resource is
>> unavailable: zswap enabled, non-contiguous swap allocation
>> (THP_SWPOUT_FALLBACK), PMD-order folio allocation failure on swap-in
>> or fork, racing folio split, or rmap-driven split on a swapcache
>> folio. Walkers that previously assumed every non-present PMD encodes
>> a PFN (migration / device_private) are taught to recognise PMD swap
>> entries.
>
> All sounds nice. I'll get to review this soon. LSF/MM and travel will slow me
> down a bit in May :(
>
Thanks! Appreciate it!
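
For anyone following along, the claim quoted above that one PMD swap entry
carries the same information as 512 PTE swap entries (same type, same
starting offset) can be illustrated with a small userspace model. This is
only a sketch under stated assumptions: 4 KB base pages, a 2 MB PMD, and a
4 KB PTE table. None of the model_* names or MODEL_* constants are kernel
APIs or taken from the series; they are hypothetical and exist purely for
illustration.

```c
/*
 * Userspace model only. Every name here is hypothetical and is NOT
 * a kernel API or part of the patch series.
 */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_PAGE_SIZE    4096UL                /* assumed 4 KB base page */
#define MODEL_PMD_SIZE     (2UL * 1024 * 1024)   /* assumed 2 MB PMD mapping */
#define MODEL_PTES_PER_PMD (MODEL_PMD_SIZE / MODEL_PAGE_SIZE)

/* Simplified swap entry: a swap type plus a slot offset. */
struct model_swp_entry {
	unsigned int type;
	uint64_t offset;
};

/*
 * Derive the PTE-level swap entry that subpage 'idx' would get if the
 * PMD-level entry were split: same type, starting offset plus idx.
 */
static struct model_swp_entry
model_pte_entry(struct model_swp_entry pmd_entry, size_t idx)
{
	struct model_swp_entry e;

	assert(idx < MODEL_PTES_PER_PMD);
	e.type = pmd_entry.type;
	e.offset = pmd_entry.offset + idx;
	return e;
}

/*
 * Bytes of PTE table memory not allocated when 'nr_thps' swapped-out
 * THPs stay at PMD level, assuming one 4 KB PTE table per split and
 * that lazy page table deposit is in effect.
 */
static size_t model_pte_table_bytes_saved(size_t nr_thps)
{
	return nr_thps * MODEL_PAGE_SIZE;
}
```

In this model, a PMD entry with type 1 and offset 1024 expands to PTE
entries covering offsets 1024 through 1535, which is why the swap_map
refcounting does not need to change.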