Re: [PATCH 00/13] mm: PMD-level swap entries for anonymous THPs
From: David Hildenbrand (Arm)
Date: Tue Apr 28 2026 - 15:57:05 EST
On 4/27/26 12:01, Usama Arif wrote:
> When reclaim swaps out a PMD-mapped anonymous THP today, the PMD is
> split into 512 PTE-level swap entries via TTU_SPLIT_HUGE_PMD before
> unmap.
>
> This series introduces a PMD-level swap entry. The huge mapping is
> preserved across the swap round-trip, and do_huge_pmd_swap_page()
> resolves the entire 2 MB region in a single fault on swap-in; no
> khugepaged involvement is needed. swap_map metadata is identical
> either way (512 single-slot counts), so the PTE split buys nothing
> on the swap side; it is purely a page-table representation change.
>
> This work was brought about after Hugh reported that one of the
> major blockers for having lazy page table deposit is the lack of
> PMD swap entries [1]. However, this series has benefits of its
> own:
> - The huge mapping is restored on swap-in. Today even when the
> folio is still in swap cache as a single 2 MB folio, the swap-in
> path installs 512 PTE mappings -- the PMD mapping is gone, the
> freshly-materialised PTE table sticks around, and only
> khugepaged can later collapse the range back into a THP.
> do_huge_pmd_swap_page() reinstalls the PMD mapping directly in
> one fault, with no khugepaged involvement.
Ack, that's nice.
> - Memory saved per swapped-out THP *once lazy page table deposit is
> merged* [2]. With lazy page table deposit [2], splitting a PMD into
> 512 PTE swap entries forces allocation of a 4 KB PTE table page.
> The new path leaves the pgtable hierarchy at PMD level and avoids
> that allocation entirely.
> This will save memory when swapping, which is likely when there is
> memory pressure and exactly when allocations are most likely to
> fail.
Also ack.
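For a rough sense of scale, here is a minimal userspace sketch of the
arithmetic, assuming 4 KB base pages and 2 MB PMD-sized THPs as in the
cover letter. The DEMO_* constants and demo_* helpers are hypothetical
names for illustration only, not kernel symbols.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed sizes, taken from the figures quoted in the cover letter. */
#define DEMO_PAGE_SIZE 4096ULL        /* one PTE table page */
#define DEMO_PMD_SIZE  (2ULL << 20)   /* one PMD-mapped THP, 2 MB */

/* Number of PMD-sized THPs that fit in 'bytes' of anonymous memory. */
static uint64_t demo_thps_in(uint64_t bytes)
{
	return bytes / DEMO_PMD_SIZE;
}

/*
 * PTE table memory that is not allocated when 'nr_thps' THPs are
 * swapped out with a PMD-level entry instead of being split:
 * one 4 KB PTE table page per THP.
 */
static uint64_t demo_pte_table_bytes_avoided(uint64_t nr_thps)
{
	assert(nr_thps <= UINT64_MAX / DEMO_PAGE_SIZE); /* no overflow */
	return nr_thps * DEMO_PAGE_SIZE;
}
```

Under these assumptions, 1 GB of swapped-out THPs is 512 THPs, so 2 MB
of PTE table allocations are avoided (a ratio of 1/512).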
> - Walkers (zap, mprotect, smaps, pagemap, soft-dirty, uffd-wp)
> visit one PMD entry instead of 512 PTEs, reducing traversal
> time and lock-hold windows.
Right.
>
> The swap entry value is identical to 512 PTE swap entries (same
> type, same starting offset), so swap_map refcounting is unchanged.
> Only the page-table representation differs; the swap slot allocator,
> swap I/O, and swap cache are untouched. The new path falls back to
> the existing PTE-split path whenever a PMD-order resource is
> unavailable: zswap enabled, non-contiguous swap allocation
> (THP_SWPOUT_FALLBACK), PMD-order folio allocation failure on swap-in
> or fork, racing folio split, or rmap-driven split on a swapcache
> folio. Walkers that previously assumed every non-present PMD encodes
> a PFN (migration / device_private) are taught to recognise PMD swap
> entries.
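The "same type, same starting offset" relationship can be sketched in
userspace as follows. This is only an illustration of the description
above, not code from the series: struct demo_swp_entry and the demo_*
helpers are hypothetical stand-ins (not the kernel's swp_entry_t), and
HPAGE_PMD_NR is defined locally as 512 to match the 2 MB / 4 KB case.

```c
#include <assert.h>
#include <stdint.h>

#define HPAGE_PMD_NR 512 /* local definition: 2 MB / 4 KB */

/* Hypothetical stand-in for a swap entry: a (type, offset) pair. */
struct demo_swp_entry {
	unsigned int type;   /* which swap device */
	uint64_t offset;     /* slot within that device */
};

/*
 * The i-th PTE-level entry that a PMD-level entry would expand to
 * on a split: same type, starting offset plus i.
 */
static struct demo_swp_entry
demo_pte_entry_of(struct demo_swp_entry pmd, unsigned int i)
{
	struct demo_swp_entry pte = pmd;

	assert(i < HPAGE_PMD_NR);
	pte.offset = pmd.offset + i;
	return pte;
}

/* True if slot 'e' lies in the 512-slot range the PMD entry describes. */
static int demo_pmd_entry_covers(struct demo_swp_entry pmd,
				 struct demo_swp_entry e)
{
	return e.type == pmd.type &&
	       e.offset >= pmd.offset &&
	       e.offset < pmd.offset + HPAGE_PMD_NR;
}
```

In this model both representations name the same 512 slots, which is
why the per-slot swap_map counts would not need to change.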
All sounds nice. I'll get to reviewing this soon. LSF/MM and travel will slow me
down a bit in May :(
--
Cheers,
David