Re: [v2 00/16] mm: PMD-level swap entries for anonymous THPs

From: Kairui Song

Date: Sun Jun 14 2026 - 13:33:26 EST


On Sun, Jun 14, 2026 at 3:19 AM Usama Arif <usama.arif@xxxxxxxxx> wrote:
>
>
>
> On 13/06/2026 05:22, Lance Yang wrote:
> >
> >
> > After skimming through the whole series, probably PMD swap entries need
> > one bigger rethink ...
> >
> > Emm ... same tricky bit keeps showing up ...
> >
> > One PMD swap entry is easy to handle while the swapcache still has one
> > PMD-sized folio behind it. Once that folio got split and reclaimed, the
> > 512 swap slots need per-page handling :)
> >
> > Maybe worth first pinning down the rule here.
> >
> > Is a PMD swap entry supposed to mean "there is, or soon will be, one PMD-
> > sized folio behind it", or is it just a compact page-table encoding for
> > 512 swap slots?
> >
> > Without that rule being very clear, every caller has to guess how much
> > it can assume, and it is easy to miss one ...
> >
> > So I stopped staring at the details for now, because the same issue keeps
> > popping up wearing a slightly different hat :)
> >
> > Anyway, no clever answer from me here, not a swap expert :( Just pointing
> > out the pattern I keep running into.
> >
>
> Thanks for the amazing reviews!
>
> For the next revision I’m going to treat a PMD swap entry as just a compact
> page-table encoding for 512 ordinary swap slots. It does not mean that the
> swapcache still has, or will soon have, one PMD-sized folio behind it.
>
> With that rule, whole-PMD handling is only valid when either:
>
> 1. the swapcache still has one PMD-sized folio for the range, or
> 2. the whole PMD swap range has no cached folios, so the caller can try a
> PMD-sized swapin and still fall back if that is not possible.
>
> If any slot in the range has per-page cache state, the PMD entry has to be
> split and the existing PTE paths need to handle the individual slots.
>
> I am reworking the next revision around that. I added a shared helper to
> classify the swapcache behind a PMD swap entry as empty, PMD-sized, or
> split, then used it in the places where this assumption mattered:
> mincore, UFFDIO_MOVE, swapoff, MADV_WILLNEED, and the PMD swap fault path.
> UFFDIO_MOVE now checks the whole 512-slot range before moving a PMD swap
> entry without a cached folio, and falls back to PTE handling if per-page
> cached folios exist.
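
To make the classification rule quoted above concrete, here is a minimal
userspace sketch in C. It is not code from the series: struct toy_folio,
the enum values and classify_pmd_swapcache() are hypothetical names, the
swap cache is modelled as a plain array of 512 pointers, and 4K pages
with a 2M PMD are assumed. It only illustrates the three-way result
(empty, PMD-sized, split) and the "whole-PMD handling is valid unless
split" rule.

```c
#include <assert.h>
#include <stddef.h>

#define PMD_SLOTS 512   /* assumed: swap slots covered by one PMD entry */
#define PMD_ORDER 9     /* log2(PMD_SLOTS) */

/* Toy stand-in for a swap cache folio; not the kernel's struct folio. */
struct toy_folio {
	unsigned int order;     /* folio covers 1 << order slots */
};

/* Hypothetical result type, mirroring "empty, PMD-sized, or split". */
enum pmd_swapcache_state {
	PMD_SWAPCACHE_EMPTY,    /* no slot in the range has a cached folio */
	PMD_SWAPCACHE_PMD,      /* one PMD-sized folio backs all 512 slots */
	PMD_SWAPCACHE_SPLIT,    /* anything else: per-page cache state */
};

/* cache[i] is the folio cached for slot i of the range, or NULL. */
static enum pmd_swapcache_state
classify_pmd_swapcache(struct toy_folio *const cache[PMD_SLOTS])
{
	const struct toy_folio *first;
	size_t i;

	assert(cache != NULL);
	first = cache[0];

	/* Case 1: the same PMD-sized folio must sit behind every slot. */
	if (first && first->order == PMD_ORDER) {
		for (i = 1; i < PMD_SLOTS; i++)
			if (cache[i] != first)
				return PMD_SWAPCACHE_SPLIT;
		return PMD_SWAPCACHE_PMD;
	}

	/* Case 2: empty only if no slot at all has a cached folio. */
	for (i = 0; i < PMD_SLOTS; i++)
		if (cache[i])
			return PMD_SWAPCACHE_SPLIT;
	return PMD_SWAPCACHE_EMPTY;
}

/* Whole-PMD handling is allowed for EMPTY and PMD, never for SPLIT. */
static int pmd_whole_handling_ok(enum pmd_swapcache_state state)
{
	return state != PMD_SWAPCACHE_SPLIT;
}
```

In this model a caller such as the UFFDIO_MOVE path would classify the
range first and take the PTE fallback whenever the result is
PMD_SWAPCACHE_SPLIT. Locking and racing against concurrent swapin or
reclaim are deliberately left out.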

Looks interesting. Sorry, I haven't gone through the whole series in
detail yet. One question from me: how different is this from the
current THP handling? We already deal with THP fallback, where
sub-pages (of any order) can exist in the swap cache range covered by
an incoming (swapin) (m)THP, so the only thing different here is that
we need to do a split, right? Meanwhile shmem already has similar
split logic (splitting a large shmem mapping xarray entry), so maybe
the two are similar enough that some routines can be shared or reused?
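
For readers following along, the split discussed above can be pictured
with a small userspace sketch in C. It is only an illustration of "one
compact entry becomes 512 ordinary per-slot entries", not code from the
series or from shmem: struct toy_swp_entry and
toy_split_pmd_swap_entry() are hypothetical, the entry is reduced to a
bare offset, and the PMD-level offset is assumed to be aligned to 512
slots.

```c
#include <assert.h>
#include <stddef.h>

#define PMD_SLOTS 512   /* assumed: swap slots covered by one PMD entry */

/* Toy swap entry holding only an offset; not the swp_entry_t layout. */
struct toy_swp_entry {
	unsigned long offset;
};

/*
 * Expand one PMD-level entry into PMD_SLOTS consecutive PTE-level
 * entries, so each slot can then be handled by per-page code.
 */
static void toy_split_pmd_swap_entry(struct toy_swp_entry pmd_entry,
				     struct toy_swp_entry ptes[PMD_SLOTS])
{
	size_t i;

	/* Assumption of this sketch: the range starts on a 512 boundary. */
	assert(pmd_entry.offset % PMD_SLOTS == 0);

	for (i = 0; i < PMD_SLOTS; i++)
		ptes[i].offset = pmd_entry.offset + i;
}
```

Whether a real implementation could share a routine with the shmem
xarray split is exactly the open question here; the sketch says nothing
about that.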