Re: [RFC PATCH 0/5] mm, swap: Virtual Swap Space (Swap Table Edition)

From: Nhat Pham

Date: Wed Jun 03 2026 - 13:34:46 EST

On Wed, Jun 3, 2026 at 10:12 AM Nhat Pham <nphamcs@xxxxxxxxx> wrote:
>
> On Tue, Jun 2, 2026 at 6:29 PM Yosry Ahmed <yosry@xxxxxxxxxx> wrote:
> >
> > > II. Design
> > >
> > > With vswap, pages are assigned virtual swap entries on a ghost device
> > > with no backing storage. These entries are backed by zswap, zero pages,
> > > or (lazily) physical swap slots. Physical backing is allocated only
> > > when needed — on zswap writeback or reclaim writeout, after the rmap
> > > step.
> > >
> > > Compared to the standalone v6 implementation [1], which introduces a
> > > 24-byte per-entry swap descriptor and its own cluster allocator, this
> > > edition uses swap_table infrastructure, and share a lot of the allocator
> > > logic. Per-slot metadata is stored in a tag-encoded virtual_table
> > > (atomic_long_t, 8 bytes per slot), and physical clusters store
> > > Pointer-tagged rmap entries in the swap_table for reverse lookup back to
> > > the virtual cluster.
> > >
> > > Here are some data layout diagrams:
> > >
> > > Case 1: vswap entry (virtualized)
> > >
> > > PTE swap_cluster_info_dynamic
> > > vswap_entry +-------------------------+
> > > (swp_entry_t) ------>| swap_cluster_info (ci) |
> > > | +--------------------+ |
> > > | | swap_table | |
> > > | | PFN / Shadow | |
> > > | | memcg_table | |
> > > | | count,flags,order | |
> > > | | lock, list | |
> > > | +--------------------+ |
> > > | |
> > > | virtual_table |
> > > | +--------------------+ |
> > > | | NONE | |
> > > | | PHYS | |
> > > | | ZERO | |
> > > | | ZSWAP(entry*) | |
> > > | | FOLIO(folio*) | |
> > > | +--------------------+ |
> > > +-------------------------+
> > > |
> > > | PHYS resolves to
> > > v
> > > PHYSICAL CLUSTER (swap_cluster_info)
> > > +--------------------------+
> > > | swap_table per-slot: |
> > > | NULL - free |
> > > | PFN - cached folio |
> > > | Shadow - swapped out |
> > > | Pointer- vswap rmap |
> > > | Bad - unusable |
> > > | |
> > > | Vswap-backing slot: |
> > > | Pointer(C|swp_entry_t) |
> > > | rmap back to vswap |
> > > +--------------------------+
> > >
> > > Case 2: direct-mapped physical entry (no vswap)
> > >
> > > PTE PHYSICAL CLUSTER (swap_cluster_info)
> > > phys_entry +--------------------------+
> > > (swp_entry_t) ------>| swap_table per-slot: |
> > > | NULL - free |
> > > | PFN - cached folio |
> > > | Shadow - swapped out |
> > > | Bad - unusable |
> > > +--------------------------+
> > >
> > > struct swap_cluster_info_dynamic {
> > > struct swap_cluster_info ci; /* swap_table, lock, etc. */
> > > unsigned int index; /* position in xarray */
> > > struct rcu_head rcu; /* kfree_rcu deferred free */
> > > atomic_long_t *virtual_table; /* backend info, 8 B/slot */
> > > };
> > >
> > > Each vswap cluster (swap_cluster_info_dynamic) extends the classic
> > > swap_cluster_info struct with a virtual_table array that stores the
> > > backend information for each virtual swap entry in the cluster. Each
> > > entry is tag-encoded in the low 3 bits to indicate backend types:
> > >
> > > NONE: |----- 0000 ------|000| free / unbacked
> > > PHYS: |-- (type:5,off:N)|001| on a physical swapfile (shifted)
> > > ZERO: |----- 0000 ------|010| zero-filled page
> > > ZSWAP: |--- zswap_entry* |011| compressed in zswap
> > > FOLIO: |--- folio* ------|100| in-memory folio
> > >
> > > We still have room for 3 more future backend types, for e.g. CRAM, i.e
> > > compressed-CXL-as-swap, which is laid out in [10] and [11]. Worst
> > > case scenario, we can add more fields to this extended struct.
> > >
> > > Other design points:
> > > - Both vswap entries (Case 1) and directly-mapped physical entries
> > > (Case 2) coexist as first-class citizens. All the common swap
> > > code paths — swapout, swapin, swap freeing, swapoff, zswap
> > > writeback, THP swapin, etc. work for both. When CONFIG_VSWAP=n,
> > > the vswap branches compile out and behavior should be identical to
> > > today's swap-table P4 (at least that is my intention).
> > > - Pointer-tagged swap_table on physical clusters for rmap (physical
> > > -> virtual) lookup.
> > > - Virtual swap slots not backed by physical swap are not charged to
> > > memcg swap counters — only physical backing is charged (I made the
> > > case for this in [7]).
> > > - Careful separation of vswap and physical swap allocation paths and
> > > structures adds a lot of complexity, but is crucial to make sure
> > > both paths are efficient and do not conflict with each other (for
> > > correctness and performance). I do re-use a lot of the allocation
> > > logic wherever possible though.
> >
> > Thanks for working on this! I mostly looked at the high-level design and
>
> Thank you for initiating this effort in LSFMMBPF 2023 (god, time
> flies). I was very excited by your presentation and decided to take a
> stab at it :)
>
> (I'll be sure to mention the full context in a non-RFC version - it
> has a lot of gems in our technical discussions).
>
> > the zswap parts, as the swap code has changed a lot since I was familiar
> > with it :)
>
> It has changed a lot since 6.19, when I was working on v6. Very
> exciting time to be a (z)swap developer right now - we have new ideas
> and new features every other week :) Reviewing code has been quite a
> joy (albeit a lot of work).
>
> >
> > It seems like the direction being taken here is that we have one
> > (massive) vswap swap device, and we keep normal physical swap devices
> > around as well.
>
> Yep.
>
> >
> > A vswap entry can point at a physical swap entry, or zswap, or zeromap.
> > If a vswap entry points at a physical swap entry, then the physical swap
> > entry points back at the vswap entry (a reverse mapping).
>
> Yep.
>
> >
> > I assume the main reason here is to avoid the extra overhead if
> > everything uses vswap, which would mainly be the reverse mapping
> > overhead? I guess there's also some simplicity that comes from reusing
> > the swap info infra as a whole, including the swap table.
>
> Yeah it helps a lot that we don't have to rewrite the whole allocator
> and swap entry reference counting logic again :)
>
> >
> > I don't like that the code bifurcates for vswap vs. normal swap entries
> > though. Not sure if this is an issue that can be fixed with proper
> > abstractions to hide it, or if the design needs modifications. I was
> > honestly really hoping we don't end up with this. I was hoping that the
> > physical swap device no longer uses a full swap table and all, and
> > everything goes through vswap.
> >
> > I hoping that if redirection isn't needed (e.g. zswap is disabled),
> > vswap can directly encode the physical swap slot so that the reverse
> > mapping isn't needed -- so we avoid the overhead without keeping the
> > physical swap device using a fully-fledged swap table.
>
> Can you expand on "vswap can directly encode the physical swap slot"?
> I'm not sure I follow here.
>
> >
> > All that being said, perhaps I am too out of touch with the code to
> > realize it's simply not possible.
> >
> > Honestly, if the main reason we can't have a single swap table for vswap
> > is saving 8 bytes on the reverse mapping, it sounds like a weak-ish
> > argument, even if we can't optimize the reverse mapping away. But maybe
> > I am also out of touch with RAM prices :)
>
> In terms of the space overhead I do agree, FWIW :)
>
> I think the other concern is the indirection overhead with going
> through the xarray for every swap operation, hence the per-CPU vswap
> cluster lookup caching idea:
>
> https://lore.kernel.org/all/20260505153854.1612033-23-nphamcs@xxxxxxxxx/
>
> >
> > I at least hope that, the current design is not painting us into a
> > corner (e.g. through userspace interfaces), and we can still achieve a
> > vswap-for-all implementation in the future (maybe that's what you have
> > in mind already?).
>
> That's still my plan. Operationally speaking, I want to make this
> completely transparent to users, with minimal to no performance
> overhead.

I do want to add that, even without achieving this, the current design
already enables a lot of use cases. I think it is a good compromise to
maintain both virtual and directly mapped physical swap entries for
now, and revisit the conversation of whether we can afford a mandatory
vswap layer once all the optimizations have been done :)

We should strive to simplify the codebase, and it will naturally
happen when the original overhead concern is no longer there. A
swap-related example: a few years ago, everyone thought swap slot
cache was needed. But then, Kairui optimized the swap allocator's lock
contention issue away, and that swap slot cache is suddenly redundant.
That finally allowed us to get rid of it. Similar thing happened (or
is happening?) with the SWP_SYNCHRONOUS_IO swapcache-skipping
heuristics.