Re: [RFC PATCH 0/5] mm, swap: Virtual Swap Space (Swap Table Edition)

From: Nhat Pham

Date: Wed Jun 03 2026 - 13:15:21 EST


On Tue, Jun 2, 2026 at 6:29 PM Yosry Ahmed <yosry@xxxxxxxxxx> wrote:
>
> > II. Design
> >
> > With vswap, pages are assigned virtual swap entries on a ghost device
> > with no backing storage. These entries are backed by zswap, zero pages,
> > or (lazily) physical swap slots. Physical backing is allocated only
> > when needed — on zswap writeback or reclaim writeout, after the rmap
> > step.
> >
> > Compared to the standalone v6 implementation [1], which introduces a
> > 24-byte per-entry swap descriptor and its own cluster allocator, this
> > edition uses swap_table infrastructure, and shares a lot of the allocator
> > logic. Per-slot metadata is stored in a tag-encoded virtual_table
> > (atomic_long_t, 8 bytes per slot), and physical clusters store
> > pointer-tagged rmap entries in the swap_table for reverse lookup back to
> > the virtual cluster.
> >
> > Here are some data layout diagrams:
> >
> > Case 1: vswap entry (virtualized)
> >
> >   PTE                  swap_cluster_info_dynamic
> >   vswap_entry          +-------------------------+
> >   (swp_entry_t) ------>| swap_cluster_info (ci)  |
> >                        | +--------------------+  |
> >                        | | swap_table         |  |
> >                        | |   PFN / Shadow     |  |
> >                        | | memcg_table        |  |
> >                        | | count,flags,order  |  |
> >                        | | lock, list         |  |
> >                        | +--------------------+  |
> >                        |                         |
> >                        | virtual_table           |
> >                        | +--------------------+  |
> >                        | | NONE               |  |
> >                        | | PHYS               |  |
> >                        | | ZERO               |  |
> >                        | | ZSWAP(entry*)      |  |
> >                        | | FOLIO(folio*)      |  |
> >                        | +--------------------+  |
> >                        +-------------------------+
> >                                     |
> >                                     | PHYS resolves to
> >                                     v
> >                        PHYSICAL CLUSTER (swap_cluster_info)
> >                        +--------------------------+
> >                        | swap_table per-slot:     |
> >                        |   NULL   - free          |
> >                        |   PFN    - cached folio  |
> >                        |   Shadow - swapped out   |
> >                        |   Pointer- vswap rmap    |
> >                        |   Bad    - unusable      |
> >                        |                          |
> >                        | Vswap-backing slot:      |
> >                        |   Pointer(C|swp_entry_t) |
> >                        |   rmap back to vswap     |
> >                        +--------------------------+
> >
> > Case 2: direct-mapped physical entry (no vswap)
> >
> >   PTE                  PHYSICAL CLUSTER (swap_cluster_info)
> >   phys_entry           +--------------------------+
> >   (swp_entry_t) ------>| swap_table per-slot:     |
> >                        |   NULL   - free          |
> >                        |   PFN    - cached folio  |
> >                        |   Shadow - swapped out   |
> >                        |   Bad    - unusable      |
> >                        +--------------------------+
> >
> >   struct swap_cluster_info_dynamic {
> >       struct swap_cluster_info ci;    /* swap_table, lock, etc. */
> >       unsigned int index;             /* position in xarray */
> >       struct rcu_head rcu;            /* kfree_rcu deferred free */
> >       atomic_long_t *virtual_table;   /* backend info, 8 B/slot */
> >   };
> >
> > Each vswap cluster (swap_cluster_info_dynamic) extends the classic
> > swap_cluster_info struct with a virtual_table array that stores the
> > backend information for each virtual swap entry in the cluster. Each
> > entry is tag-encoded in the low 3 bits to indicate backend types:
> >
> >   NONE:  |----- 0000 ------|000|  free / unbacked
> >   PHYS:  |-- (type:5,off:N)|001|  on a physical swapfile (shifted)
> >   ZERO:  |----- 0000 ------|010|  zero-filled page
> >   ZSWAP: |--- zswap_entry* |011|  compressed in zswap
> >   FOLIO: |--- folio* ------|100|  in-memory folio
> >
> > We still have room for 3 more future backend types, e.g. CRAM, i.e.
> > compressed-CXL-as-swap, which is laid out in [10] and [11]. Worst
> > case scenario, we can add more fields to this extended struct.
> >
> > Other design points:
> > - Both vswap entries (Case 1) and directly-mapped physical entries
> > (Case 2) coexist as first-class citizens. All the common swap
> > code paths — swapout, swapin, swap freeing, swapoff, zswap
> > writeback, THP swapin, etc. work for both. When CONFIG_VSWAP=n,
> > the vswap branches compile out and behavior should be identical to
> > today's swap-table P4 (at least that is my intention).
> > - Pointer-tagged swap_table on physical clusters for rmap (physical
> > -> virtual) lookup.
> > - Virtual swap slots not backed by physical swap are not charged to
> > memcg swap counters — only physical backing is charged (I made the
> > case for this in [7]).
> > - Careful separation of vswap and physical swap allocation paths and
> > structures adds a lot of complexity, but is crucial to make sure
> > both paths are efficient and do not conflict with each other (for
> > correctness and performance). I do re-use a lot of the allocation
> > logic wherever possible though.
>
> Thanks for working on this! I mostly looked at the high-level design and

Thank you for initiating this effort at LSFMMBPF 2023 (god, time
flies). I was very excited by your presentation and decided to take a
stab at it :)

(I'll be sure to mention the full context in a non-RFC version - there
are a lot of gems in our technical discussions).

> the zswap parts, as the swap code has changed a lot since I was familiar
> with it :)

It has changed a lot since 6.19, when I was working on v6. Very
exciting time to be a (z)swap developer right now - we have new ideas
and new features every other week :) Reviewing code has been quite a
joy (albeit a lot of work).

>
> It seems like the direction being taken here is that we have one
> (massive) vswap swap device, and we keep normal physical swap devices
> around as well.

Yep.

>
> A vswap entry can point at a physical swap entry, or zswap, or zeromap.
> If a vswap entry points at a physical swap entry, then the physical swap
> entry points back at the vswap entry (a reverse mapping).

Yep.
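
For readers following along, here is a toy userspace sketch of that
two-way link. It is not the patch code: the flat arrays and the names
bind_slot()/unbind_slot() are made up for illustration, and the real
thing stores the reverse pointer as a tagged entry in the physical
cluster's swap_table.

```c
#include <assert.h>
#include <stddef.h>

#define NR_SLOTS 8
#define NO_SLOT ((size_t)-1)

static size_t virt_to_phys[NR_SLOTS];	/* forward: vswap slot -> physical slot */
static size_t phys_to_virt[NR_SLOTS];	/* reverse: physical slot -> vswap slot */

static void maps_init(void)
{
	for (size_t i = 0; i < NR_SLOTS; i++)
		virt_to_phys[i] = phys_to_virt[i] = NO_SLOT;
}

/* Back vswap slot v with physical slot p, updating both directions. */
static void bind_slot(size_t v, size_t p)
{
	assert(virt_to_phys[v] == NO_SLOT && phys_to_virt[p] == NO_SLOT);
	virt_to_phys[v] = p;
	phys_to_virt[p] = v;
}

/* Drop the physical backing of vswap slot v, clearing both directions. */
static void unbind_slot(size_t v)
{
	size_t p = virt_to_phys[v];

	if (p == NO_SLOT)
		return;
	phys_to_virt[p] = NO_SLOT;
	virt_to_phys[v] = NO_SLOT;
}
```

The point of the reverse direction is that code starting from a
physical slot can find the vswap entry that owns it without scanning.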

>
> I assume the main reason here is to avoid the extra overhead if
> everything uses vswap, which would mainly be the reverse mapping
> overhead? I guess there's also some simplicity that comes from reusing
> the swap info infra as a whole, including the swap table.

Yeah it helps a lot that we don't have to rewrite the whole allocator
and swap entry reference counting logic again :)

>
> I don't like that the code bifurcates for vswap vs. normal swap entries
> though. Not sure if this is an issue that can be fixed with proper
> abstractions to hide it, or if the design needs modifications. I was
> honestly really hoping we don't end up with this. I was hoping that the
> physical swap device no longer uses a full swap table and all, and
> everything goes through vswap.
>
> I was hoping that if redirection isn't needed (e.g. zswap is disabled),
> vswap can directly encode the physical swap slot so that the reverse
> mapping isn't needed -- so we avoid the overhead without keeping the
> physical swap device using a fully-fledged swap table.

Can you expand on "vswap can directly encode the physical swap slot"?
I'm not sure I follow here.

>
> All that being said, perhaps I am too out of touch with the code to
> realize it's simply not possible.
>
> Honestly, if the main reason we can't have a single swap table for vswap
> is saving 8 bytes on the reverse mapping, it sounds like a weak-ish
> argument, even if we can't optimize the reverse mapping away. But maybe
> I am also out of touch with RAM prices :)

In terms of the space overhead I do agree, FWIW :)

I think the other concern is the indirection overhead of going
through the xarray for every swap operation, hence the per-CPU vswap
cluster lookup caching idea:

https://lore.kernel.org/all/20260505153854.1612033-23-nphamcs@xxxxxxxxx/
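
Roughly, the idea looks like the userspace sketch below. This is only
an illustration under my own assumptions, not the code in that patch:
a flat array stands in for the xarray, a thread-local pointer stands
in for the per-CPU slot, and all names (vcluster_lookup() etc.) are
hypothetical. Invalidation when a cluster is freed is deliberately
left out.

```c
#include <assert.h>
#include <stddef.h>

#define NR_CLUSTERS 64

struct vcluster {
	unsigned long index;
};

/* Stand-in for the xarray: a flat table indexed by cluster number. */
static struct vcluster *cluster_table[NR_CLUSTERS];
static unsigned long slow_lookups;	/* counts trips to the "xarray" */

static struct vcluster *slow_lookup(unsigned long index)
{
	slow_lookups++;
	return index < NR_CLUSTERS ? cluster_table[index] : NULL;
}

/* One cached cluster per thread; models a per-CPU slot. */
static _Thread_local struct vcluster *cached;

static struct vcluster *vcluster_lookup(unsigned long index)
{
	struct vcluster *ci = cached;

	if (ci && ci->index == index)
		return ci;		/* hit: no walk of the shared structure */

	ci = slow_lookup(index);
	assert(!ci || ci->index == index);
	if (ci)
		cached = ci;		/* remember the last cluster touched */
	return ci;
}
```

Consecutive operations on the same cluster then pay for the slow
lookup once, which is the common case when swapping out a batch of
neighbouring entries.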

>
> I at least hope that, the current design is not painting us into a
> corner (e.g. through userspace interfaces), and we can still achieve a
> vswap-for-all implementation in the future (maybe that's what you have
> in mind already?).

That's still my plan. Operationally speaking, I want to make this
completely transparent to users, with minimal to no performance
overhead.

The next action item is to optimize for the vswap-on-fast-swapfile case -
that was Kairui's main concern regarding performance. I spent a lot
of time perfing and fixing issues for this case in v6. The issues with
the most egregious effects and simplest fixes (e.g. the vswap-less
swap-cache-only check) are already fixed in this new design, and
eventually I will move the rest (lookup caching and more) here.

>
> Aside from the swap code, the only sticking point for me is the logic
> bifurcation in zswap. Why does zswap need to handle vswap vs. not vswap?
> I thought the point of the design is to use vswap when zswap is used,
> and otherwise use a normal swap table. In a way, one of the goals is to
> make zswap a first class swap citizen, but it doesn't seem like we are
> achieving that?

We already have all the machinery to make zswap completely
independent. Right now, if you use vswap, you'll skip zswap's
internal xarray entirely, and just store a zswap entry in the virtual
swap cluster's vtable.
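
To make that concrete, here is a minimal userspace sketch of the tag
encoding from the cover letter (backend type in the low 3 bits of an
8-byte word). The tag values follow the quoted table; the helper names
(vt_pack_ptr() and friends) are made up for this email and are not
what the patch uses.

```c
#include <assert.h>
#include <stdint.h>

/* Tag values as listed in the cover letter; identifiers are hypothetical. */
enum vtag { VT_NONE = 0, VT_PHYS = 1, VT_ZERO = 2, VT_ZSWAP = 3, VT_FOLIO = 4 };

#define VT_TAG_BITS 3
#define VT_TAG_MASK ((uintptr_t)((1u << VT_TAG_BITS) - 1))

/* Pack an 8-byte-aligned pointer with a backend tag in its low 3 bits. */
static inline uintptr_t vt_pack_ptr(const void *p, enum vtag tag)
{
	uintptr_t v = (uintptr_t)p;

	assert((v & VT_TAG_MASK) == 0);	/* alignment leaves the low bits free */
	return v | (uintptr_t)tag;
}

/* PHYS entries carry a shifted (type, offset) value rather than a pointer. */
static inline uintptr_t vt_pack_phys(uintptr_t swp_val)
{
	return (swp_val << VT_TAG_BITS) | VT_PHYS;
}

static inline enum vtag vt_tag(uintptr_t v)
{
	return (enum vtag)(v & VT_TAG_MASK);
}

static inline void *vt_ptr(uintptr_t v)
{
	return (void *)(v & ~VT_TAG_MASK);
}

static inline uintptr_t vt_phys(uintptr_t v)
{
	return v >> VT_TAG_BITS;
}
```

A zswap store would then boil down to writing
vt_pack_ptr(entry, VT_ZSWAP) into the slot, and a load to checking
vt_tag() and recovering the pointer with vt_ptr(), with no separate
xarray lookup.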

I just haven't removed the old code for 2 reasons:

1. Reduce the delta on this RFC, to ease the burden for reviewers (and
definitely not because I'm lazy :P)

2. The only other practical reason is so that we can let users compile
with !CONFIG_VSWAP and still use zswap on top of the old swapfile
setup during the transition/experimentation period for now.

But logically and conceptually speaking, there is no reason I can come
up with to use zswap without vswap. The CPU indirection overhead is
already partially there (since zswap uses an xarray) and can be further
reduced (cluster lookup caching etc.), and the space overhead is offset
as well (vswap replaces the zswap xarray). I actually wrote a whole
paragraph about how we should always go for vswap if we're using zswap,
but then decided to remove it since there's no code for it yet.

If folks like it, what I can do is have CONFIG_ZSWAP depend on
CONFIG_VSWAP, remove all the non-vswap logic, and call it a day? :)
Then, on the swap allocation side, if vswap allocation fails and zswap
writeback is disabled, we can error out early.
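
As a rough sketch of that last point (an illustration only, nothing
like this exists in the series yet): the struct, the enum and
pick_swap_target() below are all hypothetical, and I'm assuming
-ENOMEM as the error, which is not settled.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical snapshot of what the allocator would know at swapout time. */
struct swap_state {
	bool vswap_slot_available;	/* would a vswap allocation succeed? */
	bool zswap_writeback_enabled;	/* may pages reach physical swap? */
	bool phys_slot_available;	/* is a physical slot free? */
};

enum swap_target { TARGET_VSWAP, TARGET_PHYS };

/* Returns 0 and sets *target, or a negative errno. */
static int pick_swap_target(const struct swap_state *s, enum swap_target *target)
{
	assert(s && target);

	if (s->vswap_slot_available) {
		*target = TARGET_VSWAP;
		return 0;
	}

	/* No vswap slot means no zswap; without writeback there is nowhere to go. */
	if (!s->zswap_writeback_enabled)
		return -ENOMEM;

	if (!s->phys_slot_available)
		return -ENOMEM;

	*target = TARGET_PHYS;
	return 0;
}
```

The early return is the whole point: the page is rejected before any
physical slot is touched.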