Re: [RFC PATCH 0/5] mm, swap: Virtual Swap Space (Swap Table Edition)

From: Yosry Ahmed

Date: Wed Jun 03 2026 - 15:38:04 EST

On Wed, Jun 3, 2026 at 12:26 PM Nhat Pham <nphamcs@xxxxxxxxx> wrote:
>
> On Wed, Jun 3, 2026 at 11:58 AM Yosry Ahmed <yosry@xxxxxxxxxx> wrote:
> >
> > > > I assume the main reason here is to avoid the extra overhead if
> > > > everything uses vswap, which would mainly be the reverse mapping
> > > > overhead? I guess there's also some simplicity that comes from reusing
> > > > the swap info infra as a whole, including the swap table.
> > >
> > > Yeah it helps a lot that we don't have to rewrite the whole allocator
> > > and swap entry reference counting logic again :)
> >
> > I specifically meant using a full swap info thing for the physical swap
> > device even when it's behind vswap. That seems like overkill, and we
> > don't need things like the swap entry reference counting. We probably
> > just need a bitmap and a reverse mapping.
> >
> > So I am assuming the main reason why we are not doing that (at least for
> > now) is simplicity?
>
> Mostly.
>
> FWIW, we're pretty close to full deduplication. Right now, physical
> swap clusters have a couple of fields that are not needed when they're
> backing a vswap cluster:
>
> 1. The main swap table (which houses swap cache, swap shadow, and
> reference counting): I repurpose it for the rmap :) It's an array of
> unsigned long, which works for rmap.
>
> 2. memcg_table: still duplicated, but I think I can make sure this is
> not allocated if physical swap clusters only back vswap entries. I
> have a prototype that I'm testing for this.
>
> 3. The zeromap field: this is actually not allocated on 64-bit
> architectures, IIUC, which is what I'm gating CONFIG_VSWAP on. If we
> extend vswap to support 32-bit, this can also be dynamically
> allocated.
>
> 4. Extend table - this is for the swap count overfills, and already
> dynamically allocated.

I see.

> > > > All that being said, perhaps I am too out of touch with the code to
> > > > realize it's simply not possible.
> > > >
> > > > Honestly, if the main reason we can't have a single swap table for vswap
> > > > is saving 8 bytes on the reverse mapping, it sounds like a weak-ish
> > > > argument, even if we can't optimize the reverse mapping away. But maybe
> > > > I am also out of touch with RAM prices :)
> > >
> > > In terms of the space overhead I do agree, FWIW :)
> > >
> > > I think the other concern is the indirection overhead with going
> > > through the xarray for every swap operation, hence the per-CPU vswap
> > > cluster lookup caching idea:
> > >
> > > https://lore.kernel.org/all/20260505153854.1612033-23-nphamcs@xxxxxxxxx/
> >
> > Right, but we should already avoid the xarray with the swap table
> > design, right? We just have one swap table pointing to another
> > essentially?
>
> Hmmm, I don't quite follow your suggestion here.
>
> For normal swap devices, we organize the space into clusters, and
> maintain them in various lists (free, nonfull, full etc.). The only
> difference with a vswap device is we do not have a free list, and have
> the clusters themselves dynamically allocated.
>
> If we're using vswap, we will incur the xarray overhead. There's no
> avoiding that if we want a dynamic indirection layer. We can of course
> revisit this data structure design later.
>
> So yes, it will be one swap table (vswap cluster) pointing to another
> swap table (pswap cluster). But to get to the first swap table, you
> will have to go through xarray still.

Why the xarray? Don't page tables (and shmem page cache) just point
directly to the vswap entry the same way they point to swap entries
today?

*looks at the code*

Oh, it's to find the actual cluster because the vswap file can be
sparse? Hmm yeah I guess we can revisit the data structure here later,
but IIRC xarrays aren't particularly good for sparse data. Maybe it's
usually not sparse in practice.

Maybe a maple tree? :)
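
A rough userspace sketch of the lookup path discussed above, only to
make the two levels of indirection concrete. It is not the patch code:
every name and constant here is hypothetical, and a plain pointer array
stands in for the xarray (or whatever sparse index ends up being used).

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define CLUSTER_SLOTS 512u        /* hypothetical slots per vswap cluster */
#define MAX_CLUSTERS  1024u       /* hypothetical bound on the sparse index */
#define PSLOT_NONE    UINT64_MAX  /* marker: no physical slot backs this entry */

/* One dynamically allocated vswap cluster: virtual offset -> physical slot. */
struct vswap_cluster {
	uint64_t pslot[CLUSTER_SLOTS];
};

/* Stand-in for the xarray: sparse index from cluster number to cluster. */
static struct vswap_cluster *cluster_index[MAX_CLUSTERS];

/* Step 1: find (and optionally allocate) the cluster for a virtual entry. */
struct vswap_cluster *vswap_cluster_get(uint64_t ventry, int create)
{
	uint64_t ci = ventry / CLUSTER_SLOTS;

	if (ci >= MAX_CLUSTERS)
		return NULL;
	if (!cluster_index[ci] && create) {
		struct vswap_cluster *c = malloc(sizeof(*c));

		if (!c)
			return NULL;
		for (unsigned int i = 0; i < CLUSTER_SLOTS; i++)
			c->pslot[i] = PSLOT_NONE;
		cluster_index[ci] = c;
	}
	return cluster_index[ci];
}

/* Record that a virtual entry is backed by a physical slot. */
int vswap_map(uint64_t ventry, uint64_t pslot)
{
	struct vswap_cluster *c = vswap_cluster_get(ventry, 1);

	assert(pslot != PSLOT_NONE);  /* the marker value is reserved */
	if (!c)
		return -1;
	c->pslot[ventry % CLUSTER_SLOTS] = pslot;
	return 0;
}

/* Step 2: index into the cluster's table to reach the physical slot. */
uint64_t vswap_lookup(uint64_t ventry)
{
	struct vswap_cluster *c = vswap_cluster_get(ventry, 0);

	return c ? c->pslot[ventry % CLUSTER_SLOTS] : PSLOT_NONE;
}
```

The cost being debated is step 1: every operation pays for the sparse
index walk before it can touch the first table, which is what the
per-CPU cluster lookup cache would try to skip.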

> > > If folks like it, what I can do is have CONFIG_ZSWAP depend on
> > > CONFIG_VSWAP, remove all the non-vswap logic, and call it a day? :)
> > > Then, on the swap allocation side, if vswap allocation fails and zswap
> > > writeback is disabled, we can error out early.
> >
> > Hmm maybe we can keep it around for now and do that after vswap
> > stabilizes? It ultimately depends on how much complexity we maintain by
> > allowing both.
> >
> > I think another problem is 32-bit, technically zswap can be used on
> > 32-bit now, right? So vswap not supporting 32-bit is a problem.
>
> Ah shoot I forgot about that. Hmmm.
>
> It's not impossible to make vswap support 32-bit. I did that for v6
> after all. It just needs extra fields because we have fewer bits to
> leverage in pointers etc., complicating the logic a bit. Follow-up
> work? :)

Yeah we can do that, but it's a blocker for zswap only using vswap.

> > General question (for both zswap and general swap code), would a boot
> > param make implementation simpler? Right now we seem to key off the swap
> > device having the "vswap" flag, would it help if it was a runtime
> > constant?
>
> Hmmm, even if it's a runtime constant, both branches still have to be
> there, no? Does the boot param simplify it somehow?

Maybe it doesn't simplify the code, but if the branching causes
performance overhead we can use static keys. I guess we can still use
static keys per-swapfile, but it would be more complicated.

Anyway, not super important now.