Re: [PATCH v5 00/21] Virtual Swap Space

From: Nhat Pham

Date: Fri May 01 2026 - 10:17:09 EST

On Fri, Apr 24, 2026 at 8:52 PM Kairui Song <ryncsn@xxxxxxxxx> wrote:
>
> On Sat, Apr 25, 2026 at 3:12 AM Yosry Ahmed <yosry@xxxxxxxxxx> wrote
> > > https://lore.kernel.org/linux-mm/20260421055323.940344-1-youngjun.park@xxxxxxx/
> >
> > Does this do promotion/demotion of swap entries?
>
> Not yet, let's do things step by step.
>
> > > For example just reserve a type (e.g. type 0) as the virtual type?
> > > (type is really a bad naming though).
> > >
> > > The that swap file (or swap mapping) will be
> > >
> > > I was trying that based on this:
> > > https://lore.kernel.org/linux-mm/20260220-swap-table-p4-v1-15-104795d19815@xxxxxxxxxxx/
> > >
> > > It seems to work and the only thing we need is actually just something
> > > like this one in VSS:
> > > https://lore.kernel.org/linux-mm/20260320192735.748051-15-nphamcs@xxxxxxxxx/
> > >
> > > This part:
> > > + /* fall back to physical swap device */
> > > + if (!vswap_alloc_swap_slot(folio)) {
> > >
> > > We do a folio_realloc_swap if folio->swap have type 0.
> > >
> > > Which means, if there is no virtual device / mapping / file / space
> > > (I'm not sure how to name it at this point :) ), the ordinary swap
> > > routine is just still there untouched.
> > >
> > > If there is one, and it's being used, then, it is still the ordinary
> > > swap routine, just do an extra allocation (and the extra allocation
> > > strictly follows YoungJun's tier rule), which is same with VSS, but
> > > everything is reused. From a user or high level interface perspective,
> > > this can be designed with no difference as VSS. Just with a few
> > > bonuses: being per memcg / task / runtime optional, zero overhead if
> > > not enabled, and reusing all the infra.
> > >
> > > BTW this deferred allocation (in VSS or dynamic swap mapping, similar
> > > thing) is actually a bit concerning to me as well. It changes the
> > > common swapout routine and maybe worth reconsideration (e.g.
> > > activate_locked_split and mTHP stats is now ignored?), being optional
> > > for now also seems safer.
> >
> > I am not sure if I understand you correctly. I think what you're proposing is:
> >
> > - Page tables either point directly to a swap slot, or to a virtual swap entry.
> > - By default, page tables just point to swap slots maintaining current behavior.
>
> I mean, they are all swap entries, nothing special from the page table
> side. Swap subsystems handle things internally.
>
> > - If we have multiple backends (e.g. zswap or tiering), we use virtual
> > swap entry instead.
>
> Actually that can just follow the swap priority, or tier rule. Even if
> virtual mapping exists, it can be bypassed. e.g. you have a large NBD
> and don't care about either fragmentation or compression for offline
> workload cgroups, then why use a virtual layer for them which could
> double the kmem usage or spend more CPU? Setup is a different issue
> which can be discussed.

I assume NBD == network block device here?

If you use an NBD, I think vswap overhead is not going to be the
bottleneck here :)

And what about reliability? Say you allocate a slot on the NBD, unmap
the page from the PTEs, then proceed to swap_writeout(). What if the
NBD is no longer available? What if the IO fails? If you have already
encoded the physical swap slot location in the PTEs, then it's very
expensive to correct this mistake. Whereas with vswap, you can fall
back to another device if you so choose, and all it takes is a
simple backend change at the vswap layer.
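
A minimal userspace sketch of that fallback, purely for illustration.
Every name here (vswap_desc, backend_write, vswap_writeout, the backend
enum) is hypothetical and not taken from the series. The point is only
that the PTE would hold the virtual slot, so a failed write is handled
by re-pointing the descriptor at another backend:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical backend identifiers; not the types used in the series. */
enum backend { BACKEND_NONE, BACKEND_NBD, BACKEND_LOCAL_SSD };

/*
 * Hypothetical per-virtual-slot descriptor. PTEs would store only the
 * virtual slot index, never the backend or its offset.
 */
struct vswap_desc {
	enum backend backend;
	unsigned long offset;	/* slot offset inside the chosen backend */
};

/* Simulated device state: pretend the NBD has gone away. */
static bool backend_online[] = {
	[BACKEND_NONE] = false,
	[BACKEND_NBD] = false,
	[BACKEND_LOCAL_SSD] = true,
};

/* Stand-in for real IO submission: fails if the device is offline. */
static bool backend_write(enum backend b, unsigned long offset)
{
	(void)offset;
	return backend_online[b];
}

/*
 * Try each candidate backend in order. On failure only the descriptor
 * changes; nothing that refers to the virtual slot has to be rewritten.
 */
static bool vswap_writeout(struct vswap_desc *desc,
			   const enum backend *candidates, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		desc->backend = candidates[i];
		desc->offset = i;	/* placeholder for a real slot allocation */
		if (backend_write(desc->backend, desc->offset))
			return true;
	}
	desc->backend = BACKEND_NONE;
	return false;
}
```

With the candidate order { BACKEND_NBD, BACKEND_LOCAL_SSD }, the write
to the offline NBD fails and the descriptor ends up on the local SSD.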

Another issue with the current physical swapfile allocator is that it
induces physical contiguity where it's absolutely not needed. I don't
know if this is the case with an NBD, but for flash devices, for
example, contiguity obviously makes things more efficient. Still, it
would be nice if we could use discontiguous swapout as a fallback.

I feel like NBD is an argument FOR virtualization, not against.

>
> > - The physical swapfile has clusters and swap tables (status quo).
> > - Virtual swap is implemented with clusters and swap tables in a
> > virtual space, and each table entry points to an underlying swap slot
> > or zswap entry.
> > - If a page table has a physical swap slot, and we need to do tiering,
> > we basically "make it virtual" by making the swap table of the
> > physical swapfile point at a virtual swap entry? or another physical
> > swapfile? Not sure.
>
> They are still ordinary swap entries, nothing special. The virtual
> space is also just a ordinary swap file (or swap mapping), which is
> easy to do:
> https://lore.kernel.org/linux-mm/20260220-swap-table-p4-v1-15-104795d19815@xxxxxxxxxxx/
>
> Then its virtual_table will have a different set of swap entries. (I
> left that part undone though).
>
> > > Right... I mean with two layers you will likely have >16 bytes
> > > overhead, and double lookup.
> >
> > Why >16 bytes? Do we need anything extra other than the reverse
> > mapping? Also why do we need a double lookup?
>
> You will have to store at least the following info: memcg (2 bytes),
> shadow (8 bytes), count (at least 1 bytes), and revert mapping (8
> bytes, since you have to address a full virtual swap space). And some
> type info is also needed. Part of them can be shrinked but still,
> scientifically, merging two layers into one is considered a kind of
> optimization.
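
To make the arithmetic above concrete, here is a sketch with a
hypothetical struct (the field names are made up for illustration and
are not from any posted patch). The listed fields sum to 19 bytes when
byte-packed, before any type info is added:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical layout: just the fields listed above, byte-packed. */
struct vswap_meta_packed {
	uint16_t memcg_id;	/* memcg: 2 bytes */
	uint64_t shadow;	/* shadow: 8 bytes */
	uint8_t  count;		/* count: at least 1 byte */
	uint64_t rmap;		/* reverse mapping: 8 bytes */
} __attribute__((packed));

/* 2 + 8 + 1 + 8 = 19, which is already above 16 bytes. */
static_assert(sizeof(struct vswap_meta_packed) == 19,
	      "packed metadata should be 19 bytes");
```

Without the packed attribute the compiler would add alignment padding,
so the unpacked size would be larger still.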

Optimization is always a worthwhile pursuit, of course. But you have to
weigh it against what we can buy with a more flexible design, which
might end up buying us more performance wins down the line.

In the immediate term, vswap buys you a dynamic compressed layer while
maintaining the ability to write back.

Looking a bit longer term, I don't think you can do the following
without a layer of indirection here:

1. Compressed writeback.

2. Discontiguous swapouts. I think we need this as a fallback for THP
swapping (see [1] for the discussion).

3. Mixed backend swapin.

4. Optimizing swap IO - if sequential patterns matter for example, you
need the ability to delay or change backend allocation. The current
model is way too inflexible to allow for that.

5. Adding new swap backends. We want to decouple what the MM subsystem
needs (which is minimally captured in the virtual layer) from what
the backend itself wants.
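
As a rough illustration of item 2, here is a userspace sketch in which
a virtually contiguous range is backed by whatever physical slots
happen to be free. All names (phys_used, phys_alloc_one,
vswap_back_range) are hypothetical and this is not how the series
implements it:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NR_PHYS_SLOTS 16
#define INVALID_SLOT ((size_t)-1)

/* Hypothetical free map of a (possibly fragmented) physical swapfile. */
static bool phys_used[NR_PHYS_SLOTS];

/* Grab any free physical slot; contiguity is not required. */
static size_t phys_alloc_one(void)
{
	for (size_t i = 0; i < NR_PHYS_SLOTS; i++) {
		if (!phys_used[i]) {
			phys_used[i] = true;
			return i;
		}
	}
	return INVALID_SLOT;
}

/*
 * Back nr virtually contiguous slots with arbitrary physical slots.
 * vtable[i] records the physical slot behind the i-th virtual slot.
 */
static bool vswap_back_range(size_t *vtable, size_t nr)
{
	for (size_t i = 0; i < nr; i++) {
		vtable[i] = phys_alloc_one();
		if (vtable[i] == INVALID_SLOT) {
			/* Roll back so a failed attempt pins no slots. */
			while (i-- > 0) {
				phys_used[vtable[i]] = false;
				vtable[i] = INVALID_SLOT;
			}
			return false;
		}
	}
	return true;
}
```

If every other physical slot is already in use, a 4-slot request still
succeeds here, whereas an allocator that insists on 4 contiguous
physical slots would fail.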

Youngjun's paper [2] is a case study for what you can buy with virtualization.

[1]: https://lore.kernel.org/all/6869b7f0-84e1-fb93-03f1-9442cdfe476b@xxxxxxxxxx/
[2]: https://ieeexplore.ieee.org/document/8662047

>
> You need lookup the virtual layer, then the lower layer for many
> decision making, is was discussed before to introduce more cache bit
> or things like that and I think that is getting over complex, reminds
> me of the slot cache or HAS_CACHE thing...:
> https://lore.kernel.org/linux-mm/CAMgjq7DJrtE-jARik849kCufd0qNnZQs7C8fcyzVOKE14-O+Dw@xxxxxxxxxxxxxx/
>
> > I don't think I quite understand it yet, maybe I am the problem :)
>
> Haha, not at all! Blame me for the poor explanation. To be honest, the
> design is still evolving and there are definitely details that need to
> be improved. It's hard to discuss these abstractions purely in theory,
> so it's probably best just keep the works moving forward in a clean
> way, and make things simpler and better be opt-in first.