Re: [PATCH v5 00/21] Virtual Swap Space

From: Yosry Ahmed

Date: Fri Apr 24 2026 - 15:15:43 EST


> > > > IOW, I think the whole reason we want a virtual layer is to separate
> > > > the backends, which would facilitate tiering. If the virtual layer is
> > > > itself a swapfile, wouldn't it become one of the tiers?
> > >
> > > That's exactly what I hoped, virtual layer being part of the tier.
> > > Tier could be set up per task / cgroup. So is the virtual tier.
> >
> > Just to clarify. I don't think virtual swap should be one of the
> > tiers. I think it should be the mechanism through which we implement
> > tiering (see above). I am not sure if that's what you meant.
>
> YoungJun's swap tiers have been working pretty well without the virtual part:
> https://lore.kernel.org/linux-mm/20260421055323.940344-1-youngjun.park@xxxxxxx/

Does this do promotion/demotion of swap entries?

> > > A standalone implementation of the virtual layer is heavier than
> > > being a swapfile. Actually, I think at this point the word
> > > "swapfile" is misleading. We may rename it to "swap mapping" or
> > > something. A swap mapping could be physical or virtual. A virtual
> > > mapping can realloc from physical ones (redirect), and swapoff of
> > > physical ones just reads their data into the virtual mapping's swap cache.
> >
> > I don't understand this part, please clarify. In my mind, all
> > references to swap entries from outside backend code should refer to a
> > virtual swap ID, which could be pointing to physical swap or zswap or
> > something else.
>
> For example just reserve a type (e.g. type 0) as the virtual type?
> (type is really a bad naming though).
>
> The that swap file (or swap mapping) will be
>
> I was trying that based on this:
> https://lore.kernel.org/linux-mm/20260220-swap-table-p4-v1-15-104795d19815@xxxxxxxxxxx/
>
> It seems to work and the only thing we need is actually just something
> like this one in VSS:
> https://lore.kernel.org/linux-mm/20260320192735.748051-15-nphamcs@xxxxxxxxx/
>
> This part:
> + /* fall back to physical swap device */
> + if (!vswap_alloc_swap_slot(folio)) {
>
> We do a folio_realloc_swap if folio->swap has type 0.
>
> Which means, if there is no virtual device / mapping / file / space
> (I'm not sure how to name it at this point :) ), the ordinary swap
> routine is just still there untouched.
>
> If there is one, and it's being used, then it is still the ordinary
> swap routine, just with an extra allocation (and the extra allocation
> strictly follows YoungJun's tier rule), which is the same as VSS, but
> everything is reused. From a user or high-level interface perspective,
> this can be designed with no difference from VSS. Just with a few
> bonuses: being per memcg / task / runtime optional, zero overhead if
> not enabled, and reusing all the infra.
>
> BTW this deferred allocation (in VSS or dynamic swap mapping, similar
> thing) is actually a bit concerning to me as well. It changes the
> common swapout routine and is maybe worth reconsidering (e.g.
> activate_locked_split and mTHP stats are now ignored?); being optional
> for now also seems safer.

I am not sure if I understand you correctly. I think what you're proposing is:

- Page tables either point directly to a swap slot, or to a virtual swap entry.
- By default, page tables just point to swap slots maintaining current behavior.
- If we have multiple backends (e.g. zswap or tiering), we use virtual
  swap entry instead.
- The physical swapfile has clusters and swap tables (status quo).
- Virtual swap is implemented with clusters and swap tables in a
  virtual space, and each table entry points to an underlying swap slot
  or zswap entry.
- If a page table has a physical swap slot, and we need to do tiering,
  we basically "make it virtual" by making the swap table of the
  physical swapfile point at a virtual swap entry? or another physical
  swapfile? Not sure.
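
To make that reading concrete, here is a minimal userspace sketch of the first few points above. It is not kernel code and not taken from either series. Every name in it (struct swp_entry as modeled here, SWP_TYPE_VIRTUAL, vswap_table, vswap_set_backend, swap_resolve) is made up for illustration, and the only detail borrowed from your mail is reserving type 0 for the virtual space. It deliberately leaves out the last point (turning an existing physical slot virtual), since that is the part I am unsure about.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical userspace model; none of these names are kernel APIs. */
#define SWP_TYPE_VIRTUAL 0u /* assumption: type 0 is reserved for the virtual space */
#define VSWAP_NR_SLOTS   8u /* tiny table, just for the illustration */

enum backend_kind { BACKEND_NONE = 0, BACKEND_SWAPFILE, BACKEND_ZSWAP };

/* What a page table would hold in this model: a (type, offset) pair. */
struct swp_entry {
	unsigned int type;
	uint64_t offset;
};

/* Where the data actually lives. */
struct backend_ref {
	enum backend_kind kind;
	unsigned int type; /* physical swapfile type, if kind == BACKEND_SWAPFILE */
	uint64_t offset;   /* slot in that swapfile, or an opaque zswap handle */
};

/* The "swap table" of the virtual space: one backend reference per slot. */
static struct backend_ref vswap_table[VSWAP_NR_SLOTS];

/* Point a virtual slot at a backend. Returns 0 on success, -1 on a bad slot. */
static int vswap_set_backend(uint64_t vslot, struct backend_ref ref)
{
	if (vslot >= VSWAP_NR_SLOTS)
		return -1;
	vswap_table[vslot] = ref;
	return 0;
}

/* Resolve a page-table entry to the backend that holds the data. */
static struct backend_ref swap_resolve(struct swp_entry entry)
{
	struct backend_ref none = { BACKEND_NONE, 0, 0 };
	struct backend_ref direct = { BACKEND_SWAPFILE, entry.type, entry.offset };

	/* Physical entry: the page table points at the swap slot directly. */
	if (entry.type != SWP_TYPE_VIRTUAL)
		return direct;

	/* Virtual entry: one extra table lookup to find the current backend. */
	if (entry.offset >= VSWAP_NR_SLOTS)
		return none;
	return vswap_table[entry.offset];
}
```

In this model, moving data between backends is a vswap_set_backend() call on the same virtual slot, so the page-table entry never changes. A page table holding a physical entry gets no such indirection, which is the case I am asking about.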

In this design we have swap tables in both the virtual swap space as
well as the physical swapfile, right? How does this work? Where does
the metadata/swapcache live?

I am not sure if I got it right, I am a bit confused.

>
> > I *think* what you're saying is that we should make that optional, but
> > I don't see how this would work. If a page table is pointing at a swap
> > slot in a swapfile, we cannot do tiering or zswap writeback or
> > anything dynamic without updating page tables. So even if the system
> > starts off with one swapfile, we cannot assume we won't add more and
> > set up tiering (or enable zswap) after that, right?
> >
> > I guess we'll keep the swap table in the swapfile and then we'll have
> > it point to a different backend, but I really don't like this design.
> > It's unnecessarily complicated in my opinion. Page tables will either
> > refer to a virtual swap ID or a physical swap slot.
>
> Or in other words, they are all just swap entries, and the swap layer
> handles things internally.
>
> > I think we can simply have swap tables representing the virtual swap
> > space and pointing at the backend directly, whether or not we have
> > zswap or tiering set up. Is the overhead really that bad?
>
> Right... I mean with two layers you will likely have >16 bytes
> overhead, and double lookup.

Why >16 bytes? Do we need anything extra other than the reverse
mapping? Also why do we need a double lookup?

> And I have been thinking about cutting
> down the memory usage to 3 bytes. And you can't make the lower /
> physical layer just a bitmap if you want a reverse mapping, and so far
> many things do require that. If we make the reverse mapping optional
> it might be more complicated than the thing we discussed.
>
> I don't think the thing I described above is that complicated reading
> all the code and solutions so far. Maybe some better abstraction can
> help?

I don't think I quite understand it yet, maybe I am the problem :)

>
> I've seen some vendors doing swap using UFFD just to cut down the
> overhead or having a highly customized backend solution for swap, so I
> was hoping the kernel part could be as minimal as possible.

Interesting.