Re: [PATCH v5 00/21] Virtual Swap Space

From: Kairui Song

Date: Fri Apr 24 2026 - 15:52:35 EST

On Sat, Apr 25, 2026 at 3:12 AM Yosry Ahmed <yosry@xxxxxxxxxx> wrote:
> > https://lore.kernel.org/linux-mm/20260421055323.940344-1-youngjun.park@xxxxxxx/
>
> Does this do promotion/demotion of swap entries?

Not yet, let's do things step by step.

> > For example just reserve a type (e.g. type 0) as the virtual type?
> > (type is really a bad naming though).
> >
> > The that swap file (or swap mapping) will be
> >
> > I was trying that based on this:
> > https://lore.kernel.org/linux-mm/20260220-swap-table-p4-v1-15-104795d19815@xxxxxxxxxxx/
> >
> > It seems to work and the only thing we need is actually just something
> > like this one in VSS:
> > https://lore.kernel.org/linux-mm/20260320192735.748051-15-nphamcs@xxxxxxxxx/
> >
> > This part:
> > + /* fall back to physical swap device */
> > + if (!vswap_alloc_swap_slot(folio)) {
> >
> > We do a folio_realloc_swap if folio->swap have type 0.
> >
> > Which means, if there is no virtual device / mapping / file / space
> > (I'm not sure how to name it at this point :) ), the ordinary swap
> > routine is just still there untouched.
> >
> > If there is one, and it's being used, then, it is still the ordinary
> > swap routine, just do an extra allocation (and the extra allocation
> > strictly follows YoungJun's tier rule), which is same with VSS, but
> > everything is reused. From a user or high level interface perspective,
> > this can be designed with no difference as VSS. Just with a few
> > bonuses: being per memcg / task / runtime optional, zero overhead if
> > not enabled, and reusing all the infra.
> >
> > BTW this deferred allocation (in VSS or dynamic swap mapping, similar
> > thing) is actually a bit concerning to me as well. It changes the
> > common swapout routine and maybe worth reconsideration (e.g.
> > activate_locked_split and mTHP stats is now ignored?), being optional
> > for now also seems safer.
>
> I am not sure if I understand you correctly. I think what you're proposing is:
>
> - Page tables either point directly to a swap slot, or to a virtual swap entry.
> - By default, page tables just point to swap slots maintaining current behavior.

I mean, they are all swap entries, nothing special from the page table
side. The swap subsystem handles things internally.

> - If we have multiple backends (e.g. zswap or tiering), we use virtual
> swap entry instead.

Actually that can just follow the swap priority, or tier rule. Even if
a virtual mapping exists, it can be bypassed. E.g. if you have a large
NBD and don't care about either fragmentation or compression for
offline workload cgroups, then why use a virtual layer for them, which
could double the kmem usage or spend more CPU? Setup is a different
issue which can be discussed.

> - The physical swapfile has clusters and swap tables (status quo).
> - Virtual swap is implemented with clusters and swap tables in a
> virtual space, and each table entry points to an underlying swap slot
> or zswap entry.
> - If a page table has a physical swap slot, and we need to do tiering,
> we basically "make it virtual" by making the swap table of the
> physical swapfile point at a virtual swap entry? or another physical
> swapfile? Not sure.

They are still ordinary swap entries, nothing special. The virtual
space is also just an ordinary swap file (or swap mapping), which is
easy to do:
https://lore.kernel.org/linux-mm/20260220-swap-table-p4-v1-15-104795d19815@xxxxxxxxxxx/

Then its virtual_table will have a different set of swap entries. (I
left that part undone though).

> > Right... I mean with two layers you will likely have >16 bytes
> > overhead, and double lookup.
>
> Why >16 bytes? Do we need anything extra other than the reverse
> mapping? Also why do we need a double lookup?

You will have to store at least the following info: memcg (2 bytes),
shadow (8 bytes), count (at least 1 byte), and reverse mapping (8
bytes, since you have to address a full virtual swap space). That is
already at least 19 bytes, and some type info is also needed. Some of
these can be shrunk, but still, in principle, merging two layers into
one is a kind of optimization.
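
To make the arithmetic above concrete, here is a rough sketch. The
struct, its field names, and the 1-byte type field are hypothetical and
only mirror the sizes listed above; this is not the layout used by VSS
or by the swap table series.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical per-entry metadata for a separate virtual layer.
 * Names and layout are illustrative assumptions only.
 */
struct vswap_desc_sketch {
	uint64_t shadow;   /* workingset shadow: 8 bytes */
	uint64_t rmap;     /* reverse mapping into the virtual space: 8 bytes */
	uint16_t memcg_id; /* memcg id: 2 bytes */
	uint8_t  count;    /* swap count: at least 1 byte */
	uint8_t  type;     /* backend type info: assumed to fit in 1 byte */
};

/* Sum of the raw field sizes, ignoring any compiler padding. */
static inline size_t vswap_desc_payload_bytes(void)
{
	return sizeof(uint64_t)   /* shadow */
	     + sizeof(uint64_t)   /* rmap */
	     + sizeof(uint16_t)   /* memcg_id */
	     + sizeof(uint8_t)    /* count */
	     + sizeof(uint8_t);   /* type */
}

/* The payload alone is already past 16 bytes per entry. */
static inline void vswap_desc_selfcheck(void)
{
	assert(vswap_desc_payload_bytes() > 16);
	assert(sizeof(struct vswap_desc_sketch) >= vswap_desc_payload_bytes());
}
```

The raw fields add up to 20 bytes here, and the compiler may pad the
struct further, so staying at or under 16 bytes would need some of the
fields to be packed or shared with the lower layer.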

You need to look up the virtual layer, then the lower layer, for many
decisions. It was discussed before to introduce more cache bits or
things like that, and I think that is getting overly complex. It
reminds me of the slot cache or HAS_CACHE thing...:
https://lore.kernel.org/linux-mm/CAMgjq7DJrtE-jARik849kCufd0qNnZQs7C8fcyzVOKE14-O+Dw@xxxxxxxxxxxxxx/
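
As a toy illustration of the double lookup (a user-space sketch only;
the tables, names, and NR_SLOTS are made up for this example and do not
correspond to any kernel structure), reading the swap count through a
separate virtual layer takes two dependent table loads, while a merged
layer takes one:

```c
#include <stddef.h>
#include <stdint.h>

#define NR_SLOTS  8
#define SLOT_NONE UINT32_MAX

/* Hypothetical lower-layer slot state. */
struct phys_slot {
	uint8_t count;
};

/* Hypothetical two-layer setup: a virtual table in front of the slots. */
struct two_layer {
	uint32_t virt_to_phys[NR_SLOTS]; /* virtual entry -> slot, or SLOT_NONE */
	struct phys_slot phys[NR_SLOTS];
};

/* Two dependent loads: virtual table first, then the slot it points at. */
static inline int two_layer_count(const struct two_layer *t, uint32_t ventry,
				  unsigned int *loads)
{
	uint32_t slot;

	if (ventry >= NR_SLOTS)
		return -1;
	slot = t->virt_to_phys[ventry];
	(*loads)++;
	if (slot == SLOT_NONE || slot >= NR_SLOTS)
		return -1;
	(*loads)++;
	return t->phys[slot].count;
}

/* Merged layer: the entry indexes the table holding the count directly. */
static inline int one_layer_count(const struct phys_slot *table, uint32_t entry,
				  unsigned int *loads)
{
	if (entry >= NR_SLOTS)
		return -1;
	(*loads)++;
	return table[entry].count;
}
```

Caching lower-layer bits in the upper layer would avoid the second
load, but then the two copies have to be kept in sync, which is the
extra complexity mentioned above.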

> I don't think I quite understand it yet, maybe I am the problem :)

Haha, not at all! Blame me for the poor explanation. To be honest, the
design is still evolving and there are definitely details that need to
be improved. It's hard to discuss these abstractions purely in theory,
so it's probably best to just keep the work moving forward in a clean
way, and to make things simpler and, better still, opt-in first.