Re: [RFC PATCH 0/5] mm, swap: Virtual Swap Space (Swap Table Edition)

From: Kairui Song

Date: Mon Jun 01 2026 - 13:46:23 EST

On Mon, Jun 1, 2026 at 11:57 PM Nhat Pham <nphamcs@xxxxxxxxx> wrote:
>
> On Mon, Jun 1, 2026 at 12:34 AM Kairui Song <ryncsn@xxxxxxxxx> wrote:
> >
> > For the format part, PHYS don't need that much bits I think,
> > so by slightly adjust the format vswap device could be share
> > mostly the same format with ordinary device.
> >
> > For example typical modern system don't have a address space larger
> > than 52 bit. (Even with full 64 bits used for addressing, shift it
> > by 12 we get 52). Plus 5 for type, you get 57, so you can have a
> > marker that should work as long as it shorter than 1000000 for PHYS,
> > and shared for all table format since it's not in conflict with
> > anything. You have also use a few extra bits so a single swap space
> > can be 8 times larger than RAM space, and since we can help
> > multiple swap type I think that should be far than enough?
> >
> > Then you have Shadow back at 001, and zero bit in shadow. The only
> > special one is Zswap, which will be 100 now, and that's exactly the
> > reserved pointer format in current swap table format, on seeing
> > si->flags & VSWAP && is_pointer(swp_tb) you know that's zswap :)
>
> Are you suggesting we merge the virtual table with main swap table?
>
> Man, I'd love to do this. There is a problem though - we have a case
> where we occupy both backing physical swap AND swap cache. Do you
> think we can fit both the physical swap slot handle and the swap cache
> PFN into the same slot in virtual table? Maybe with some expanding...?

I don't really get why we would need to do that? If you put the PFN
info in the virtual / upper layer, then the count info, locking, and
all swap IO synchronization (via folio lock), dup (current protected
by ci lock / folio lock), and allocation (folio_alloc_swap), are all
handled in this layer.

The physical / lower layer will just hold a reverse entry on
folio_realloc_swap, or no entry at all (no physical layer used, zswap,
or after swap allocation but before IO) right?

Looking up the actual folio from the physical layer will be a bit
slower since it needs to resolve the reverse entry, but the only place
we need to do that is things like migrate, compaction (none of them
exist yet) which seems totally fine?

> > This could be imporved by per-si percpu cluster. Both YoungJun's
> > tiering and Baoquan's previous swap ops mentioned this is needed,
> > and now vswap also need that. If the vswap is also a si, then it will
> > make use of this too.
>
> Yeah I made the same recommendation when I review swap tier last week:
>
> https://lore.kernel.org/all/CAKEwX=N2XcMHN1jatppOk6wnmz-Shab5XMtTtzgYOzRvU_6YFw@xxxxxxxxxxxxxx/
>
> I like it, but yeah it will be complicated. That said, I think not
> fixing the fast path for tiering/vswap will seriously restrict their
> usefulness. We don't want to go back to the old swap allocator days :)

Thanks. Not too complicated, actually our internal kernel
implementation still using si->percpu cluster, and use a counter for
the rotation and each order have a counter :P, it's a bit ugly but
works fine. It still serves pretty well just like the global percpu
cluster, YoungJun's previous per ci percpu cluster also still provides
the fast path, many ways to do that.

>
> >
> > YoungJun posted this a few month before:
> > https://lore.kernel.org/linux-mm/20260131125454.3187546-5-youngjun.park@xxxxxxx/
> >
> > The concern is that some locking contention could be heavier, or maybe
> > that's just a hypothetical problem though.
>
> I don't think it's hypothetical. At least with vswap, it's very easy
> to get into a state where the shared per-cpu cache gets invalidated
> constantly if phys swap and vswap allocation alternates (which is
> actually very possible under heavy memory pressure), hammering the
> slow paths...

I mean if the per-cpu cache is moved to si level, then whoever enters
the allocation path of a si will almost always get a stable percpu
cache to use, even if the last used si changes.

We might better have a more explicit per si fast path for that. The
plist rotation should better be done in a different (will be even
better if lockless) way.

>
> >
> > >
> > > - Runtime enable/disable of the vswap device. To be honest, I don't
> > > know if there is a value in this. My preference is vswap can be
> > > optimized to the point that any overhead is negligible. Failing that,
> > > maybe we can come up with some simple heuristics that automatically
> > > decides for users?
> > >
> > > In this RFC, CONFIG_VSWAP=y means the vswap device is always created at
> > > boot, and CONFIG_VSWAP=n means the vswap device is never created. This
> > > *might* be enough just on its own.
> > >
> > > Is a runtime knob (sysfs or sysctl) worth the complexity beyond
> > > these heuristics? I'm not sure yet. Maintaining both cases
> >
> > I checked the code and I think it's not hard to do, patch 1 already
> > handling the meta data dynamically, everything will still just work
> > even if you remove vswap at runtime. The rest of patches need adaption
> > but might not end up being complex, it other comments here
> > are considered.
>
> Yeah, it's not terribily hard to do. I'm more wondering if it's worth
> the effort, both for the implementer and the user :)
>
> As I said here, if we want vswap, just enable it at boot time and get
> a vast (but dynamic) device. We can make it optional per-cgroup
> through Youngjun's interface, and that would be good enough?
>
> >
> > For patch 2, a few routines like vswap_can_swapin_thp seems not
> > needed or should be moved to __swap_cache_alloc? VSWAP_FOLIO is
> > same as swap cache folio check, which is already covered. Same for
> > zero checking, and VSWAP_NONE which is same as swap count check
> > I think. That way we not only save a lot of code, we also no
> > longer need to treat vswap specially.
>
> Unfortunately, I think a lot of this complexity is still needed. Vswap
> adds a new layer, which means new complications :)
>
> For instance, I think you still need vswap_can_swapin_thp. It
> basically enforces that the backend must be something
> swap_read_folio() can handle. That means:
>
> 1. No zswap.
>
> 2. No mixed backend.

If mixed backend means phys vs zero vs zswap, then we already have
part of that covered with the current swap cache except for the phys
part (zswap part seems very doable with fujunjie's work).
swap_cache_alloc_folio will ensure there is no mixed zerobit, it can
be easily extended to ensure there is no mixed zswap as well
(according to what I've learned from fujunjie's code). Similar logic
for phys detection I think.

> > For memcg table allocation, on demand seems a good idea, and actually
> > we are not far from there, I tried to generalize the
> > alloc-then-retry-sleep-alloc in swap_alloc_table but still not generic
> > enough I guess.. Good new is the allocation of the table is already
> > kind of ondemand, just need to split the detection of these two kind
> > of table.
>
> I have a prototype of this, but I have not tested, so I do not want to
> send it out. :)
>
> TLDR is - I still want to record the memcg for vswap (just not charge
> it towards the counter). So we still need memcg_table at both level,
> generally - just not allocating until needed (basically if a physical
> swap slot in the cluster is directly mapped into PTE). You can kinda
> tell, since you pass the folio into the allocation path - with some
> care you can distinguish between:
>
> 1. Virtual swap, or directly maped physical swap -> need memcg_table
>
> 2. Physical swap, backing vswap -> does not memcg_table.
>
> Another alternative is you can defer this allocation until the point
> where you have to do the charging action. But then you have to be
> careful with failure handling, and need to backoff ya di da di da.
> Funsies.
>
> I think I did a mixed of these 2 strategies. Anyway, I'll include the
> patch in v2 (if folks like this approach).
>
> >
> > Mean while I also remember we once discussed about splitting the
> > accounting for vswap / physical swap? If we went that approach we
> > don't need to treat memcg_table specially.
>
> For the charging behavior, I already have a patch for it actually in
> this series (just not the dynamic allocation of the memcg_table field
> yet).
>
> Basically:
>
> 1. For vswap entry, not backed by phys swap: record swap memcg, hold
> reference to pin the memcg, but not charging towards swap.current.

Maybe you don't need to record memcg here since folio->memcg already
have that info?

I previously had a patch:
https://lore.kernel.org/linux-mm/20260220-swap-table-p4-v1-7-104795d19815@xxxxxxxxxxx/

The defers the recording of memcg, the behavior is almost identical to
before, but charging & recording should be cleaner and you don't need
to record memcg at allocation time hence maybe reduce the possibility
of pinning a memcg. I didn't include that in P4 just to reduce LOC,
maybe can be resent or included.

> 2. For phys swap backing vswap: charging towards swap.current, but
> does not record the memcg in its memcg_table, nor does it hold
> reference to memcg (its vswap entry holds the reference already)
>
> 2. For phys swap directly mapped to PTE: charges, records, and holds reference.
>
> The motivation here is I do not want vswap entry to shares the same
> limit as phys swap counter. If we think of it as "infinite" or
> "dynamic", it should not be capped at all, but even if it is charged,
> it should be something separate.

Good to know, it's not too messy to make everything dynamic, but if
our ultimate goal is to account for them separately, maybe we can save
some effort here.

>
> >
> > > - Widen swap_info_struct->max to unsigned long. The vswap device's
> > > max is currently clamped to ALIGN_DOWN(UINT_MAX, SWAPFILE_CLUSTER)
> > > (~16 TiB) to fit in unsigned int. 16 TiB is small for vswap,
> > > especially when we're getting increasingly big machines memory-wise.
> >
> > This should be very easy to do, just replace unsigned int with
> > unsigned long, a lot of place to touch though :)
>
> Agree. I'm just lazy, and this sounds like a simple patch as a
> follow-up. This RFC is already 2000 LoCs - I do not want to burden
> reviewers with extra useless details :)

No problem, good to see progress here :)