Re: [RFC PATCH 0/5] mm, swap: Virtual Swap Space (Swap Table Edition)

From: Nhat Pham

Date: Tue Jun 02 2026 - 11:40:26 EST

On Mon, Jun 1, 2026 at 8:25 PM Kairui Song <ryncsn@xxxxxxxxx> wrote:
>
> On Tue, Jun 2, 2026 at 2:06 AM Nhat Pham <nphamcs@xxxxxxxxx> wrote:
> >
> > On Mon, Jun 1, 2026 at 10:45 AM Kairui Song <ryncsn@xxxxxxxxx> wrote:
> > >
> > > On Mon, Jun 1, 2026 at 11:57 PM Nhat Pham <nphamcs@xxxxxxxxx> wrote:
> > > >
> > > > Are you suggesting we merge the virtual table with main swap table?
> > > >
> > > > Man, I'd love to do this. There is a problem though - we have a case
> > > > where we occupy both backing physical swap AND swap cache. Do you
> > > > think we can fit both the physical swap slot handle and the swap cache
> > > > PFN into the same slot in virtual table? Maybe with some expanding...?
> > >
> > > I don't really get why we would need to do that? If you put the PFN
> > > info in the virtual / upper layer, then the count info, locking, and
> > > > all swap IO synchronization (via folio lock), dup (currently protected
> > > by ci lock / folio lock), and allocation (folio_alloc_swap), are all
> > > handled in this layer.
> > >
> > > The physical / lower layer will just hold a reverse entry on
> > > folio_realloc_swap, or no entry at all (no physical layer used, zswap,
> > > or after swap allocation but before IO) right?
> > >
> > > Looking up the actual folio from the physical layer will be a bit
> > > slower since it needs to resolve the reverse entry, but the only place
> > > we need to do that is things like migrate, compaction (none of them
> > > exist yet) which seems totally fine?
> >
> > All of this is correct, but consider swapping in a vswap entry backed
> > by pswap. There are cases where you still want to maintain the pswap
> > slots around backing vswap entry, while having the swap cache folio as
> > well.
> >
> > For e.g, at swap in time, we add the folio into the swap cache. First
> > of all, we need to hold on to the physical swap slot for IO step. But
> > even after IO succeeds, there are cases where you would still like to
> > keep physical swap slots around (for e.g, to avoid swapping out again
> > if the folio is only speculatively fetched).
>
> A reverse entry is enough to hold the physical swap, just like how the
> current hibernation works with a fake shadow, you don't need a PFN
> just for holding that.
>
> >
> > So you have to make sure we have space for both the physical swap
> > slot, and the swap cache folio's PFN at the same time for each vswap
> > entry. So we still need the vtable extension (well maybe the other
> > approach I mentioned could work, but I'm not 100% sure).
>
> Right, vtable extension is fine, there is no redundant data. I just
> mean you don't need to set the PFN twice (for vswap & pswap). So
> simply reusing the PFN format in the vswap layer and solving
> everything there should be enough.

Ah yeah, then I might have misunderstood you here. I thought you were
proposing a way to remove vtable :)

"don't need to set the PFN twice" - completely agree. I'm pretty sure I
did not here, but do let me know if I accidentally set it twice. I'll
be sure to check this myself for the next version.
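
To make sure we mean the same layout, here is a rough userspace C model of
how I read it: the vswap slot carries both the swap cache PFN and the
physical slot handle, and the physical slot carries only a reverse entry.
Every name below (struct vslot, struct pslot, NO_SLOT, NO_PFN, the helpers)
is made up for illustration and is not taken from the patch series or from
the kernel.

```c
#include <assert.h>
#include <stdint.h>

#define NR_VSLOTS 8
#define NR_PSLOTS 8
#define NO_SLOT   UINT32_MAX  /* hypothetical "no entry" marker */
#define NO_PFN    0ULL        /* hypothetical "no swap cache folio" marker */

/* Upper (virtual) layer: the only place the swap cache PFN is stored. */
struct vslot {
	uint64_t pfn;    /* swap cache folio PFN, or NO_PFN */
	uint32_t pslot;  /* backing physical slot, or NO_SLOT */
};

/* Lower (physical) layer: just a reverse entry to the owning vswap slot. */
struct pslot {
	uint32_t vslot;  /* owning virtual slot, or NO_SLOT */
};

static struct vslot vtable[NR_VSLOTS];
static struct pslot ptable[NR_PSLOTS];

void tables_init(void)
{
	for (uint32_t i = 0; i < NR_VSLOTS; i++) {
		vtable[i].pfn = NO_PFN;
		vtable[i].pslot = NO_SLOT;
	}
	for (uint32_t i = 0; i < NR_PSLOTS; i++)
		ptable[i].vslot = NO_SLOT;
}

/* Attach a physical slot to a virtual slot; the PFN is not touched. */
void vslot_set_backing(uint32_t v, uint32_t p)
{
	assert(v < NR_VSLOTS && p < NR_PSLOTS);
	vtable[v].pslot = p;
	ptable[p].vslot = v;
}

/* Record the swap cache folio once, in the virtual layer only. */
void vslot_set_folio(uint32_t v, uint64_t pfn)
{
	assert(v < NR_VSLOTS);
	vtable[v].pfn = pfn;
}

/* Slow path: find the folio from a physical slot via the reverse entry. */
uint64_t pslot_lookup_pfn(uint32_t p)
{
	assert(p < NR_PSLOTS);
	uint32_t v = ptable[p].vslot;
	return v == NO_SLOT ? NO_PFN : vtable[v].pfn;
}
```

In this model a swapped-in entry can keep its physical slot (the reverse
entry stays in place) while the folio sits in swap cache, and the PFN is
written exactly once.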

>
> > > Thanks. Not too complicated, actually our internal kernel
> > > implementation still using si->percpu cluster, and use a counter for
> > > the rotation and each order have a counter :P, it's a bit ugly but
> > > works fine. It still serves pretty well just like the global percpu
> > > cluster, YoungJun's previous per ci percpu cluster also still provides
> > > the fast path, many ways to do that.
> >
> > Sounds like something that should be upstreamed? ;)
>
> I'd love to :), there is a lot of work going on as you can see and
> people seem to have many different proposals about this so I didn't
> prioritize it. I'll try as things settle down.

Yeah understandable. It's a very volatile codebase, with a lot of
folks trying to improve different aspects.

Hopefully we're close to a unified design :)

I'll keep my dedicated vswap per-cpu alloc caching for now, but I'll
get rid of it whenever the per-CPU per-si cache is ready.

>
> > > > >
> > > > > For patch 2, a few routines like vswap_can_swapin_thp seems not
> > > > > needed or should be moved to __swap_cache_alloc? VSWAP_FOLIO is
> > > > > same as swap cache folio check, which is already covered. Same for
> > > > > zero checking, and VSWAP_NONE which is same as swap count check
> > > > > I think. That way we not only save a lot of code, we also no
> > > > > longer need to treat vswap specially.
> > > >
> > > > Unfortunately, I think a lot of this complexity is still needed. Vswap
> > > > adds a new layer, which means new complications :)
> > > >
> > > > For instance, I think you still need vswap_can_swapin_thp. It
> > > > basically enforces that the backend must be something
> > > > swap_read_folio() can handle. That means:
> > > >
> > > > 1. No zswap.
> > > >
> > > > 2. No mixed backend.
> > >
> > > If mixed backend means phys vs zero vs zswap, then we already have
> > > part of that covered with the current swap cache except for the phys
> > > part (zswap part seems very doable with fujunjie's work).
> > > swap_cache_alloc_folio will ensure there is no mixed zerobit, it can
> > > be easily extended to ensure there is no mixed zswap as well
> > > (according to what I've learned from fujunjie's code). Similar logic
> > > for phys detection I think.
> >
> > Yeah it's basically generalizing that check, and handle the case where
> > we can have indirection.
> >
> > I mean I can open-code it, but it has to be there :) And I figure it
> > might be useful to check this opportunistically (at swap_pte_batch,
> > even if it's not guaranteed to be correct down the line) before we
> > even attempt to allocate a large folio etc. to avoid large folio
> > allocation.
>
> Right, but swap_cache_alloc_folio with orders=<large order> won't
> attempt a large allocation if the batch check fails, so that's fine.
>
> > > > Basically:
> > > >
> > > > 1. For vswap entry, not backed by phys swap: record swap memcg, hold
> > > > reference to pin the memcg, but not charging towards swap.current.
> > >
> > > Maybe you don't need to record memcg here since folio->memcg already
> > > have that info?
> > >
> > > I previously had a patch:
> > > https://lore.kernel.org/linux-mm/20260220-swap-table-p4-v1-7-104795d19815@xxxxxxxxxxx/
> > >
> > > > This defers the recording of memcg, the behavior is almost identical to
> > > before, but charging & recording should be cleaner and you don't need
> > > to record memcg at allocation time hence maybe reduce the possibility
> > > of pinning a memcg. I didn't include that in P4 just to reduce LOC,
> > > maybe can be resent or included.
> >
> > That works-ish when the folio is still in swap cache, but say if it's
> > vswap backed by zswap (and the swap cache folio has been reclaimed),
> > you need a place to store the memcg, no?
>
> "Backed by zswap" means the actual swapout already happened, which is
> the case where we always have to record the memcg info because the
> folio is gone, seems still fit in the model.

Hmmm I might have misunderstood you in my last response here.

So what you are doing in that patch:

1. Charge towards folio->memcg when we allocate swap slots, but do not
record or take reference yet.

2. Once we have reclaimed the folio after swap out, we record and
acquire a reference to pin the memcg.

You know what - this would simplify my use case. For vswap entries not
backed by pswap, it *basically* just means I skip step 1 for the vswap
backend. Step 2 is shared for all cases. Donezo.
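
A rough userspace C model of the two steps above, as I understand them. The
types and helper names (struct memcg, struct swap_entry, swap_alloc_charge,
swap_record_memcg) are hypothetical stand-ins, not the real memcg API and
not code from either patch series.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the memcg state we care about here. */
struct memcg {
	long swap_charged;  /* models swap.current */
	int  refs;          /* references held by swap entries */
};

/* Hypothetical stand-in for one swap entry. */
struct swap_entry {
	bool phys_backed;        /* true for pswap, false for vswap-only */
	struct memcg *recorded;  /* NULL until the folio has been reclaimed */
};

/*
 * Step 1: at slot allocation, charge folio->memcg but do not record it
 * or take a reference. Entries without physical backing skip the charge.
 */
void swap_alloc_charge(struct swap_entry *e, struct memcg *folio_memcg, long nr)
{
	if (e->phys_backed)
		folio_memcg->swap_charged += nr;
}

/*
 * Step 2: once the folio is reclaimed after swap out, record the memcg
 * and take a reference to pin it. Shared by every backend.
 */
void swap_record_memcg(struct swap_entry *e, struct memcg *folio_memcg)
{
	assert(e->recorded == NULL);
	e->recorded = folio_memcg;
	folio_memcg->refs++;
}
```

The only per-backend difference in this model is the phys_backed test in
step 1; the uncharge and release side is left out since we have not
discussed it yet.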

You're right. This is simpler :) Let me brew on it a bit longer in
case there might be something we're missing, but it does seem like
this will reduce complexity (and with the added benefits of me not
having to come up with names for helpers).

>
> > Just seems cleaner to centralize this info at vswap layer when it is
> > presented, for now anyway, rather than juggling this on a per-backend
> > basis.
>
> Zswap charge could be merged with vswap I think but pswap we just
> discussed that we might want to charge it differently? And actually
> vswap charge is still quite different from zswap charge if you want to
> make vswap infinitely large? I think we can figure out this part as we
> progress; it's not a major problem at this point.

That was because I misunderstood your suggestions. My bad :)

Anyway, please keep the suggestions and recommendations coming :) I'm
playing with some of your suggestions right now, and waiting for other
folks' inputs as well. Will send out the next version at some point.
If there are no fundamental design flaws, I will un-RFC once I've
addressed all the main issues.