Re: [PATCH v5 00/21] Virtual Swap Space
From: Kairui Song
Date: Thu Apr 16 2026 - 14:49:02 EST
On Wed, Apr 15, 2026 at 1:23 AM Nhat Pham <nphamcs@xxxxxxxxx> wrote:
> Hi Kairui,
>
> My apologies if I missed your response, but could you share with me
> your full benchmark suite? It would be hugely useful, not just for
> this series, but for all swap contributions in the future :) We should
> do as much homework ourselves as possible :P
>
> And apologies for the delayed response. I kept having to back and
> forth between regression investigating, and figuring out what was
> going on with the build setups (I missed some of the CONFIGs you had
> originally), reducing variance on hosts, etc.
>
Hello Nhat!
No worries. I was also thinking about submitting some in-tree tests for
this so testing will be easier, but I got really busy with some other
issues, series, and the upcoming LSFMM. I'll find some time to do
that.
>
> 1. Kswapd is slower on the vswap side, which shifts work towards
> direct reclaim, and makes compaction have to run harder (which has a
> weird contention through zsmalloc - I can expand further, but this is
> not vswap-specific, just exacerbated by slower kswapd).
It might be related; e.g. could the dynamic allocation and RCU freeing
of vswap data cause more fragmentation, and hence more pressure?
> 2. Higher swap readahead (albeit with higher hit rate) - this is more
> of an artifact of the fact that zero swap pages are no longer backed
> by zram swapfile, which skipped readahead in certain paths. We can
> ignore this for now, but worth assessing this for fast swap backends
> in general (zero swap pages, zswap, so on and so forth).
Hmm... that brings up another question: you can't tell the backend
type, or properly do readahead, until you look down through the
virtual layer, I guess?
> I spent some time perf-ing kswapd, and hacked the usemem binary a bit so
> that I can perf the free stage of usemem separately. Most of the
> vswap-specific overhead lies in the xarray lookups. Some big offenders
> on top of my mind:
>
> 1. Right now, in the physical swap allocator, whenever we have an
> allocated slot in the range we're checking, we check if that slot is
> swap-cache-only (i.e no swap count), and if so we try to free it (if
> swapfile is almost full etc.). This check is cheap if all swap entry
> metadata live in physical swap layer only, but more expensive when you
> have to go through another layer of indirection :)
>
> I fixed that by just taking one bit in the reverse map to track
> swap-cache-only state, which eliminates this without extra space
> overhead (on top of the existing design).
Isn't that HAS_CACHE :) ?
> 2. On the free path, in swap_pte_batch(), we check cgroup to make sure
> that the range we pass to free_swap_and_cache_nr() belongs to the same
> cgroup, which has a per-PTE overhead for going to the vswap layer. We
This might be helpful:
https://lore.kernel.org/linux-mm/20260417-swap-table-p4-v2-8-17f5d1015428@xxxxxxxxxxx/
I observed a similar but much smaller issue with the current swap too.
Deferring the cgroup lookup to the swap-cache layer, where we already
grab the cluster (in a later commit), should reduce a lot of overhead.
It requires some unification of the allocation paths, though, as shown
in that series; things will be much easier after that :)
> Anyway, still a small gap. The next idea that I have is inspired by
> TLB, which cache virtual->physical memory address translation. I added
I think this is getting overly complex... You have a mandatory virtual
layer, which already comes with a cluster cache inside; the lower
physical allocator has its own cluster cache; and then a new set of
caches on top of the virtual layer?
>
> Some final remarks:
> * I still think there's a good chance we can *significantly* close the
> gap overall between a design with virtual swap and a design without.
> It's a bit premature to commit to a vswap-optional route (which to be
> completely honest I'm still not confident is possible to satisfy all
> of our requirements).
>
> * Regardless of the direction we take, these are all pitfalls that
> will be problematic for virtual swap design, and more generally some
> of them will affect any dynamic swap design (which has to go through
> some sort of indirection or a dynamic data structure like xarray that
> will induce some amount of lookup overhead). I hope my work here can
> be useful in this sense too, outside of this specific vswap direction
> :)
Glad to know things are getting better! We can definitely work
something out. But besides the problems above, I think there are some
other concerns that need to be solved too. The good part is that I
think everyone agrees some kind of intermediate layer is needed.