Re: [PATCH v5 00/21] Virtual Swap Space

From: Nhat Pham

Date: Tue Apr 14 2026 - 13:25:38 EST


On Mon, Mar 23, 2026 at 1:05 PM Nhat Pham <nphamcs@xxxxxxxxx> wrote:
>
> On Mon, Mar 23, 2026 at 12:41 PM Kairui Song <ryncsn@xxxxxxxxx> wrote:
> >
> > On Mon, Mar 23, 2026 at 11:33 PM Nhat Pham <nphamcs@xxxxxxxxx> wrote:
> > >
> > > On Mon, Mar 23, 2026 at 6:09 AM Kairui Song <ryncsn@xxxxxxxxx> wrote:
> > > >
> > > > On Sat, Mar 21, 2026 at 3:29 AM Nhat Pham <nphamcs@xxxxxxxxx> wrote:
> > > > > This patch series is based on 6.19. There are a couple more
> > > > > swap-related changes in mainline that I would need to coordinate
> > > > > with, but I still want to send this out as an update for the
> > > > > regressions reported by Kairui Song in [15]. It's probably easier
> > > > > to just build this thing rather than dig through that series of
> > > > > emails to get the fix patch :)
> > > > >
> > > > > Changelog:
> > > > > * v4 -> v5:
> > > > > * Fix a deadlock in memcg1_swapout (reported by syzbot [16]).
> > > > > * Replace VM_WARN_ON(!spin_is_locked()) with lockdep_assert_held(),
> > > > > and use guard(rcu) in vswap_cpu_dead
> > > > > (reported by Peter Zijlstra [17]).
> > > > > * v3 -> v4:
> > > > > * Fix poor swap free batching behavior to alleviate a regression
> > > > > (reported by Kairui Song).
> > > >
> > >
> > > Hi Kairui! Thanks a lot for the testing big boss :) I will focus on
> > > the regression in this patch series - we can talk more about
> > > directions in another thread :)

Hi Kairui,

My apologies if I missed your response, but could you share with me
your full benchmark suite? It would be hugely useful, not just for
this series, but for all swap contributions in the future :) We should
do as much homework ourselves as possible :P

And apologies for the delayed response. I kept going back and forth
between investigating the regression and figuring out what was going
on with the build setups (I missed some of the CONFIGs you had
originally), reducing variance on hosts, etc.

I don't have PMEM, so I have only worked with the zram backend so far.
I did manage to reproduce the regressions you showed me (albeit with a
much smaller gap on certain metrics than your cited numbers, which I
suspect is due to the zram/pmem difference).

There are two benchmarks that I focused on:

1. Usemem - the exact command I ran is:

   time ./usemem --init-time -O -y -x -n 1 56G

My host has 32 GB of memory and 52 processors (x86_64).

Build      real (s)        vs base  sys (s)         tput (KB/s)         free_ms
baseline   175.6 +/- 3.6   —        121.9 +/- 3.3   391,941 +/- 8,333   6,992 +/- 204
vss_v5     184.0 +/- 3.9   +4.8%    130.5 +/- 3.8   376,192 +/- 8,581   8,297 +/- 247

(I hope the formatting works, but let me know if it looks weird).

2. Memhog: time memhog 48G

My host for this one is 16 GB, 52 processors, x86_64 too.

Build      real (s)       vs base  sys (s)
baseline   80.5 +/- 1.9   —        62.7 +/- 2.0
vss_v5     83.0 +/- 1.8   +3.1%    65.7 +/- 1.8

On both benchmarks, I enabled MGLRU to more closely match the setup you had.

Staring at the run logs (and double-checking against the logs you sent
me, to make sure it's not just my system), I noticed some common
patterns across these runs:

1. Kswapd is slower on the vswap side, which shifts work towards
direct reclaim and makes compaction run harder (which hits a weird
contention through zsmalloc - I can expand further, but this is not
vswap-specific, just exacerbated by slower kswapd).

2. Higher swap readahead (albeit with a higher hit rate) - this is
more an artifact of the fact that zero swap pages are no longer backed
by the zram swapfile, which skipped readahead in certain paths. We can
ignore this for now, but it's worth assessing for fast swap backends
in general (zero swap pages, zswap, and so on).

I spent some time perf-ing kswapd, and hacked the usemem binary a bit
so that I could perf the free stage of usemem separately. Most of the
vswap-specific overhead lies in the xarray lookups. Some big offenders
off the top of my head:

1. Right now, in the physical swap allocator, whenever we have an
allocated slot in the range we're checking, we check whether that slot
is swap-cache-only (i.e., no swap count), and if so we try to free it
(if the swapfile is almost full, etc.). This check is cheap if all
swap entry metadata live in the physical swap layer only, but more
expensive when you have to go through another layer of indirection :)

I fixed that by taking one bit in the reverse map to track the
swap-cache-only state, which eliminates the extra lookup without extra
space overhead (on top of the existing design).
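For illustration, here is a userspace sketch of the idea of stealing a
bit in each reverse-map entry (all names are made up; the actual patch
of course packs this into the real reverse-map representation):

```c
/*
 * Hypothetical sketch, NOT the actual kernel code: pack a
 * "swap-cache-only" flag into the low bit of each reverse-map entry,
 * so the physical swap allocator can test it without going through
 * the vswap xarray. The real reverse map stores vswap descriptors;
 * here a plain integer id stands in for one.
 */
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define RMAP_CACHE_ONLY 0x1UL  /* slot is in swap cache, swap count == 0 */

typedef uint64_t rmap_entry_t;

static inline rmap_entry_t rmap_make(uint64_t vswap_id, bool cache_only)
{
	/* id lives in the high bits, flag in the low bit */
	return (vswap_id << 1) | (cache_only ? RMAP_CACHE_ONLY : 0);
}

static inline bool rmap_cache_only(rmap_entry_t e)
{
	/* cheap test: no indirection into the vswap layer needed */
	return e & RMAP_CACHE_ONLY;
}

static inline uint64_t rmap_vswap_id(rmap_entry_t e)
{
	return e >> 1;
}
```

This keeps the "can we reclaim this slot?" test a single memory read
in the physical layer, at the cost of one bit of the id space.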

2. On the free path, in swap_pte_batch(), we check the cgroup to make
sure that the range we pass to free_swap_and_cache_nr() belongs to the
same cgroup, which adds per-PTE overhead for going through the vswap
layer. We can make this check once per range instead, to reduce the
overhead. Even better - we can skip this check in swap_pte_batch()
entirely for the free case, and defer it to later, where we have
already entered the vswap cluster lock context :)
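To illustrate the once-per-range idea (hypothetical helpers, not the
actual swap_pte_batch() code):

```c
/*
 * Hypothetical sketch of hoisting the memcg check out of the per-PTE
 * loop: instead of a vswap lookup per PTE, only the cheap contiguity
 * check stays in the loop, and the cgroup is validated once for the
 * whole batch (or deferred to the vswap cluster-locked section).
 * All names here are made up for illustration.
 */
#include <assert.h>
#include <stddef.h>

struct entry {
	unsigned long slot;  /* virtual swap slot */
	int memcg_id;        /* owning cgroup (checked once, not per PTE) */
};

/* Return the length of the leading contiguous batch in e[0..n). */
static size_t batch_contiguous(const struct entry *e, size_t n)
{
	size_t i;

	if (n == 0)
		return 0;
	for (i = 1; i < n; i++) {
		/* only the contiguity test runs per entry ... */
		if (e[i].slot != e[0].slot + i)
			break;
		/* ... the memcg check for the range is done once,
		 * outside this loop, or deferred until we already
		 * hold the vswap cluster lock. */
	}
	return i;
}
```

The point is just that a contiguous range of virtual slots lives in
one cluster, so one cgroup check per range can replace N per-PTE ones.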

With a bunch of changes like that, I largely closed the gap:

usemem:
Build       real (s)        vs base  sys (s)         tput (KB/s)         free_ms
baseline    175.6 +/- 3.6   —        121.9 +/- 3.3   391,941 +/- 8,333   6,992 +/- 204
new_opt_v2  179.8 +/- 3.0   +2.4%    126.1 +/- 2.9   382,536 +/- 6,662   7,105 +/- 183

memhog:
Build       real (s)       vs base  sys (s)
baseline    80.5 +/- 1.9   —        62.7 +/- 2.0
new_opt_v2  79.9 +/- 1.7   -0.8%    62.4 +/- 1.7

I would also like to point out that some of this overhead is specific
to the swapfile backend case, which is why we don't see it in the
zswap stats I included in v5. Zswap does not require this
swap-cache-only dance, because under virtual swap, zswap only needs
the virtual swap slot as the index (on top of much more negligible
space overhead thanks to the zswap tree merging into the vswap
cluster, no swap charging, no double allocation, etc.).

Anyway, there was still a small gap. The next idea I had is inspired
by the TLB, which caches virtual->physical memory address
translations: I added a per-CPU MRU virtual cluster. The idea is that
a lot of consecutive swap operations operate on the same range of swap
entries - merging these operations of course makes the most sense, but
sometimes it's not convenient to do so. The old, non-vswap design
sometimes locks the physical swap cluster and exposes the swap cluster
struct to callers to pass around, but I would like to avoid that if
possible :)
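A rough userspace sketch of the "TLB-style" cache (a plain static
variable stands in for the per-CPU slot, and a counter stands in for
the real xarray lookup; all names are made up for illustration):

```c
/*
 * Hypothetical sketch of a per-CPU MRU virtual-cluster cache:
 * remember the last virtual cluster we resolved, and only fall back
 * to the expensive xarray lookup on a miss. Consecutive swap
 * operations on nearby slots then hit the cache.
 */
#include <assert.h>
#include <stddef.h>

#define CLUSTER_SHIFT 9  /* e.g. 512 entries per virtual cluster */

struct vswap_cluster {
	unsigned long id;
	/* ... entry metadata, lock, etc. ... */
};

/* Stand-in for this_cpu_ptr(&mru_cluster) in the kernel. */
static struct vswap_cluster *mru;

/* Stand-in for the real xarray lookup; counts misses for testing. */
static int slow_lookups;
static struct vswap_cluster *cluster_lookup_slow(unsigned long cid)
{
	static struct vswap_cluster c;

	slow_lookups++;
	c.id = cid;
	return &c;
}

static struct vswap_cluster *cluster_lookup(unsigned long vswap_slot)
{
	unsigned long cid = vswap_slot >> CLUSTER_SHIFT;

	if (mru && mru->id == cid)  /* hit: same cluster as last op */
		return mru;
	mru = cluster_lookup_slow(cid);  /* miss: full lookup, cache it */
	return mru;
}
```

In the kernel this would need the usual per-CPU discipline
(preemption/migration disabled around the access, invalidation when a
cluster is freed), which the sketch deliberately omits.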

With this change, we close the gap even further - exceeding the
baseline on average in certain cases, but as you can see it's within
noise, so I wouldn't read too much into it:

usemem:
Build      real (s)        vs base  sys (s)         tput (KB/s)          free_ms
baseline   175.6 +/- 3.6   —        121.9 +/- 3.3   391,941 +/- 8,333    6,992 +/- 204
cc_v2      176.4 +/- 5.3   +0.4%    123.6 +/- 5.4   390,405 +/- 12,792   6,987 +/- 296


memhog:
Build      real (s)       vs base  sys (s)
baseline   80.5 +/- 1.9   —        62.7 +/- 2.0
cc_v2      79.9 +/- 0.9   -0.8%    62.1 +/- 1.5

The reclaim and compaction stats tell a similar story:

Reclaim / Compaction (usemem)
Metric          baseline                vss_v5                  new_opt_v2              cc_v2
allocstall      167,787 +/- 10,292      170,532 +/- 15,185      169,782 +/- 9,903       168,635 +/- 13,526
pgsteal_kswapd  6,932,143 +/- 186,411   6,965,962 +/- 288,323   6,968,188 +/- 286,383   7,038,513 +/- 202,696
pgsteal_direct  9,759,350 +/- 480,674   9,978,721 +/- 765,543   9,899,698 +/- 480,781   9,845,668 +/- 544,319
swap_ra         82.9 +/- 22.6           5994.8 +/- 2817.5       4976.8 +/- 1484.2       4718.2 +/- 1510.5
pgmigrate       1,029,901 +/- 428,416   1,687,072 +/- 399,505   1,260,451 +/- 202,603   1,144,560 +/- 490,177

Reclaim / Compaction (memhog)
Metric          baseline                vss_v5                  new_opt_v2              cc_v2
allocstall      101,245 +/- 6,271       109,320 +/- 12,180      100,207 +/- 11,053      99,223 +/- 9,905
pgsteal_kswapd  8,817,264 +/- 432,519   8,436,548 +/- 265,763   8,728,944 +/- 305,101   8,962,443 +/- 589,012
pgsteal_direct  5,408,046 +/- 394,775   5,932,611 +/- 584,873   5,419,891 +/- 551,226   5,349,352 +/- 601,655
swap_ra         66.5 +/- 22.8           8589.5 +/- 3325.1       8954.5 +/- 2661.9       8703.1 +/- 1746.6
pgmigrate       239,410 +/- 46,014      277,193 +/- 71,487      320,672 +/- 59,488      243,989 +/- 136,129

You can see that the later versions gradually restore the baseline's
behavior in terms of reclaim dynamics :)

Some final remarks:
* I still think there's a good chance we can *significantly* close the
overall gap between a design with virtual swap and a design without.
It's a bit premature to commit to a vswap-optional route (which, to be
completely honest, I'm still not confident can satisfy all of our
requirements).

* Regardless of the direction we take, these are all pitfalls that
will be problematic for any virtual swap design, and more generally
some of them will affect any dynamic swap design (which has to go
through some sort of indirection, or a dynamic data structure like an
xarray, that induces some amount of lookup overhead). I hope my work
here can be useful in that sense too, outside of this specific vswap
direction :)

I will clean things up a bit and send you a v6 for further inspection.
Once again, I'd like to express my gratitude for your engagement and
feedback.