Re: [PATCH v5 00/21] Virtual Swap Space
From: Nhat Pham
Date: Mon Apr 20 2026 - 12:05:26 EST
On Mon, Mar 23, 2026 at 3:09 AM Kairui Song <ryncsn@xxxxxxxxx> wrote:
>
> On Sat, Mar 21, 2026 at 3:29 AM Nhat Pham <nphamcs@xxxxxxxxx> wrote:
> > This patch series is based on 6.19. There are a couple more
> > swap-related changes in mainline that I would need to coordinate
> > with, but I still want to send this out as an update for the
> > regressions reported by Kairui Song in [15]. It's probably easier
> > to just build this thing rather than dig through that series of
> > emails to get the fix patch :)
> >
> > Changelog:
> > * v4 -> v5:
> > * Fix a deadlock in memcg1_swapout (reported by syzbot [16]).
> > * Replace VM_WARN_ON(!spin_is_locked()) with lockdep_assert_held(),
> > and use guard(rcu) in vswap_cpu_dead
> > (reported by Peter Zijlstra [17]).
> > * v3 -> v4:
> > * Fix poor swap free batching behavior to alleviate a regression
> > (reported by Kairui Song).
>
> I tested v5 (including the batched-free hotfix) and am still seeing
> significant regressions in both sequential and concurrent swap
> workloads.
>
> Thanks for the update; I can see it's a lot of thoughtful work.
> I had actually already run some tests with the hotfix you previously
> posted on top of v3. I didn't update the results because, very
> unfortunately, I still see a major performance regression even with a
> very simple setup.
>
> BTW, there seems to be a simpler way to reproduce this; just use memhog:
> sudo mkswap /dev/pmem0; sudo swapon /dev/pmem0; time memhog 48G; sudo swapoff -a
>
> Before:
> (I'm using the fish shell on that test machine, so this is fish's time format):
> ________________________________________________________
> Executed in 20.80 secs fish external
> usr time 5.14 secs 0.00 millis 5.14 secs
> sys time 15.65 secs 1.17 millis 15.65 secs
> ________________________________________________________
> Executed in 21.69 secs fish external
> usr time 5.31 secs 725.00 micros 5.31 secs
> sys time 16.36 secs 579.00 micros 16.36 secs
> ________________________________________________________
> Executed in 21.86 secs fish external
> usr time 5.39 secs 1.02 millis 5.39 secs
> sys time 16.46 secs 0.27 millis 16.46 secs
>
> After:
> ________________________________________________________
> Executed in 30.77 secs fish external
> usr time 5.16 secs 767.00 micros 5.16 secs
> sys time 25.59 secs 580.00 micros 25.59 secs
> ________________________________________________________
> Executed in 37.47 secs fish external
> usr time 5.48 secs 0.00 micros 5.48 secs
> sys time 31.98 secs 674.00 micros 31.98 secs
> ________________________________________________________
> Executed in 31.34 secs fish external
> usr time 5.22 secs 0.00 millis 5.22 secs
> sys time 26.09 secs 1.30 millis 26.09 secs
>
> It's obviously a lot slower.
>
> pmem may seem rare, but SSDs are good at sequential I/O, and memhog
> fills every page with the same pattern; backends like ZRAM have
> extremely low overhead for such same-filled pages. Results with ZRAM
> are very similar, and many production workloads have massive amounts
> of samefill memory.
>
> For example, on the Android phone I'm using right now:
> # cat /sys/block/zram0/mm_stat
> 4283899904 1317373036 1370259456 0 1475977216 116457 1991851 87273 1793760
> That's ~450M of samefill pages in ZRAM; we may see more on some
> server workloads. And I'm seeing similar memhog results with ZRAM;
> pmem is just easier to set up, less noisy, and also simulates
> high-speed storage.
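For reference, the ~450M figure checks out from the mm_stat line:
same_pages is the 6th column per the zram docs
(Documentation/admin-guide/blockdev/zram.rst), and with 4 KiB pages
that works out to ~455 MiB. A quick illustrative sketch (the helper
name is mine, not anything in the series):

```python
# Derive the "~450M of samefill" figure from the mm_stat line above.
# Field order per Documentation/admin-guide/blockdev/zram.rst;
# same_pages is the 6th field, and 4 KiB pages are assumed here.
PAGE_SIZE = 4096

def samefill_mib(mm_stat_line: str, page_size: int = PAGE_SIZE) -> float:
    fields = mm_stat_line.split()
    same_pages = int(fields[5])  # pages stored without data, e.g. zero-filled
    return same_pages * page_size / (1 << 20)

mm_stat = "4283899904 1317373036 1370259456 0 1475977216 116457 1991851 87273 1793760"
print(f"{samefill_mib(mm_stat):.0f} MiB")  # -> 455 MiB, i.e. ~450M
```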
>
> I also ran the previous usemem matrix, which seems better than v3 but
> still pretty bad:
> Test: usemem --init-time -O -n 1 56G, 16G mem, 48G swap, averages of 8 runs.
> Before:
> Throughput (Sum): 528.98 MB/s Throughput (Mean): 526.113333 MB/s Free
> Latency: 3037932.888889
> After:
> Throughput (Sum): 453.74 MB/s Throughput (Mean): 454.875000 MB/s Free
> Latency: 5001144.500000 (throughput ~14% lower, free latency ~64% higher)
>
> I'm not sure why our results differ so much; perhaps different LRU
> settings, memory pressure ratios, or THP/mTHP configs? My exact
> config is in the attachment, along with the full log and system info;
> all debug options are disabled to stay close to production. I ran the
> test 8 times and attached just the first result log since they are
> all similar; my test framework reboots the machine after each run to
> reduce potential noise.
>
> And the above tests only cover sequential performance; the concurrent
> ones seem worse:
> Test: usemem --init-time -O -R -n 32 622M, 16G mem, 48G swap, averages of 8 runs.
> Before:
> Throughput (Sum): 5467.51 MB/s Throughput (Mean): 170.04 MB/s Free
> Latency: 28648.65
> After:
> Throughput (Sum): 4914.86 MB/s Throughput (Mean): 152.74 MB/s Free
> Latency: 67789.81 (throughput ~10% lower, free latency ~2.4x the baseline)
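Just to quantify the concurrent numbers you quoted, a quick sketch
(pure arithmetic on the figures above) gives a ~10% throughput drop
and free latency at roughly 2.4x the baseline:

```python
# Relative change of the concurrent usemem numbers quoted above.
before = {"tput_sum": 5467.51, "free_lat": 28648.65}
after = {"tput_sum": 4914.86, "free_lat": 67789.81}

tput_drop = 100.0 * (1.0 - after["tput_sum"] / before["tput_sum"])
free_ratio = after["free_lat"] / before["free_lat"]
print(f"throughput down {tput_drop:.1f}%, free latency {free_ratio:.2f}x")
# -> throughput down 10.1%, free latency 2.37x
```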
For this test case, I took my 16G (technically a bit less) 52-core
host for a spin, using zram as the backend and MGLRU.
Keeping the same parameters as your usemem command unfortunately led
to massive thrashing (even with the baseline kernel) - zram still
consumes physical memory, so the overcommit level was too large
(especially with the random access pattern, i.e. the -R flag).
I then tried reducing the 622M to 480M, but the problem with that was
that VSS5 did not show any regression - probably because the
overcommit was too low, or there was not enough concurrency. I had to
push the concurrency up to 52 workers, allocating 300M each (slightly
more memory allocated overall than the 480M x 32 case), to finally
reproduce the regression you reported. Variance was very big with 8
runs though (what I normally use for usemem these days), so I had to
do 20 runs per kernel - fortunately these runs are fast:
Metric       baseline         vss_v5           new_opt_v2       cc_v2
real (s)     15.0 +/- 0.8     18.3 +/- 1.8     15.1 +/- 1.0     14.7 +/- 1.0
sys (s)      396.4 +/- 31.1   511.9 +/- 60.3   404.1 +/- 34.5   392.4 +/- 39.9
tput (KB/s)  28188 +/- 6996   23287 +/- 6629   27999 +/- 6623   28744 +/- 7015
free (ms)    101.1 +/- 52.4   91.4 +/- 41.5    93.1 +/- 43.8    97.6 +/- 49.5
% real       n/a              +22.4%           +0.7%            -1.7%
% sys        n/a              +29.1%           +1.9%            -1.0%
% tput       n/a              -17.4%           -0.7%            +2.0%
% free       n/a              -9.6%            -7.9%            -3.5%
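(For transparency, the % rows are just each kernel's mean relative to
the baseline mean, i.e. 100 * (x / base - 1); tiny differences when
recomputing, e.g. +22.0% vs +22.4% for real, come from the displayed
means being rounded. A sketch with the vss_v5 column:)

```python
# The "% real / % sys / % tput / % free" rows: mean of each kernel
# relative to the baseline mean, i.e. 100 * (x / base - 1).
def pct_delta(x: float, base: float) -> float:
    return 100.0 * (x / base - 1.0)

baseline = {"real": 15.0, "sys": 396.4, "tput": 28188, "free": 101.1}
vss_v5 = {"real": 18.3, "sys": 511.9, "tput": 23287, "free": 91.4}

for metric in baseline:
    print(f"% {metric}: {pct_delta(vss_v5[metric], baseline[metric]):+.1f}%")
# -> % real: +22.0%  % sys: +29.1%  % tput: -17.4%  % free: -9.6%
```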
(I realized I mangled the "memory reclaim metrics table" output last
time due to auto line-breaking. Let's hope this one is better.)
Strangely, no free latency regression here. Hmmm.
But the real, sys, and throughput regressions are real. The
optimizations do close the gap to within noise level here too.