Re: [PATCH v5 00/21] Virtual Swap Space

From: Nhat Pham

Date: Mon Mar 23 2026 - 12:12:50 EST


On Mon, Mar 23, 2026 at 6:09 AM Kairui Song <ryncsn@xxxxxxxxx> wrote:
>
> On Sat, Mar 21, 2026 at 3:29 AM Nhat Pham <nphamcs@xxxxxxxxx> wrote:
> > This patch series is based on 6.19. There are a couple more
> > swap-related changes in mainline that I would need to coordinate
> > with, but I still want to send this out as an update for the
> > regressions reported by Kairui Song in [15]. It's probably easier
> > to just build this thing rather than dig through that series of
> > emails to get the fix patch :)
> >
> > Changelog:
> > * v4 -> v5:
> > * Fix a deadlock in memcg1_swapout (reported by syzbot [16]).
> > * Replace VM_WARN_ON(!spin_is_locked()) with lockdep_assert_held(),
> > and use guard(rcu) in vswap_cpu_dead
> > (reported by Peter Zijlstra [17]).
> > * v3 -> v4:
> > * Fix poor swap free batching behavior to alleviate a regression
> > (reported by Kairui Song).
>

Hi Kairui! Thanks a lot for the testing, big boss :) I will focus on
the regression in this patch series - we can talk more about
directions in another thread :)

> I tested v5 (including the batched-free hotfix) and am still
> seeing significant regressions in both sequential and concurrent swap
> workloads.
>
> Thanks for the update; I can see it's a lot of thoughtful work.
> Actually, I did run some tests already with your previously posted
> hotfix based on v3. I didn't update the result because, very
> unfortunately, I still see a major performance regression even with a
> very simple setup.
>
> BTW there seems to be a simpler way to reproduce that; just use memhog:
> sudo mkswap /dev/pmem0; sudo swapon /dev/pmem0; time memhog 48G; sudo swapoff -a
>
> Before:
> (I'm using fish shell on that test machine so this is fish time format):
> ________________________________________________________
> Executed in 20.80 secs fish external
> usr time 5.14 secs 0.00 millis 5.14 secs
> sys time 15.65 secs 1.17 millis 15.65 secs
> ________________________________________________________
> Executed in 21.69 secs fish external
> usr time 5.31 secs 725.00 micros 5.31 secs
> sys time 16.36 secs 579.00 micros 16.36 secs
> ________________________________________________________
> Executed in 21.86 secs fish external
> usr time 5.39 secs 1.02 millis 5.39 secs
> sys time 16.46 secs 0.27 millis 16.46 secs
>
> After:
> ________________________________________________________
> Executed in 30.77 secs fish external
> usr time 5.16 secs 767.00 micros 5.16 secs
> sys time 25.59 secs 580.00 micros 25.59 secs
> ________________________________________________________
> Executed in 37.47 secs fish external
> usr time 5.48 secs 0.00 micros 5.48 secs
> sys time 31.98 secs 674.00 micros 31.98 secs
> ________________________________________________________
> Executed in 31.34 secs fish external
> usr time 5.22 secs 0.00 millis 5.22 secs
> sys time 26.09 secs 1.30 millis 26.09 secs
>
> It's obviously a lot slower.
>
> pmem may seem rare, but SSDs are also good at sequential I/O, and
> memhog uses same-filled pages, for which backends like ZRAM have
> extremely low overhead. Results with ZRAM are very similar, and many
> production workloads have massive amounts of same-filled memory.
>
> For example, on the Android phone I'm using right now:
> # cat /sys/block/zram0/mm_stat
> 4283899904 1317373036 1370259456 0 1475977216 116457 1991851
> 87273 1793760
> That's ~450M of same-filled pages in ZRAM; we may see more on some
> server workloads. And I'm seeing similar memhog results with ZRAM;
> pmem is just easier to set up, less noisy, and also simulates
> high-speed storage.

Interesting. Normally "lots of zero-filled pages" is a very beneficial
case for vswap. You don't need a swapfile, or any zram/zswap metadata
overhead - it's a native swap backend. If a production workload has
this many zero-filled pages, I think vswap's numbers would be much
less alarming - perhaps even matching in memory overhead, because you
don't need to maintain zram entry metadata (it's at least 2 words per
zram entry, right?), there's no reverse-map overhead induced (so it's
24 bytes on both sides), and no zram-side locking is needed :)

So I was surprised to see that it's not working out very well here. I
checked the implementation of memhog - let me know if this is the
wrong place to look:

https://man7.org/linux/man-pages/man8/memhog.8.html
https://github.com/numactl/numactl/blob/master/memhog.c#L52

I think this is what happened here: memhog was populating the memory
with 0xff, which triggers the full overhead of a swapfile-backed swap
entry, because even though it's "same-filled", it's not zero-filled! I
was following Usama's observation - "less than 1% of the same-filled
pages were non-zero" - and so I only handled the zero-filled case here:

https://lore.kernel.org/all/20240530102126.357438-1-usamaarif642@xxxxxxxxx/

This sounds a bit artificial IMHO - as Usama pointed out above, I
think most same-filled pages are zero pages in real production
workloads. However, if you think there are real use cases with a lot
of non-zero same-filled pages, please let me know and I can fix this
real quick. We can support this in vswap with zero extra metadata
overhead: change the VSWAP_ZERO swap entry type to VSWAP_SAME_FILLED,
then use the backend field to store the fill value. I can send you a
patch if you're interested.

>
> I also ran the previous usemem matrix, which seems better than V3 but
> still pretty bad:
> Test: usemem --init-time -O -n 1 56G, 16G mem, 48G swap, avgs of 8 run.
> Before:
> Throughput (Sum): 528.98 MB/s Throughput (Mean): 526.113333 MB/s Free
> Latency: 3037932.888889
> After:
> Throughput (Sum): 453.74 MB/s Throughput (Mean): 454.875000 MB/s Free
> Latency: 5001144.500000 (~10% lower throughput, 64% higher free latency)
>
> I'm not sure why our results differ so much — perhaps different LRU
> settings, memory pressure ratios, or THP/mTHP configs? My exact
> config is in the attachment, which also includes the full log and
> info, with all debug options disabled to stay close to production. I
> ran it 8 times and just attached the first result log; they're all
> similar anyway, and my test framework reboots the machine after each
> test run to reduce any potential noise.

Ohh interesting - I see that you're testing with MGLRU. I can give that a try.

I'm not enabling THP/mTHP, but I don't see that you're enabling it
either - there's some 2MB swpout but that seems incidental.

Another difference is the swap backend:

1. Regarding the pmem backend - I'm not sure if I can get my hands on
one of these, but if you think an SSD has the same characteristics,
maybe I can give that a try? The problem with SSDs is that, for some
reason, variance tends to be pretty high - between iterations, yes,
but especially across reboots. Or maybe zram?

2. What about the other numbers below? Are they also on pmem? FTR I
was running most of my benchmarks on zswap, except for one kernel
build benchmark on SSD.

3. Any other backends and setup you're interested in?

BTW, it sounds like you have a great benchmark suite - is it open
source somewhere? If not, can you share it with us? :) Vswap aside, I
think this would be a good suite for every swap contributor to run on
all swap-related changes.

Once again, thank you so much for your engagement, Kairui. Very much
appreciated - I owe you a beverage of your choice whenever we meet.
And have a great rest of your day :)