Re: [RFC PATCH 00/14] Virtual Swap Space

From: Nhat Pham
Date: Tue Apr 08 2025 - 11:26:41 EST


On Tue, Apr 8, 2025 at 6:04 AM Usama Arif <usamaarif642@xxxxxxxxx> wrote:
>
>
>
> On 08/04/2025 00:42, Nhat Pham wrote:
> >
> > V. Benchmarking
> >
> > As a proof of concept, I ran the prototype through some simple
> > benchmarks:
> >
> > 1. usemem: 16 threads, 2G each, memory.max = 16G
> >
> > I benchmarked the following usemem command:
> >
> > time usemem --init-time -w -O -s 10 -n 16 2g
> >
> > Baseline:
> > real: 33.96s
> > user: 25.31s
> > sys: 341.09s
> > average throughput: 111295.45 KB/s
> > average free time: 2079258.68 usecs
> >
> > New Design:
> > real: 35.87s
> > user: 25.15s
> > sys: 373.01s
> > average throughput: 106965.46 KB/s
> > average free time: 3192465.62 usecs
> >
> > To root cause this regression, I ran perf on the usemem program, as
> > well as on the following stress-ng program:
> >
> > perf record -ag -e cycles -G perf_cg -- ./stress-ng/stress-ng --pageswap $(nproc) --pageswap-ops 100000
> >
> > and observed the (predicted) increase in lock contention on swap cache
> > accesses. This regression is alleviated if I put together the
> > following hack: limit the virtual swap space to a sufficient size for
> > the benchmark, range partition the swap-related data structures (swap
> > cache, zswap tree, etc.) based on the limit, and distribute the
> > allocation of virtual swap slots among these partitions (on a per-CPU
> > basis):
> >
> > real: 34.94s
> > user: 25.28s
> > sys: 360.25s
> > average throughput: 108181.15 KB/s
> > average free time: 2680890.24 usecs
> >
> > As mentioned above, I will implement proper dynamic swap range
> > partitioning in follow-up work.
> >
> > 2. Kernel building: zswap enabled, 52 workers (one per processor),
> > memory.max = 3G.
> >
> > Baseline:
> > real: 183.55s
> > user: 5119.01s
> > sys: 655.16s
> >
> > New Design:
> > real: 184.5s (mean)
> > user: 5117.4s (mean)
> > sys: 695.23s (mean)
> >
> > New Design (Static Partition):
> > real: 183.95s
> > user: 5119.29s
> > sys: 664.24s
> >
>
> Hi Nhat,
>
> Thanks for the patches! I have glanced over a couple of them, but this was the main question that came to my mind.
>
> Just wanted to check if you had a look at the memory regression during these benchmarks?
>
> Also what is sizeof(swp_desc)? Maybe we can calculate the memory overhead as sizeof(swp_desc) * swap size/PAGE_SIZE?

Yeah, it's pretty big right now (120 bytes). I haven't done any space
optimization yet - I basically listed out all the required
information and added one field for each piece. A couple of
optimizations I have in mind:
1. Merge swap_count and in_swapcache (suggested by Yosry).
2. Unionize the rcu field with other fields, because the rcu head is
only needed on the free path (suggested by Shakeel for a different
context, but it should be applicable here). Or maybe just remove it
and free the swap descriptors in-context.
3. The type field really only needs 2 bits - it might be possible to
squeeze it into one of the other fields as well.
4. The lock field might not be needed. I think the in_swapcache bit
already acts as a "backing storage pinning" mechanism, which should
give pinners exclusive access to the backing state.

etc. etc.

The code will get uglier though, so I wanna at least send out one
version with everything kept separate for clarity's sake, before
optimizing them away :)
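
Just to illustrate the direction (this is *not* the layout in this
series - the field names/members below are made up), a packed
descriptor could look something along the lines of:

struct swp_desc {
	union {
		swp_entry_t slot;		/* on a physical swapfile */
		struct zswap_entry *zswap;	/* compressed in zswap */
		/* the rcu head is only live once the desc is being freed */
		struct rcu_head rcu;
	};
	/*
	 * 2 bits of type (which backing state is live), 1 in_swapcache
	 * bit (doubling as the pin), and the swap count, all packed in
	 * one word and updated atomically instead of taking a lock.
	 */
	atomic_t state;
};

which would be in the ballpark of 24 bytes on 64-bit instead of 120,
at the cost of uglier accessors.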

>
> For a 64G swap that is filled with private anon pages, the overhead in MB might be (sizeof(swp_desc) in bytes * 16M) - 16M (zerobitmap) - 16M*8 (swap map)?

That is true. I will note, however, that in the past the overhead was
static (i.e., it was incurred no matter how much of the swap space you
were actually using). In fact, you often have to overprovision swap, so
the overhead goes beyond what you will (ever) need.

Now the overhead is (mostly) dynamic - only incurred on demand, and
reduced when you don't need it.
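
To put a rough number on the worst case (assuming the current,
unoptimized 120-byte descriptor and 4K pages):

  64G of swap / 4K  = 16M slots
  16M * 120 bytes  ~= 1.9G of descriptors

but that is only reached when every slot is actually occupied, and it
shrinks again as entries are freed.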


>
> This looks like a sizeable memory regression?
>
> Thanks,
> Usama
>