Re: [RFC PATCH 00/14] Virtual Swap Space

From: Johannes Weiner
Date: Tue Apr 08 2025 - 11:54:32 EST


On Tue, Apr 08, 2025 at 02:04:06PM +0100, Usama Arif wrote:
>
>
> On 08/04/2025 00:42, Nhat Pham wrote:
> >
> > V. Benchmarking
> >
> > As a proof of concept, I ran the prototype through some simple
> > benchmarks:
> >
> > 1. usemem: 16 threads, 2G each, memory.max = 16G
> >
> > I benchmarked the following usemem command:
> >
> > time usemem --init-time -w -O -s 10 -n 16 2g
> >
> > Baseline:
> > real: 33.96s
> > user: 25.31s
> > sys: 341.09s
> > average throughput: 111295.45 KB/s
> > average free time: 2079258.68 usecs
> >
> > New Design:
> > real: 35.87s
> > user: 25.15s
> > sys: 373.01s
> > average throughput: 106965.46 KB/s
> > average free time: 3192465.62 usecs
> >
> > To root cause this regression, I ran perf on the usemem program, as
> > well as on the following stress-ng program:
> >
> > perf record -ag -e cycles -G perf_cg -- ./stress-ng/stress-ng --pageswap $(nproc) --pageswap-ops 100000
> >
> > and observed the (predicted) increase in lock contention on swap cache
> > accesses. This regression is alleviated if I put together the
> > following hack: limit the virtual swap space to a sufficient size for
> > the benchmark, range partition the swap-related data structures (swap
> > cache, zswap tree, etc.) based on the limit, and distribute the
> > allocation of virtual swap slots among these partitions (on a per-CPU
> > basis):
> >
> > real: 34.94s
> > user: 25.28s
> > sys: 360.25s
> > average throughput: 108181.15 KB/s
> > average free time: 2680890.24 usecs
> >
> > As mentioned above, I will implement proper dynamic swap range
> > partitioning in follow-up work.
> >
> > 2. Kernel building: zswap enabled, 52 workers (one per processor),
> > memory.max = 3G.
> >
> > Baseline:
> > real: 183.55s
> > user: 5119.01s
> > sys: 655.16s
> >
> > New Design:
> > real: mean: 184.5s
> > user: mean: 5117.4s
> > sys: mean: 695.23s
> >
> > New Design (Static Partition):
> > real: 183.95s
> > user: 5119.29s
> > sys: 664.24s
> >
>
> Hi Nhat,
>
> Thanks for the patches! I have glanced over a couple of them, but this was the main question that came to my mind.
>
> Just wanted to check if you had a look at the memory regression during these benchmarks?
>
> Also what is sizeof(swp_desc)? Maybe we can calculate the memory overhead as sizeof(swp_desc) * swap size/PAGE_SIZE?
>
> For a 64G swap (16M slots) filled with private anon pages, the additional overhead might be roughly sizeof(swp_desc) * 16M bytes, minus what goes away: the zeromap (16M bits = 2MB) and the swap map (16M bytes = 16MB)?
>
> This looks like a sizeable memory regression?

One thing to keep in mind is that the swap descriptor is deliberately
kept explicit and unoptimized at this stage; many conversions and
optimizations have not been done yet. Some tradeoffs were made here in
favor of code reviewability, but I agree it makes it hard to see what
this would look like fully realized.

I think what's really missing is an analysis of what the end state is
supposed to look like, and what the overhead will be at that point.

The swapin path currently consults the swapcache, then the zeromap,
then zswap, and finally the backend. The external swap_cgroup array is
consulted to determine whom to charge for the new page.
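
In rough pseudocode (the helper names below are illustrative, not the
actual kernel functions), that lookup order is:

/* sketch of the current swapin resolution order, illustrative only */
folio = swap_cache_lookup(entry);		/* 1. swap cache */
if (!folio) {
	folio = alloc_folio_for_swapin();
	if (zeromap_test(entry))		/* 2. zeromap bit */
		folio_zero_fill(folio);
	else if (!zswap_load_into(entry, folio))	/* 3. zswap */
		swap_device_read(entry, folio);	/* 4. backing device */
	swap_cache_add(folio, entry);
}
memcg = swap_cgroup_lookup(entry);		/* whom to charge */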

With vswap, the descriptor is looked up and resolves to a type, a
location, cgroup ownership, and a refcount. This means it replaces the
swapcache, the zeromap, the cgroup map, and largely the swap_map.
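
To make Usama's sizeof(swp_desc) question concrete: a descriptor that
covers those roles could look roughly like the sketch below. The field
names and layout are my guesses for illustration, not what the series
actually implements.

/* illustrative guess at a consolidated descriptor, not the series' layout */
enum swp_desc_type {
	SWP_DESC_FOLIO,		/* page still in the swap cache */
	SWP_DESC_ZERO,		/* zero-filled, nothing stored */
	SWP_DESC_ZSWAP,		/* compressed copy in zswap */
	SWP_DESC_SWAPFILE,	/* slot on a physical swap device */
};

struct swp_desc {
	enum swp_desc_type type;	/* also covers the zeromap's job */
	union {
		struct folio *folio;		/* swapcache role */
		struct zswap_entry *zswap;	/* zswap tree role */
		swp_entry_t slot;		/* physical backend slot */
	};
	struct mem_cgroup *memcg;	/* replaces the swap_cgroup array */
	atomic_t refcount;		/* replaces swap counts/continuations */
};

For this particular sketch that's about 32 bytes per descriptor on
64-bit, before any packing tricks.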

Nhat was not quite sure yet whether the swap_map can shrink to a
single bit per entry, or needs two bits to also represent bad slots.
Either way, it's a large reduction in static swap space overhead, and
it eliminates the tricky swap count continuation code.
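
To put rough numbers on Usama's 64G example (16M slots with 4K pages),
here is a back-of-the-envelope comparison. The 32-byte descriptor size
is just the guess from the sketch above, and I'm assuming descriptors
are allocated per slot in use rather than up front:

/* back-of-the-envelope sizing for 64G of swap (16M slots) */
#include <stdio.h>

int main(void)
{
	unsigned long slots = 16UL << 20;	/* 64G / 4K */

	/* today: allocated up front for the whole swap area */
	printf("swap_map:     %lu MB\n", slots >> 20);		/* 1 byte/slot */
	printf("zeromap:      %lu MB\n", (slots / 8) >> 20);	/* 1 bit/slot */
	printf("swap_cgroup:  %lu MB\n", (slots * 2) >> 20);	/* 2 bytes/slot */

	/* vswap: 1-2 bits/slot up front ... */
	printf("new swap_map: %lu MB\n", (slots / 4) >> 20);	/* 2 bits/slot */
	/* ... plus one descriptor per slot in use (32 bytes is a guess) */
	printf("descriptors:  %lu MB at full occupancy\n", (slots * 32) >> 20);
	return 0;
}

So the statically allocated side clearly shrinks; the open question is
how small the per-entry descriptor gets once the conversions and
optimizations mentioned above are done.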