Re: [PATCH 0/8] mm/zswap, zsmalloc: Per-memcg-lruvec zswap accounting

From: Nhat Pham

Date: Tue Mar 03 2026 - 13:06:20 EST


On Tue, Mar 3, 2026 at 9:51 AM Joshua Hahn <joshua.hahnjy@xxxxxxxxx> wrote:
>
> On Mon, 2 Mar 2026 13:31:32 -0800 Nhat Pham <nphamcs@xxxxxxxxx> wrote:
>
> > On Thu, Feb 26, 2026 at 11:29 AM Joshua Hahn <joshua.hahnjy@xxxxxxxxx> wrote:
>
> [...snip...]
>
> > > Introduce a new per-zpdesc array of objcg pointers to track
> > > per-memcg-lruvec memory usage by zswap, while leaving zram users
> > > unaffected.
>
> [...snip...]
>
> Hi Nhat! I hope you are doing well :-) Thank you for taking a look!
>
> > I might have missed it and this might be in one of the latter patches,
> > but could also add some quick and dirty benchmark for zswap to ensure
> > there's no or minimal performance implications? IIUC there is a small
> > amount of extra overhead in certain steps, because we have to go
> > through zsmalloc to query objcg. Usemem or kernel build should suffice
> > IMHO.
>
> Yup, this was one of my concerns too. I tried to do a somewhat comprehensive
> analysis below; hopefully it can show a good picture of what's happening.
> Spoilers: there don't seem to be any significant regressions (< 1%),
> and any regressions are within a small fraction of the standard deviation.
>
> One thing that I have noticed is that there is a tangible reduction in
> standard deviation for some of these benchmarks. I can't exactly pinpoint
> why this is happening, but I'll take it as a win :p
>
> > To be clear, I don't anticipate any observable performance change, but
> > it's a good sanity check :) Besides, can't be too careful with stress
> > testing stuff :P
>
> For sure. I should have done these and included them in the original RFC,
> but I think I might have been too eager to get the RFC out :-)
> Will include them in the second version of the series!
>
> All the experiments below were done on a 2-node NUMA system. The data is quite
> compressible, which I think makes sense for measuring the overhead of accounting.
>
> Benchmark 1
> Allocating 2G memory to one node with 1G memory.high. Average across 10 trials
> +-------------------------+---------+----------+
> | | average | stddev |
> +-------------------------+---------+----------+
> | Baseline (11439c4635ed) | 8887.82 | 362.40 |
> | Baseline + Series | 8944.16 | 356.45 |
> +-------------------------+---------+----------+
> | Delta | +0.634% | -1.642% |
> +-------------------------+---------+----------+
>
> Benchmark 2
> Allocating 2G memory to one node with 1G memory.high, churn 5x through the
> memory. Average across 5 trials.
> +-------------------------+----------+----------+
> | | average | stddev |
> +-------------------------+----------+----------+
> | Baseline (11439c4635ed) | 31152.96 | 166.23 |
> | Baseline + Series | 31355.28 | 64.86 |
> +-------------------------+----------+----------+
> | Delta | +0.649% | -60.981% |
> +-------------------------+----------+----------+
>
> Benchmark 3
> Allocating 2G memory to one node with 1G memory.high, split across 2 nodes.
> Average across 5 trials.
> +-------------------------+---------+----------+
> |                         | average | stddev   |
> +-------------------------+---------+----------+
> | Baseline (11439c4635ed) | 16101.6 | 174.18   |
> | Baseline + Series       | 16022.4 | 117.17   |
> +-------------------------+---------+----------+
> | Delta                   | -0.492% | -32.731% |
> +-------------------------+---------+----------+
>
> Benchmark 4
> Reading stat files 10000 times under memory pressure
>
> memory.stat
> +-------------------------+---------+----------+
> |                         | average | stddev   |
> +-------------------------+---------+----------+
> | Baseline (11439c4635ed) | 24524.4 | 501.7    |
> | Baseline + Series       | 24807.2 | 444.53   |
> +-------------------------+---------+----------+
> | Delta                   | +1.153% | -11.395% |
> +-------------------------+---------+----------+
>
> memory.numa_stat
> +-------------------------+---------+----------+
> |                         | average | stddev   |
> +-------------------------+---------+----------+
> | Baseline (11439c4635ed) | 24807.2 | 444.53   |
> | Baseline + Series       | 23837.6 | 521.68   |
> +-------------------------+---------+----------+
> | Delta                   | -3.905% | +17.355% |
> +-------------------------+---------+----------+
>
> proc/vmstat
> +-------------------------+---------+----------+
> |                         | average | stddev   |
> +-------------------------+---------+----------+
> | Baseline (11439c4635ed) | 24793.6 | 285.26   |
> | Baseline + Series       | 23815.6 | 553.44   |
> +-------------------------+---------+----------+
> | Delta                   | -3.945% | +94.012% |
> +-------------------------+---------+----------+
>
> ^^^ Some big increase in standard deviation here, although there is some
> decrease in the average time. Probably the most notable change that I've seen
> from this patch.
>
> node0/vmstat
> +-------------------------+---------+----------+
> |                         | average | stddev   |
> +-------------------------+---------+----------+
> | Baseline (11439c4635ed) | 24541.4 | 281.41   |
> | Baseline + Series       | 24479   | 241.29   |
> +-------------------------+---------+----------+
> | Delta                   | -0.254% | -14.257% |
> +-------------------------+---------+----------+
>
> Lots of testing results: the averages look mostly negligible, but there are
> some non-negligible changes in standard deviation going in both directions.
> I don't see anything too concerning off the top of my head, but for the
> next version I'll try to do some more testing across different machines
> as well (I don't have any machines with > 2 nodes, but maybe I can run
> some tests on QEMU just to sanity check).
>
> Thanks again, Nhat. Have a great day!
> Joshua

Sounds like any meagre performance difference is smaller than the noise :P
If it's this negligible on these microbenchmarks, it'll be infinitesimal
in production workloads, where these operations are a very small part of
the overall work.

Kinda makes sense, because objcg access only happens in a small subset of
operations: zswap entry store and zswap entry free, each of which can only
happen once per zswap entry.

I think we're fine, but I'll let other reviewers comment on it as well.
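For anyone wanting to reproduce the stat-read numbers, benchmark 4 above can be approximated with a short harness along these lines (a sketch, not the exact script used; the cgroup path is an assumption, and any readable file works for a dry run):

```python
import time

def time_stat_reads(path, iterations=10000):
    """Read a stat file `iterations` times, returning elapsed milliseconds.

    In the real benchmark, `path` would be something like
    /sys/fs/cgroup/<cgroup>/memory.stat (hypothetical path), read while a
    workload keeps the cgroup under memory pressure; any readable file
    works for a dry run.
    """
    start = time.perf_counter()
    for _ in range(iterations):
        with open(path, "rb") as f:
            f.read()  # one full read per iteration, like `cat`
    return (time.perf_counter() - start) * 1000.0
```

Comparing the returned times for the baseline and patched kernels under the same pressure workload gives figures of the same shape as the tables above.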