Re: [PATCH v3 1/4] mm/zswap: Make shrink_worker writeback cursor per-memcg

From: Nhat Pham

Date: Mon Jun 08 2026 - 12:51:29 EST

On Mon, Jun 8, 2026 at 5:50 AM Hao Jia <jiahao.kernel@xxxxxxxxx> wrote:
> On 2026/6/5 01:23, Nhat Pham wrote:
> >
>
> Thanks for the suggestion!
>
> I ran some tests and found that neither the per-memcg cursor nor
> different batch sizes have a significant impact on proactive writeback
> performance. However, exactly as we suspected, without the per-memcg
> cursor, the writeback distribution among child memcgs is highly unfair.
>
> Test Setup:
>
> zswap config: 18G capacity, LZ4 compression.
> cgroup hierarchy: 1 parent test memcg with 10 child memcgs.
> Allocation: Allocated 1600MB of anonymous pages in each child memcg.
> To ensure compressibility, the first half of each page was filled with
> random data and the second half with zeros.
> Force to zswap: Ran echo "1600M" > memory.reclaim on each child memcg
> to squeeze all their memory into zswap.
> Trigger writeback: Ran echo "<size> zswap_writeback_only" >
> memory.reclaim on the parent cgroup 200 times, with a 2-second interval
> between each run.
> Metric: Monitored the zswpwb_proactive metric in memory.stat to
> observe the writeback volume.
> **Note**: The size here refers to the uncompressed memory size. Also,
> since the second-chance algorithm would cause many writebacks to fall
> short of the target size, I **bypassed** it during these tests to avoid
> interference.
>
> Without cursor (size: 1M, batch: 32)
> child wb_pages wb_MB share%
> child0 6368 24.88 12.50
> child1 6368 24.88 12.50
> child2 6368 24.88 12.50
> child3 6368 24.88 12.50
> child4 6368 24.88 12.50
> child5 6368 24.88 12.50
> child6 6368 24.88 12.50
> child7 6368 24.88 12.50
> child8 0 0.00 0.00
> child9 0 0.00 0.00
> Without cursor (size: 1M, batch: 128)
> child wb_pages wb_MB share%
> child0 25472 99.50 50.00
> child1 25472 99.50 50.00
> child2 0 0.00 0.00
> child3 0 0.00 0.00
> child4 0 0.00 0.00
> child5 0 0.00 0.00
> child6 0 0.00 0.00
> child7 0 0.00 0.00
> child8 0 0.00 0.00
> child9 0 0.00 0.00
> Without cursor (size: 6M, batch: 128)
> child wb_pages wb_MB share%
> child0 51200 200.00 16.67
> child1 51200 200.00 16.67
> child2 25600 100.00 8.33
> child3 25600 100.00 8.33
> child4 25600 100.00 8.33
> child5 25600 100.00 8.33
> child6 25600 100.00 8.33
> child7 25600 100.00 8.33
> child8 25600 100.00 8.33
> child9 25600 100.00 8.33
>
>
> With cursor (size: 1M, batch: 32)
> child wb_pages wb_MB share%
> child0 5120 20.00 10.00
> child1 5120 20.00 10.00
> child2 5120 20.00 10.00
> child3 5120 20.00 10.00
> child4 5120 20.00 10.00
> child5 5120 20.00 10.00
> child6 5120 20.00 10.00
> child7 5120 20.00 10.00
> child8 5120 20.00 10.00
> child9 5120 20.00 10.00
> With cursor (size: 1M, batch: 128)
> child wb_pages wb_MB share%
> child0 5120 20.00 10.00
> child1 5120 20.00 10.00
> child2 5120 20.00 10.00
> child3 5120 20.00 10.00
> child4 5120 20.00 10.00
> child5 5120 20.00 10.00
> child6 5120 20.00 10.00
> child7 5120 20.00 10.00
> child8 5120 20.00 10.00
> child9 5120 20.00 10.00
>
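FWIW, the skew in the no-cursor tables looks like what you'd get if
every memory.reclaim call restarts the walk at the first child and
hands out one batch per child until the target is met. Below is a toy
model of that, not the kernel code: simulate() and its parameters are
made up for illustration, 4K pages are assumed, and the second-chance
logic is ignored (as in your runs). It lands on 5120 pages per child
for both cursor cases and on 51200/25600 for the 6M no-cursor case.
For the 1M no-cursor cases it gives the same shares, but slightly
higher counts than you measured (6400 vs 6368, 25600 vs 25472).

```python
PAGE_BYTES = 4096  # assumption: 4K pages


def simulate(n_children, target_bytes, batch, runs, per_memcg_cursor):
    """Toy model of batched round-robin writeback across child memcgs.

    Each run hands out up to `batch` pages per child, in order, until
    `target_bytes` worth of pages has been written back. With a
    per-memcg cursor the walk resumes where the previous run stopped;
    without one, every run starts again at child 0.
    """
    target_pages = target_bytes // PAGE_BYTES
    written = [0] * n_children
    cursor = 0
    for _ in range(runs):
        if not per_memcg_cursor:
            cursor = 0  # no saved position: restart at the first child
        remaining = target_pages
        while remaining > 0:
            n = min(batch, remaining)
            written[cursor] += n
            remaining -= n
            cursor = (cursor + 1) % n_children
    return written


MB = 1 << 20
for label, size, batch, cur in [
    ("no cursor, 1M, batch 32", 1 * MB, 32, False),
    ("no cursor, 6M, batch 128", 6 * MB, 128, False),
    ("cursor, 1M, batch 32", 1 * MB, 32, True),
    ("cursor, 1M, batch 128", 1 * MB, 128, True),
]:
    print(label, simulate(10, size, batch, 200, cur))
```

So in this model the number of children that ever get touched without
a cursor is just target_pages / batch, capped at the number of
children.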

Yeah OTOH, we don't really make fairness an API contract here. When
you set up a proactive reclaim scheme, if you decide to target a
cgroup (and not its children separately), everything underneath it is
fair game to the kernel in any split that we fancy. If you want true
fairness or a desired split, you have to treat them as independent
memory domains and set up proactive reclaim to hit each child cgroup
separately (i.e. one "echo > memory.reclaim" for each of them). This is
necessary for example if each child represents a separate, isolated
service/container/tenant. And maybe this is actually what you really
want - hit the ancestor cgroup very lightly for the stuff it owns, but
then dedicate most of the reclaim effort to the leaf cgroups
independently?
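
Something like the sketch below is what I have in mind for the
per-child setup. It is only an illustration: reclaim_each_child() and
the /sys/fs/cgroup/test path are hypothetical, and the
"zswap_writeback_only" argument is the one from this series, so it
will not work on a kernel without these patches.

```python
from pathlib import Path


def reclaim_each_child(parent, amount, extra_arg=""):
    """Issue one memory.reclaim request per child cgroup of `parent`.

    Returns the names of the children that were written to. The write
    may raise OSError, e.g. EAGAIN when the kernel reclaims less than
    the requested amount.
    """
    request = f"{amount} {extra_arg}".strip()
    touched = []
    for child in sorted(Path(parent).iterdir()):
        knob = child / "memory.reclaim"
        if child.is_dir() and knob.exists():
            knob.write_text(request + "\n")
            touched.append(child.name)
    return touched


if __name__ == "__main__":
    # Hypothetical parent cgroup; adjust to the real hierarchy.
    parent = Path("/sys/fs/cgroup/test")
    if parent.is_dir():
        print(reclaim_each_child(parent, "1M", "zswap_writeback_only"))
```

That way the split between children is whatever userspace asks for,
rather than whatever order the kernel happens to walk them in.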

But OTOH, this does seem like a recipe for inefficient reclaim. We
might exhaust hotter memory of a cgroup while sparing colder memory of
another cgroup... But maybe if they're all cold anyway, then who
cares, and eventually you'll get to the cold stuff of the other
children?

Yosry, what's the concern here? Is it space overhead, or overall code
complexity?