Re: [PATCH v3 1/4] mm/zswap: Make shrink_worker writeback cursor per-memcg

From: Hao Jia

Date: Mon Jun 08 2026 - 08:50:38 EST

On 2026/6/5 01:23, Nhat Pham wrote:

On Thu, Jun 4, 2026 at 6:06 AM Hao Jia <jiahao.kernel@xxxxxxxxx> wrote:

On 2026/6/4 13:34, Yosry Ahmed wrote:

For instance, suppose a parent memcg has two children, memcg1 and memcg2,
each with 200MB of zswap (100MB inactive). Triggering proactive writeback on
the parent memcg will exhaust memcg1's inactive zswap pages. After that,
even though memcg2 still has plenty of inactive zswap pages, it will
continue to write back memcg1's active zswap pages. Writing back active
zswap pages causes the user-space agent to prematurely abort the writeback
because it detects that certain memcg metrics have exceeded predefined
thresholds.

This will only happen if the reclaim size is smaller than the batch
size, right? Otherwise the kernel should reclaim more or less equally
from both memcgs?

I gave it some thought. Not using a cursor could lead to unfairness
issues with certain writeback sizes:

- If the writeback size is an odd multiple of WB_BATCH (e.g.,
triggering a writeback of 3 * WB_BATCH), with 2 child cgroups, the
writeback ratio might end up being 2:1.
- If a memcg has 5 child cgroups and a writeback of 2 * WB_BATCH is
triggered, it might repeatedly write back from only the first 2 child
cgroups.

Although setting a smaller WB_BATCH might mitigate this unfairness, it
could hurt writeback efficiency. Let's just use per-memcg cursors to
completely fix these corner cases.
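The two corner cases above can be reproduced with a toy model. This is only a sketch under stated assumptions, not the kernel code: it assumes that each invocation restarts at the first child and hands out one batch per child in round-robin order until the target is met. The function name and the WB_BATCH value of 32 pages are hypothetical.

```python
def writeback_without_cursor(num_children, target_pages, batch_pages):
    """Toy model (assumption): every invocation restarts at child 0 and
    writes one batch per child, round-robin, until the target is met."""
    written = [0] * num_children
    remaining = target_pages
    child = 0
    while remaining > 0:
        chunk = min(batch_pages, remaining)
        written[child] += chunk
        remaining -= chunk
        child = (child + 1) % num_children
    return written


WB_BATCH = 32  # hypothetical batch size, in pages

# 2 children, 3 * WB_BATCH: the first child gets twice as much (2:1).
print(writeback_without_cursor(2, 3 * WB_BATCH, WB_BATCH))  # [64, 32]

# 5 children, 2 * WB_BATCH: only the first 2 children are ever touched.
print(writeback_without_cursor(5, 2 * WB_BATCH, WB_BATCH))  # [32, 32, 0, 0, 0]
```

Because nothing in this model remembers where the previous invocation stopped, repeating the same request only repeats the same skew.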

Exactly, the batch size should be small enough that any unfairness is
not a problem. I would honestly just do batching without a per-memcg
cursor, unless we have numbers to prove that the efficiency is
affected when we use a small batch size. Let's only introduce
complexity when needed please.

I'm impartial towards the complexity of per-memcg cursor. I don't
think it's that big of a deal, but only if it's warranted.

Hao, if you're convinced that doing small batch is not efficient,
could you run some experiments to show the improvement from bigger batching
and fairness? Maybe implement a small batch, no-memcg cursor first.
Then implement a patch on top of it to add per-memcg cursor, and show
how much performance win we can get from that patch on top of the
patch series?

Thanks for the suggestion!

I ran some tests and found that neither the per-memcg cursor nor the batch size has a significant impact on proactive writeback performance. However, exactly as we suspected, without the per-memcg cursor the writeback distribution among child memcgs is highly unfair.

Test Setup:

zswap config: 18G capacity, LZ4 compression.
cgroup hierarchy: 1 parent test memcg with 10 child memcgs.
Allocation: Allocated 1600MB of anonymous pages in each child memcg. To ensure compressibility, the first half of each page was filled with random data and the second half with zeros.
Force to zswap: Ran echo "1600M" > memory.reclaim on each child memcg to squeeze all their memory into zswap.
Trigger writeback: Ran echo "<size> zswap_writeback_only" > memory.reclaim on the parent cgroup 200 times, with a 2-second interval between each run.
Metric: Monitored the zswpwb_proactive metric in memory.stat to observe the writeback volume.
Note: The size here refers to the uncompressed memory size. Also, since the second-chance algorithm would cause many writebacks to fall short of the target size, I bypassed it during these tests to avoid interference.

Without cursor (size: 1M, batch: 32)
child     wb_pages    wb_MB   share%
child0        6368    24.88    12.50
child1        6368    24.88    12.50
child2        6368    24.88    12.50
child3        6368    24.88    12.50
child4        6368    24.88    12.50
child5        6368    24.88    12.50
child6        6368    24.88    12.50
child7        6368    24.88    12.50
child8           0     0.00     0.00
child9           0     0.00     0.00

Without cursor (size: 1M, batch: 128)
child     wb_pages    wb_MB   share%
child0       25472    99.50    50.00
child1       25472    99.50    50.00
child2           0     0.00     0.00
child3           0     0.00     0.00
child4           0     0.00     0.00
child5           0     0.00     0.00
child6           0     0.00     0.00
child7           0     0.00     0.00
child8           0     0.00     0.00
child9           0     0.00     0.00

Without cursor (size: 6M, batch: 128)
child     wb_pages    wb_MB   share%
child0       51200   200.00    16.67
child1       51200   200.00    16.67
child2       25600   100.00     8.33
child3       25600   100.00     8.33
child4       25600   100.00     8.33
child5       25600   100.00     8.33
child6       25600   100.00     8.33
child7       25600   100.00     8.33
child8       25600   100.00     8.33
child9       25600   100.00     8.33

With cursor (size: 1M, batch: 32)
child     wb_pages    wb_MB   share%
child0        5120    20.00    10.00
child1        5120    20.00    10.00
child2        5120    20.00    10.00
child3        5120    20.00    10.00
child4        5120    20.00    10.00
child5        5120    20.00    10.00
child6        5120    20.00    10.00
child7        5120    20.00    10.00
child8        5120    20.00    10.00
child9        5120    20.00    10.00

With cursor (size: 1M, batch: 128)
child     wb_pages    wb_MB   share%
child0        5120    20.00    10.00
child1        5120    20.00    10.00
child2        5120    20.00    10.00
child3        5120    20.00    10.00
child4        5120    20.00    10.00
child5        5120    20.00    10.00
child6        5120    20.00    10.00
child7        5120    20.00    10.00
child8        5120    20.00    10.00
child9        5120    20.00    10.00
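
For reference, the shape of these distributions can be approximated with a small model. This is a rough sketch, not the patch itself. It assumes 4 KiB pages (consistent with the wb_pages and wb_MB columns), one batch per child in round-robin order, and a cursor that simply remembers the next child across invocations. The names simulate() and report() are made up for illustration.

```python
PAGE_KIB = 4  # assumption: 4 KiB pages, consistent with wb_pages vs wb_MB


def simulate(num_children, size_mib, batch_pages, runs, use_cursor):
    """Toy model of repeated proactive writeback on the parent memcg.

    Without a cursor every run restarts at child 0; with a cursor the
    next run resumes at the child after the last one written back.
    """
    written = [0] * num_children
    cursor = 0
    for _ in range(runs):
        remaining = size_mib * 1024 // PAGE_KIB  # target size in pages
        child = cursor if use_cursor else 0
        while remaining > 0:
            chunk = min(batch_pages, remaining)
            written[child] += chunk
            remaining -= chunk
            child = (child + 1) % num_children
        cursor = child
    return written


def report(written):
    """Print rows in the same column order as the tables above."""
    total = sum(written)
    for i, pages in enumerate(written):
        mb = pages * PAGE_KIB / 1024
        share = 100 * pages / total if total else 0.0
        print(f"child{i} {pages} {mb:.2f} {share:.2f}")


report(simulate(10, 6, 128, 200, use_cursor=False))
report(simulate(10, 1, 32, 200, use_cursor=True))
```

Under these assumptions the model gives 51200/25600 pages for the 6M no-cursor case and 5120 pages per child with the cursor, matching the measurements. For the 1M no-cursor cases it gives 6400 and 25600 pages per active child, while the measured values are 6368 (199 * 32) and 25472 (199 * 128), so the model is only an approximation there.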

Thanks,
Hao