Re: [PATCH v3 1/4] mm/zswap: Make shrink_worker writeback cursor per-memcg

From: Yosry Ahmed

Date: Mon Jun 08 2026 - 12:51:19 EST

On Mon, Jun 8, 2026 at 9:23 AM Nhat Pham <nphamcs@xxxxxxxxx> wrote:
>
> On Mon, Jun 8, 2026 at 5:50 AM Hao Jia <jiahao.kernel@xxxxxxxxx> wrote:
> > On 2026/6/5 01:23, Nhat Pham wrote:
> > >
> >
> > Thanks for the suggestion!
> >
> > I ran some tests and found that neither the per-memcg cursor nor
> > different batch sizes have a significant impact on proactive writeback
> > performance. However, exactly as we suspected, without the per-memcg
> > cursor, the writeback distribution among child memcgs is highly unfair.
> >
> > Test Setup:
> >
> > zswap config: 18G capacity, LZ4 compression.
> > cgroup hierarchy: 1 parent test memcg with 10 child memcgs.
> > Allocation: Allocated 1600MB of anonymous pages in each child memcg.
> > To ensure compressibility, the first half of each page was filled with
> > random data and the second half with zeros.
> > Force to zswap: Ran echo "1600M" > memory.reclaim on each child memcg
> > to squeeze all their memory into zswap.
> > Trigger writeback: Ran echo "<size> zswap_writeback_only" >
> > memory.reclaim on the parent cgroup 200 times, with a 2-second interval
> > between each run.
> > Metric: Monitored the zswpwb_proactive metric in memory.stat to
> > observe the writeback volume.
> > **Note**: The size here refers to the uncompressed memory size. Also,
> > since the second-chance algorithm would cause many writebacks to fall
> > short of the target size, I **bypassed** it during these tests to avoid
> > interference.
> >
> > Without cursor (size: 1M, batch: 32)
> > child wb_pages wb_MB share%
> > child0 6368 24.88 12.50
> > child1 6368 24.88 12.50
> > child2 6368 24.88 12.50
> > child3 6368 24.88 12.50
> > child4 6368 24.88 12.50
> > child5 6368 24.88 12.50
> > child6 6368 24.88 12.50
> > child7 6368 24.88 12.50
> > child8 0 0.00 0.00
> > child9 0 0.00 0.00
> > Without cursor (size: 1M, batch: 128)
> > child wb_pages wb_MB share%
> > child0 25472 99.50 50.00
> > child1 25472 99.50 50.00
> > child2 0 0.00 0.00
> > child3 0 0.00 0.00
> > child4 0 0.00 0.00
> > child5 0 0.00 0.00
> > child6 0 0.00 0.00
> > child7 0 0.00 0.00
> > child8 0 0.00 0.00
> > child9 0 0.00 0.00
> > Without cursor (size: 6M, batch: 128)
> > child wb_pages wb_MB share%
> > child0 51200 200.00 16.67
> > child1 51200 200.00 16.67
> > child2 25600 100.00 8.33
> > child3 25600 100.00 8.33
> > child4 25600 100.00 8.33
> > child5 25600 100.00 8.33
> > child6 25600 100.00 8.33
> > child7 25600 100.00 8.33
> > child8 25600 100.00 8.33
> > child9 25600 100.00 8.33
> >
> >
> > With cursor (size: 1M, batch: 32)
> > child wb_pages wb_MB share%
> > child0 5120 20.00 10.00
> > child1 5120 20.00 10.00
> > child2 5120 20.00 10.00
> > child3 5120 20.00 10.00
> > child4 5120 20.00 10.00
> > child5 5120 20.00 10.00
> > child6 5120 20.00 10.00
> > child7 5120 20.00 10.00
> > child8 5120 20.00 10.00
> > child9 5120 20.00 10.00
> > With cursor (size: 1M, batch: 128)
> > child wb_pages wb_MB share%
> > child0 5120 20.00 10.00
> > child1 5120 20.00 10.00
> > child2 5120 20.00 10.00
> > child3 5120 20.00 10.00
> > child4 5120 20.00 10.00
> > child5 5120 20.00 10.00
> > child6 5120 20.00 10.00
> > child7 5120 20.00 10.00
> > child8 5120 20.00 10.00
> > child9 5120 20.00 10.00

Yes, the per-memcg cursor is more fair, and you can synthesize
scenarios that show that. However, I don't think this is a problem in
practice:

1. The unfairness is limited to the batch size per-invocation. If the
batch size is 128 pages (your highest one here), that's 0.5 MB (with 4K
pages on x86), which is fairly low. If the batch size is 32, it's only
128 KB.

2. Realistically, if you have a parent cgroup with >10G of
memory, you wouldn't be reclaiming in steps of 1M. If you want to
reclaim 200MB, why are you doing it over 200 invocations? If you do it
in a single one (or over a few retries) the shares should become much
more even.
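
To make the two points above concrete, here is a rough Python model of
the distribution. It is only a sketch under assumptions, not the actual
shrink code: `simulate()` and its parameters are made-up names, it
assumes each invocation hands out batch-sized chunks to the children
round-robin, and that without a cursor every invocation restarts at the
first child. It gives 6400 and 25600 pages where the tables above report
6368 and 25472, so it does not capture whatever causes that small
shortfall.

```python
PAGE_SIZE = 4096  # bytes; assumes 4K pages (x86)


def simulate(num_children, target_pages, batch_pages, invocations,
             per_memcg_cursor):
    """Return pages written back per child in a simplified round-robin model."""
    written = [0] * num_children
    cursor = 0  # only persists across invocations when per_memcg_cursor is True
    for _ in range(invocations):
        if not per_memcg_cursor:
            cursor = 0  # assumption: each invocation restarts at the first child
        remaining = target_pages
        while remaining > 0:
            chunk = min(batch_pages, remaining)
            written[cursor] += chunk
            remaining -= chunk
            cursor = (cursor + 1) % num_children
    return written


if __name__ == "__main__":
    one_mb = (1 << 20) // PAGE_SIZE  # 256 pages
    # 200 invocations of 1M vs. a single 200M invocation, both without a cursor.
    print(simulate(10, one_mb, 128, 200, per_memcg_cursor=False))
    print(simulate(10, 200 * one_mb, 128, 1, per_memcg_cursor=False))
    print(simulate(10, one_mb, 128, 200, per_memcg_cursor=True))
```

In this model the single 200M invocation without a cursor ends up
exactly even, which is the point of (2): the skew comes from restarting
at the first child 200 times, and the per-invocation skew is bounded by
one batch (128 * 4096 bytes = 0.5 MB).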

We're trying to fix a practical use case, not to find reasons why a
simple implementation won't work -- right?

More below (to Nhat's point).

>
> Yeah OTOH, we don't really make fairness an API contract here. When
> you set up a proactive reclaim scheme, if you decide to target a
> cgroup (and not its children separately), everything underneath it is
> fair game to the kernel in any split that we fancy. If you want true
> fairness or a desired split, you have to treat them as independent
> memory domains and set up proactive reclaim to hit each child cgroup
> separately (i.e. one "echo > memory.reclaim" for each of them). This is
> necessary for example if each child represents a separate, isolated
> service/container/tenant. And maybe this is actually what you really
> want - hit the ancestor cgroup very lightly for the stuff it owns, but
> then dedicate most of the reclaim effort at the leaf cgroups
> independently?

I would go a bit further and claim that ideally fairness shouldn't
even be a factor. If you invoke proactive reclaim on a parent cgroup
with 100MB, you want to reclaim the coldest 100MB in that parent, no
matter what child they reside in. If one child cgroup has 100% hot
memory and one child cgroup has 100% cold memory, ideally you'd
reclaim all the cold memory from the second child.

However, the implementation of the LRUs and the coldness tracking
doesn't allow for doing this, so we "fall back" to reclaiming in
batches from each child because we don't really know where the coldest
pages overall are. If that changes in the future (somehow), I argue
that the correct thing to do is reclaim the absolute coldest memory at
the parent level.
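
As a toy illustration of the difference (not kernel code): the sketch
below compares an idealized "reclaim the coldest pages overall" policy
with per-child batching. Everything in it is hypothetical, including the
function names and the per-page "idle" number, which stands in for
coldness information the kernel does not actually have at the parent
level.

```python
import heapq
from collections import Counter


def reclaim_coldest(children, target):
    """Idealized: take the `target` coldest pages across all children."""
    flat = [(idle, name) for name, idles in children.items() for idle in idles]
    coldest = heapq.nlargest(target, flat)  # larger idle value = colder page
    return Counter(name for _, name in coldest)


def reclaim_batched(children, target, batch):
    """Fallback: take up to `batch` pages from each child in turn."""
    remaining = {name: len(idles) for name, idles in children.items()}
    taken = Counter()
    left = target
    while left > 0 and any(remaining.values()):
        for name in children:
            chunk = min(batch, left, remaining[name])
            taken[name] += chunk
            remaining[name] -= chunk
            left -= chunk
            if left == 0:
                break
    return taken


if __name__ == "__main__":
    # One child is entirely hot, the other entirely cold (100 pages each).
    children = {"hot": [1] * 100, "cold": [1000] * 100}
    print(reclaim_coldest(children, 100))      # all 100 pages from "cold"
    print(reclaim_batched(children, 100, 10))  # 50 pages from each child
```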

If you want to reclaim evenly among the children, you can do that and
directly reclaim from the children.
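
Something along these lines would do it from userspace. This is a
minimal sketch: the cgroup path and child names are placeholders,
`plan_even_reclaim()` and `apply_plan()` are made-up helpers, and the
optional `suffix` is only there so the zswap_writeback_only argument
from the test setup in this thread could be appended.

```python
from pathlib import Path


def plan_even_reclaim(parent, children, total_bytes, suffix=""):
    """Split total_bytes evenly and return (memory.reclaim path, payload) pairs."""
    share = total_bytes // len(children)
    payload = f"{share} {suffix}".strip()
    return [(Path(parent) / child / "memory.reclaim", payload)
            for child in children]


def apply_plan(plan):
    """Write each payload; needs a cgroup v2 mount and sufficient privileges."""
    for path, payload in plan:
        # The write can fail if the kernel reclaims less than requested,
        # so a real tool would want to catch OSError and retry or move on.
        path.write_text(payload)


if __name__ == "__main__":
    kids = [f"child{i}" for i in range(10)]
    # Hypothetical parent path; 200MB split over 10 children is 20MB each.
    for path, payload in plan_even_reclaim("/sys/fs/cgroup/test", kids,
                                           200 * 1024 * 1024):
        print(path, payload)
```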

>
> But OTOH, this does seem like a recipe for inefficient reclaim. We
> might exhaust hotter memory of a cgroup while sparing colder memory of
> another cgroup... But maybe if they're all cold anyway, then who
> cares, and eventually you'll get to the cold stuff of other child?
>
> Yosry, what's the concern here? Is it space overhead, or overall code
> complexity?

Mostly the complexity (e.g. the zombie memcg cleanup), and to a lesser
extent the unnecessary space (8 bytes is not a lot, but these things add up).