Re: [PATCH v3 1/4] mm/zswap: Make shrink_worker writeback cursor per-memcg

From: Hao Jia

Date: Tue Jun 02 2026 - 07:40:15 EST

On 2026/6/2 08:31, Yosry Ahmed wrote:

On Mon, Jun 01, 2026 at 07:07:45PM +0800, Hao Jia wrote:

On 2026/5/30 09:24, Yosry Ahmed wrote:

On Tue, May 26, 2026 at 07:45:58PM +0800, Hao Jia wrote:

From: Hao Jia <jiahao1@xxxxxxxxxxx>

The zswap background writeback worker shrink_worker() uses a global
cursor zswap_next_shrink, protected by zswap_shrink_lock, to round-robin
across the online memcgs under root_mem_cgroup.

Proactive writeback also wants a similar per-memcg cursor that is
scoped to the specified memcg, so that repeated invocations against
the same memcg make forward progress across its descendant memcgs
instead of restarting from the first child memcg each time.

Is this a problem in practice?

Is the concern the overhead of scanning memcgs repeatedly, or lack of
fairness? I wonder if we should just do writeback in batches from all
memcgs, similar to how reclaim does it, then evaluate at the end if we
need to start over?

Not using a per-cgroup cursor will cause issues for "repeated small-budget
calls" cases. For example, repeatedly triggering a 2MB writeback might
result in only writing back pages from the first few child memcgs every
time. In the worst-case scenario (where the writeback amount is less than
WB_BATCH), it might only ever write back from the first child memcg.

Right, so a fairness concern?

I wonder if we should just reclaim a batch from each memcg, then check
if we reached the goal, otherwise start over. If the batch size is small
enough that should work?

Even with a small batch size, for small writeback requests triggered by user-space (e.g., 2MB, which is batch size * N), it might still repeatedly write back from only the first N child memcgs. This could cause the user-space agent to prematurely give up on zswap writeback.

Similar to how memory reclaim uses mem_cgroup_iter() (via struct
mem_cgroup_reclaim_iter) and the old shrink_worker() used zswap_next_shrink,
we need a shared cursor here.

Right, I understand that in theory we need a cursor. I am just wondering
if the complexity is justified in practice. Reclaim is a much larger
beast than zswap writeback. I wonder if we can just get away with
scanning a batch from each child memcg -- for per-memcg reclaim, not
global.

We can always improve it later with a cursor if there's an actual need.

Naturally, group the cursor and its protecting spinlock into a
zswap_wb_iter struct, and make it a member of struct mem_cgroup to
realize per-memcg cursor management. Accordingly, shrink_worker() now
uses the lock and cursor in root_mem_cgroup->zswap_wb_iter.

If we really need to have per-memcg cursors (I am not a big fan), I
think we can minimize the overhead by making the cursor updates use
atomic cmpxchg instead of having a per-memcg lock.

Because mem_cgroup_iter() always calls css_put(&prev->css), we cannot simply
update zswap_wb_iter.pos via cmpxchg() after calling it. Doing so could lead
to a double css_put() issue on prev->css.

Therefore, if we switch to the cmpxchg() approach, we wouldn't be able to
reuse the existing mem_cgroup_iter() logic. We would have to write a new
function similar to cgroup_iter(), and its implementation might end up
looking a bit obscure/complex.

What if we do something like this (for the global cursor):

do {
memcg = xchg(zswap_next_shrink, NULL);
memcg = mem_cgroup_iter(NULL, memcg, NULL);
/* If the cursor was advanced from under us, try again */
if (!try_cmpxchg(zswap_next_shrink, NULL, memcg))
continue;
} while (..);

Regarding the code above, IIRC, both the global and per-cgroup cursors suffer from race conditions. This race can cause mem_cgroup_iter(NULL, NULL, NULL) to return the root memcg or its descendants, leading zswap to write back pages from the wrong memcg.

Additionally, since mem_cgroup_iter() puts the prev memcg ref and gets the next memcg ref, a try_cmpxchg() failure on CPU1 might also lead to a ref leak for memcg1.

CPU1 CPU2
memcg1 = xchg(pos, NULL)
memcg2 = xchg(pos, NULL) memcg2 = NULL;

memcg1 = mem_cgroup_iter()
mem_cgroup_iter(NULL, **NULL**, NULL) error memcg
try_cmpxchg(pos，NULL，memcg2） succeed
try_cmpxchg(pos，NULL，memcg1） **fail**

I took a stab at implementing a cmpxchg()-based zswap_mem_cgroup_iter() modeled after mem_cgroup_iter(), and it actually doesn't look that complex after all :)

Of course, as Nhat mentioned, we definitely need to add plenty of comments for this function.

static struct mem_cgroup *zswap_mem_cgroup_iter(struct mem_cgroup *root)
{
struct cgroup_subsys_state *css;
struct mem_cgroup *pos, *next;

if (mem_cgroup_disabled())
return NULL;
if (!root)
root = root_mem_cgroup;

rcu_read_lock();
restart:
pos = READ_ONCE(root->zswap_wb_iter.pos);
css = pos ? &pos->css : NULL;
next = NULL;

while ((css = css_next_descendant_pre(css, &root->css))) {
if (css_tryget_online(css))
break;
}
next = css ? mem_cgroup_from_css(css) : NULL;

if (cmpxchg(&root->zswap_wb_iter.pos, pos, next) != pos) {
if (next)
css_put(&next->css);
goto restart;
}
rcu_read_unlock();

return next;
}

There is a window where a racing shrinker will see the cursor as NULL
and start over, but that should be fine. We can generalize this for the
per-memcg cursor.

That being said..

Currently, this lock is only used in shrink_memcg(), proactive writeback,
and mem_cgroup_css_offline(). Note that shrink_memcg() only acquires the
lock of the root cgroup, and mem_cgroup_css_offline() is unlikely to be a
hot path.

..this made me realize it's probably fine to just use a global lock for
now?

IIUC the only additional contention to the existing lock will be from
userspace proactive writeback, and that shouldn't be a big deal
especially with the critical section being short?

In the current patch implementation, this lock protects the cgroup's own cursor variable. During each writeback, we only acquire the spin_lock of the target cgroup itself; we do not attempt to **spin on any child cgroup's lock while iterating through the descendants**.

Specifically:

- shrink_memcg() will only attempt to acquire the root cgroup's lock throughout the entire process.
- Proactive writeback will only acquire the lock of the target cgroup **itself**.
- Only mem_cgroup_css_offline() might attempt to hold locks of other cgroups, but normally, this shouldn't be a hot path.

Therefore, even if proactive writebacks are triggered concurrently on a parent cgroup and its child cgroup, there will be **no** lock contention at all (specifically referring to zswap_wb_iter.lock).

Lock contention would only occur if user-space **concurrently** triggers proactive writeback on the exact **same** cgroup. And IIRC, in such a scenario, the bottleneck is more likely to be on other locks anyway.

So, should we keep the spin_lock or go with the cmpxchg() approach?
Yosry and Nhat, what are your thoughts on this?

I think we should experiment with the global lock first. See if you
observe any regressions with workloads that put a lot of pressure on the
lock (a lot of threads in reclaim doing writeback + a few userspace
threads doing proactive writeback). See if the userspace threads
actually cause a meaningful regression.

Sorry, it seems there are some implementation issues with the global lock approach.

In practice, our user-space agent mostly operates in the following two scenarios:
- Triggering proactive writeback on the same cgroup at different times (sequentially).
- Triggering proactive writeback on different cgroups at the same time (concurrently).

In both cases, there is no lock contention. So, the current lock works perfectly fine for us.

However, if we really hate zswap_wb_iter.lock, I can try replacing it with the cmpxchg() approach.

Thanks,
Hao