Re: [PATCH v3 1/4] mm/zswap: Make shrink_worker writeback cursor per-memcg

From: Hao Jia

Date: Tue Jun 02 2026 - 07:46:18 EST

On 2026/6/2 01:08, Nhat Pham wrote:

On Mon, Jun 1, 2026 at 4:07 AM Hao Jia <jiahao.kernel@xxxxxxxxx> wrote:

On 2026/5/30 09:24, Yosry Ahmed wrote:

On Tue, May 26, 2026 at 07:45:58PM +0800, Hao Jia wrote:

From: Hao Jia <jiahao1@xxxxxxxxxxx>

The zswap background writeback worker shrink_worker() uses a global
cursor zswap_next_shrink, protected by zswap_shrink_lock, to round-robin
across the online memcgs under root_mem_cgroup.

Proactive writeback also wants a similar per-memcg cursor that is
scoped to the specified memcg, so that repeated invocations against
the same memcg make forward progress across its descendant memcgs
instead of restarting from the first child memcg each time.

Is this a problem in practice?

Is the concern the overhead of scanning memcgs repeatedly, or lack of
fairness? I wonder if we should just do writeback in batches from all
memcgs, similar to how reclaim does it, then evaluate at the end if we
need to start over?

Not using a per-cgroup cursor will cause issues for "repeated
small-budget calls" cases. For example, repeatedly triggering a 2MB
writeback might result in only writing back pages from the first few
child memcgs every time. In the worst-case scenario (where the writeback
amount is less than WB_BATCH), it might only ever write back from the
first child memcg.

Similar to how memory reclaim uses mem_cgroup_iter() (via struct
mem_cgroup_reclaim_iter) and the old shrink_worker() used
zswap_next_shrink, we need a shared cursor here.

I think each proactive reclaim invocation just walk the entire subtree
for page reclaim right (see shrink_node_memcgs())? Would that be
acceptable for you?

Our current approach is very similar to how proactive memory reclaim works in shrink_node_memcgs().
shrink_node_memcgs() first calls memcg = mem_cgroup_iter(target_memcg, NULL, partial);. By doing this, it uses target_memcg->nodeinfo[nid]->iter->position to retrieve the child memcg where the last reclaim left off, and then resumes the iteration.

The catch is that zswap can't just reuse memcg->nodeinfo[nid]->iter->position, as that would mess up the cursor used by the memory reclaim.

I also wonder if we can at least make this structure dynamically
allocated... In a system, you only really invoke proactive reclaim
against a few target cgroups, no?

It is possible to allocate it dynamically, but I am concerned that it might introduce a slight performance overhead. We would need to add a **check** like if (READ_ONCE(memcg->zswap_wb_iter)) every time zswap_mem_cgroup_iter() is called. Furthermore, to handle concurrent allocations, we might also need to introduce cmpxchg() to resolve race conditions.

The additional code would look something like this:

static struct zswap_wb_iter *get_zswap_wb_iter(struct mem_cgroup *memcg)
{
struct zswap_wb_iter *iter, *new_iter;

iter = READ_ONCE(memcg->zswap_wb_iter);
if (likely(iter))
return iter;

new_iter = kzalloc(sizeof(*new_iter), GFP_KERNEL);
if (!new_iter)
return NULL;
spin_lock_init(&new_iter->lock);

if (cmpxchg(&memcg->zswap_wb_iter, NULL, new_iter) != NULL) {
/* Lost the race, someone else installed first. */
kfree(new_iter);
}
return READ_ONCE(memcg->zswap_wb_iter);
}

Thanks,
Hao