Re: [PATCH v3 1/4] mm/zswap: Make shrink_worker writeback cursor per-memcg

From: Nhat Pham

Date: Mon Jun 01 2026 - 12:51:01 EST

On Mon, Jun 1, 2026 at 4:07 AM Hao Jia <jiahao.kernel@xxxxxxxxx> wrote:
>
>
>
> On 2026/5/30 09:24, Yosry Ahmed wrote:
> > On Tue, May 26, 2026 at 07:45:58PM +0800, Hao Jia wrote:
> >> From: Hao Jia <jiahao1@xxxxxxxxxxx>
> >>
> >> The zswap background writeback worker shrink_worker() uses a global
> >> cursor zswap_next_shrink, protected by zswap_shrink_lock, to round-robin
> >> across the online memcgs under root_mem_cgroup.
> >>
> >> Proactive writeback also wants a similar per-memcg cursor that is
> >> scoped to the specified memcg, so that repeated invocations against
> >> the same memcg make forward progress across its descendant memcgs
> >> instead of restarting from the first child memcg each time.
> >
> > Is this a problem in practice?
> >
> > Is the concern the overhead of scanning memcgs repeatedly, or lack of
> > fairness? I wonder if we should just do writeback in batches from all
> > memcgs, similar to how reclaim does it, then evaluate at the end if we
> > need to start over?
> >
>
> Not using a per-cgroup cursor will cause issues for "repeated
> small-budget calls" cases. For example, repeatedly triggering a 2MB
> writeback might result in only writing back pages from the first few
> child memcgs every time. In the worst-case scenario (where the writeback
> amount is less than WB_BATCH), it might only ever write back from the
> first child memcg.
>
> Similar to how memory reclaim uses mem_cgroup_iter() (via struct
> mem_cgroup_reclaim_iter) and the old shrink_worker() used
> zswap_next_shrink, we need a shared cursor here.
>
>
> >>
> >> Naturally, group the cursor and its protecting spinlock into a
> >> zswap_wb_iter struct, and make it a member of struct mem_cgroup to
> >> realize per-memcg cursor management. Accordingly, shrink_worker() now
> >> uses the lock and cursor in root_mem_cgroup->zswap_wb_iter.
> >
> > If we really need to have per-memcg cursors (I am not a big fan), I
> > think we can minimize the overhead by making the cursor updates use
> > atomic cmpxchg instead of having a per-memcg lock.
> >
>
> Because mem_cgroup_iter() always calls css_put(&prev->css), we cannot
> simply update zswap_wb_iter.pos via cmpxchg() after calling it. Doing so
> could lead to a double css_put() issue on prev->css.
>
> Therefore, if we switch to the cmpxchg() approach, we wouldn't be able
> to reuse the existing mem_cgroup_iter() logic. We would have to write a
> new function similar to cgroup_iter(), and its implementation might end
> up looking a bit obscure/complex.
>
> Currently, this lock is only used in shrink_memcg(), proactive
> writeback, and mem_cgroup_css_offline(). Note that shrink_memcg() only
> acquires the lock of the root cgroup, and mem_cgroup_css_offline() is
> unlikely to be a hot path.
>
> So, should we keep the spin_lock or go with the cmpxchg() approach?
> Yosry and Nhat, what are your thoughts on this?

TBH, I think the spinlock is simpler at this point if we need to do
all of this explanation to justify correctness of cmpxchg :)

That said, if memcg folks feel like an extra spinlock per cgroup is a
bit much, we can go with the cmpxchg() approach. Please include a FAT
comment explains the compxchg() approach's nuance in the code though.
Speaking from experience, I will forget why it is correct 2 months
after the patch lands :)