Re: [PATCH v3 1/4] mm/zswap: Make shrink_worker writeback cursor per-memcg

From: Yosry Ahmed

Date: Wed Jun 03 2026 - 13:57:22 EST

On Wed, Jun 03, 2026 at 11:02:54AM +0800, Hao Jia wrote:
>
>
> On 2026/6/3 07:19, Yosry Ahmed wrote:
> > > > > > > Proactive writeback also wants a similar per-memcg cursor that is
> > > > > > > scoped to the specified memcg, so that repeated invocations against
> > > > > > > the same memcg make forward progress across its descendant memcgs
> > > > > > > instead of restarting from the first child memcg each time.
> > > > > >
> > > > > > Is this a problem in practice?
> > > > > >
> > > > > > Is the concern the overhead of scanning memcgs repeatedly, or lack of
> > > > > > fairness? I wonder if we should just do writeback in batches from all
> > > > > > memcgs, similar to how reclaim does it, then evaluate at the end if we
> > > > > > need to start over?
> > > > > >
> > > > >
> > > > > Not using a per-cgroup cursor will cause issues for "repeated small-budget
> > > > > calls" cases. For example, repeatedly triggering a 2MB writeback might
> > > > > result in only writing back pages from the first few child memcgs every
> > > > > time. In the worst-case scenario (where the writeback amount is less than
> > > > > WB_BATCH), it might only ever write back from the first child memcg.
> > > >
> > > > Right, so a fairness concern?
> > > >
> > > > I wonder if we should just reclaim a batch from each memcg, then check
> > > > if we reached the goal, otherwise start over. If the batch size is small
> > > > enough that should work?
> > >
> > > Even with a small batch size, for small writeback requests triggered by
> > > user-space (e.g., 2MB, which is batch size * N), it might still repeatedly
> > > write back from only the first N child memcgs.
> >
> > Yes, I understand, I am asking if this is a problem in practice. For
> > this to be a problem we'd need to trigger small writeback requests and
> > have many memcgs.
> >
> > > This could cause the user-space agent to prematurely give up on zswap
> > > writeback.
> >
> > Why? The kernel should not return before trying to writeback from all
> > memcgs. If we scan the first N child memcgs and did not writeback
> > enough, we should keep going, right?
> >
>
> Yes, this issue is not caused by the kernel, but rather by our user-space
> agent itself.
>
> For instance, suppose a parent memcg has two children, memcg1 and memcg2,
> each with 200MB of zswap (100MB inactive). Triggering proactive writeback on
> the parent memcg will exhaust memcg1's inactive zswap pages. After that,
> even though memcg2 still has plenty of inactive zswap pages, it will
> continue to write back memcg1's active zswap pages. Writing back active
> zswap pages causes the user-space agent to prematurely abort the writeback
> because it detects that certain memcg metrics have exceeded predefined
> thresholds.

This will only happen if the reclaim size is smaller than the batch
size, right? Otherwise the kernel should reclaim more or less equally
from both memcgs?

> Of course, real-world scenarios are much more complex, and this kind of case
> is extremely rare in our environment.
>
> That being said, your suggestion of using the global lock for the per-memcg
> cursors makes the writeback fairer and would resolve these corner cases.

Right, but I'd rather not do per-memcg cursors at all if we can avoid
it. Will using batches help make reclaim fair over all memcgs without a
cursor?

We can always add the cursor later if needed.