Re: [PATCH v4 2/5] mm/zswap: Factor writeback loop out of shrink_worker()

From: Yosry Ahmed

Date: Fri Jun 26 2026 - 13:10:52 EST

> >> /*
> >> * Take one step of a memcg-tree writeback walk driven by the caller's
> >> * iterator, and fold the result into @s, the retry bookkeeping shared
> >> * across steps. @memcg is the iterator's current memcg, or NULL once
> >> * it has wrapped around after a full pass over the tree.
> >> *
> >> * The function returns -EAGAIN to signal the caller to abort the walk
> >> * after encountering the following conditions MAX_RECLAIM_RETRIES times:
> >> * - No writeback-candidate memcgs were found in a memcg tree walk.
> >> * - Shrinking a writeback-candidate memcg failed.
> >
> > Orthogonal to this patch, but I wonder if this can be simplified. I
> > wonder if these two conditions can be replaced with "shrinking a memcg
> > that has zswap entries failed". The "no writeback-candidate memcgs in
> > the tree" case seems like we should abort right away instead of
> > retrying?
> >
> > Nhat, WDYT?
> >
>
> Perhaps something like the following is what you had in mind? I've
> drafted the implementation below to make it easier for Nhat to compare
> with the previous behavior.

Hmm I think if we pursue this it should be in a separate patch or even
outside of this series, ideally with numbers/proof that it's not
introducing regressions to the scenario that lead to its introduction.

>
>
> >> *
> >> * Return: The number of compressed bytes written back (>= 0), or -EAGAIN
> >> * once the retry budget is exhausted and the caller should abort the walk.
> >> */
> >> static long zswap_shrink_one(struct mem_cgroup *memcg,
> >
> > Nit: zswap_shrink_one_memcg()
> >
> > BTW, the existing writeback logic has been broken for a while now when
> > memcg is disabled. I think we constantly hit the !memcg case and run
> > out of retries. Not sure if your patch changes this in any way, or if
> > you want to fix that while you're at it :)
>
> Yes, I'd be happy to do that. However, would it be better to submit a
> separate fix patch or combine it with this one?

A separate patch. Feel free to send it with this series to avoid
conflicts, but probably as patch 1 as we'll want to CC stable on it.

[..]

> /* Track progress of a memcg-tree writeback walk. */
> struct zswap_shrink_state {
> int scans;
> int failures;
> };
>
> /*
> * Take one step of a memcg-tree writeback walk driven by the caller's
> * iterator, and fold the result into @s, the retry bookkeeping shared
> * across steps. @memcg is the iterator's current memcg, or NULL once
> * it has wrapped around after a full pass over the tree.
> *
> * The function returns -EBUSY to signal the caller to abort the walk when
> * either of the following occurs:
> * - A full pass over the tree found no writeback-candidate memcg.
> * - Shrinking a writeback-candidate memcg failed MAX_RECLAIM_RETRIES
> times.
> *
> * When memory cgroup is disabled, the iterator always yields NULL. All
> * zswap entries then live on the root list_lru, so NULL is treated as the
> * root memcg and shrunk directly rather than as a completed tree pass.

I think this chunk should be moved above the code returning -EBUSY
when mem_cgroup_disabled() is true, and probably made more succinct as
it should be obvious.

> *
> * Return: The number of compressed bytes written back (>= 0), or -EBUSY
> * when the caller should abort the walk.
> */
> static long zswap_shrink_one_memcg(struct mem_cgroup *memcg,
> struct zswap_shrink_state *s)
> {
> bool disabled = mem_cgroup_disabled();

No need to store this in a variable AFAICT, it's a static branch and
it's clearer to just call it directly in both call sites imo.

> long shrunk;
>
> /*
> * If the iterator has completed a full pass, update the shrink state
> * and check whether we should keep going.
> * With memcg disabled the iterator always yields NULL, so fall through
> * and shrink the root memcg directly instead.
> */
> if (!memcg && !disabled) {
> /*
> * Abort if no writeback-candidate memcgs in the last tree walk.
> * Otherwise reset the scans count and continue.
> */
> if (!s->scans)
> return -EBUSY;
> s->scans = 0;
> return 0;
> }
>
> shrunk = shrink_memcg(memcg, NR_ZSWAP_WB_BATCH);
>
> /*
> * There are no writeback-candidate pages in the memcg. With memcg
> * enabled this is not an issue as long as we can find another memcg
> * with pages in zswap, so skip without counting it as a candidate.
> * With memcg disabled the root LRU is the only target, so we should
> * abort if it has no writeback-candidate pages.
> */
> if (shrunk == -ENOENT)
> return disabled ? -EBUSY : 0;
> s->scans++;
>
> if (shrunk <= 0 && ++s->failures == MAX_RECLAIM_RETRIES)
> return -EBUSY;
>
> return shrunk;
> }
>
> static void shrink_worker(struct work_struct *w)
> {
> struct zswap_shrink_state s = {};
> unsigned long thr;
>
> /* Reclaim down to the accept threshold */
> thr = zswap_accept_thr_pages();
>
> /*
> * Global reclaim will select cgroup in a round-robin fashion from all
> * online memcgs, but memcgs that have no pages in zswap and
> * writeback-disabled memcgs (memory.zswap.writeback=0) are not
> * candidates for shrinking.
> *
> * We save iteration cursor memcg into zswap_next_shrink,
> * which can be modified by the offline memcg cleaner
> * zswap_memcg_offline_cleanup().
> *
> * Since the offline cleaner is called only once, we cannot leave an
> * offline memcg reference in zswap_next_shrink.
> * We can rely on the cleaner only if we get online memcg under lock.
> *
> * If we get an offline memcg, we cannot determine if the cleaner has
> * already been called or will be called later. We must put back the
> * reference before returning from this function. Otherwise, the
> * offline memcg left in zswap_next_shrink will hold the reference
> * until the next run of shrink_worker().
> */
> while (zswap_total_pages() > thr) {
> struct mem_cgroup *memcg;
> long ret;
>
> cond_resched();
> /*
> * Start shrinking from the next memcg after zswap_next_shrink.
> * When the offline cleaner has already advanced the cursor,
> * advancing the cursor here overlooks one memcg, but this
> * should be negligibly rare.
> *
> * If we get an online memcg, keep the extra reference in case
> * the original one obtained by mem_cgroup_iter() is dropped by
> * zswap_memcg_offline_cleanup() while we are shrinking the
> * memcg.
> */
> spin_lock(&zswap_shrink_lock);
> do {
> memcg = mem_cgroup_iter(NULL, zswap_next_shrink, NULL);
> zswap_next_shrink = memcg;
> } while (memcg && !mem_cgroup_tryget_online(memcg));
> spin_unlock(&zswap_shrink_lock);
>
> ret = zswap_shrink_one_memcg(memcg, &s);
> /* drop the extra reference taken above */
> mem_cgroup_put(memcg);
> if (ret == -EBUSY)
> break;
> }
> }
>
>
> Thanks,
> Hao