Re: [PATCH 2/3] mm/zswap: Implement proactive writeback

From: Nhat Pham

Date: Tue May 12 2026 - 11:53:26 EST

On Tue, May 12, 2026 at 2:32 AM Hao Jia <jiahao.kernel@xxxxxxxxx> wrote:
>
>
>
> On 2026/5/12 03:57, Yosry Ahmed wrote:
> > On Mon, May 11, 2026 at 12:49 PM Nhat Pham <nphamcs@xxxxxxxxx> wrote:
> >>
> >> On Mon, May 11, 2026 at 3:52 AM Hao Jia <jiahao.kernel@xxxxxxxxx> wrote:
> >>>
> >>> From: Hao Jia <jiahao1@xxxxxxxxxxx>
> >>>
> >>> Zswap currently writes back pages to backing swap devices reactively,
> >>> triggered either by memory pressure via the shrinker or by the pool
> >>> reaching its size limit. This reactive approach offers no precise
> >>> control over when writeback happens, which can disturb latency-sensitive
> >>> workloads, and it cannot direct writeback at a specific memory cgroup.
> >>> However, there are scenarios where users might want to proactively
> >>> write back cold pages from zswap to the backing swap device, for
> >>> example, to free up memory for other applications or to prepare for
> >>> upcoming memory-intensive workloads.
> >>>
> >>> Therefore, implement a proactive writeback mechanism for zswap by
> >>> adding a new cgroup interface file memory.zswap.proactive_writeback
> >>> within the memory controller.
> >>
>
> Thanks Nhat, Yosry — let me address both comments together.
>
> >>
> >> We already have memory.reclaim, no? Would that not work to create
> >> headroom generally for your use case? Is there a reason why we are
> >> treating zswap memory as special here?
> >
>
> Apologies for the lack of detailed explanation in the patch description,
> which led to the confusion.
>
> While we are already utilizing memory.reclaim, it does not fully address
> our requirements.
>
> Our deployment runs a userspace proactive reclaimer that drives
> memory.reclaim based on the system's runtime state (memory/CPU/IO
> pressure, refault rate, ...) and workload-specific policy. That first
> stage compresses cold anon pages into zswap. Entries that then remain
> in zswap past a policy-defined age threshold are considered "twice
> cold", and the reclaimer wants to write them back to the backing swap
> device at a moment of its own choosing, to further reclaim the DRAM
> still held by the compressed data.
>
> This is the "second-level offloading" pattern described in Meta's TMO
> paper [1]. zswap proactive writeback is what this series introduces to
> address that second-level offloading stage.
>
> [1] https://www.pdl.cmu.edu/ftp/NVM/tmo_asplos22.pdf

Yeah that's what we've been trying to work on as well :) We are
working on a couple of improvements to the mechanism side of this path
(cc Alex) - hopefully it will help your use case too!

Anyway, back to my original inquiry: I understand your use case. It's
pretty similar to our goal. What I'm not getting is why memory.reclaim
(which you already use) is not sufficient for zswap -> disk swap
offloading too.

Zswap objects are organized into an LRU and exposed through the shrinker
interface. Echoing to memory.reclaim should also offload some zswap
entries, correct? Are there still cold zswap entries that somehow escape
this?
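
For concreteness, the request I have in mind is just a write of a byte
count to the cgroup's memory.reclaim file. A minimal userspace sketch in
Python follows. The helper name and any cgroup path you pass in are made
up for illustration, and it assumes cgroup v2 and enough privilege to
write to the file:

```python
import os


def request_reclaim(cgroup_dir: str, nr_bytes: int) -> str:
    """Ask the kernel to reclaim nr_bytes from the cgroup at cgroup_dir.

    Hypothetical helper: writes the byte count to memory.reclaim and
    returns the string that was written.
    """
    if nr_bytes <= 0:
        raise ValueError("nr_bytes must be positive")
    payload = str(nr_bytes)
    path = os.path.join(cgroup_dir, "memory.reclaim")
    with open(path, "w") as f:
        # On a real cgroup file the write fails with EAGAIN (an OSError
        # in Python) if fewer bytes were reclaimed than requested.
        f.write(payload)
    return payload
```

Something like request_reclaim("/sys/fs/cgroup/workload", 64 << 20) would
ask for 64 MiB (the path is hypothetical). Whether such a request ends up
writing back enough cold zswap entries is exactly what I'm asking about.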

Furthermore, we already have a way to detect the "twice cold" entries
you mentioned: the referenced bit. This is analogous to the way we
treat uncompressed pages.

>
>
> > +1, why do we need to specifically proactively reclaim the compressed memory?
> >
> > Also, if we do need to minimize the compressed memory and force higher
> > writeback rates, we can do so with memory.zswap.max, right?
>
> Here are a few reasons why memory.zswap.max is not enough:
>
> 1. Writing memory.zswap.max itself does not trigger any writeback
> immediately. For a memcg that has reached steady state (on which the
> userspace reclaimer is no longer invoking memory.reclaim), after enough
> time has passed, the reclaimer has no good way to trigger proactive
> writeback for second-level offloading by lowering memory.zswap.max,
> because in steady state nothing drives the
> zswap_store() -> shrink_memcg() path. The userspace reclaimer still has
> no control over when proactive writeback happens.
>
> 2. memory.zswap.max currently triggers zswap writeback via
> zswap_store() -> shrink_memcg(), and each over-limit event can write
> back at most NR_NODES entries. If zswap residency is far above
> memory.zswap.max, converging to the target size requires at least
> O(over-limit pages / NR_NODES) zswap_store() events, with no batching,
> so proactive writeback has significant latency.
>
> 3. memory.zswap.max is a stateful interface. If the userspace reclaimer
> crashes for any reason mid-operation, it may leave memory.zswap.max at
> some set value, putting the application in a persistently throttled
> bad state.
>
> 4. Once the userspace reclaimer has lowered memory.zswap.max, if the
> workload is rapidly expanding and triggers memory reclaim via
> memory.high / kswapd / etc., the actual amount written back can exceed
> what was intended.
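
To put a rough number on point 2: if each over-limit zswap_store() event
writes back at most NR_NODES entries, the minimum number of events is a
ceiling division. A small sketch, where the function name and the example
figures (1 GiB over the limit, 4 KiB pages, 2 nodes) are assumptions
picked for illustration:

```python
def min_store_events(over_limit_pages: int, nr_nodes: int) -> int:
    """Lower bound on over-limit zswap_store() events needed to write
    back over_limit_pages, if each event evicts at most nr_nodes entries."""
    if over_limit_pages < 0 or nr_nodes <= 0:
        raise ValueError("need over_limit_pages >= 0 and nr_nodes > 0")
    return -(-over_limit_pages // nr_nodes)  # ceiling division


# Assumed example: 1 GiB over the limit, 4 KiB pages, 2 NUMA nodes.
example_pages = (1 << 30) // 4096                  # 262144 pages
example_events = min_store_events(example_pages, 2)  # 131072 events
```

So in that example you would need on the order of 131072 store events
before the pool converges, and that assumes every one of them actually
evicts an entry on each node.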

One more reason: IIRC, when you set memory.zswap.max to a value other
than 0 or max, every zswap store goes through a check
(obj_cgroup_may_zswap) that does a forced stats flush
(__mem_cgroup_flush_stats). That was pretty expensive last time some
of our internal services played with it. So yeah, it's not ideal...

(if you're using this, might wanna profile this as well).
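
Roughly, the shape of that check as I remember it, written as a userspace
model in Python rather than the kernel code. It is simplified to a single
cgroup (the real check walks up the hierarchy), and every name here is a
stand-in, not a kernel identifier:

```python
from typing import Callable, Optional

UNLIMITED: Optional[int] = None  # stands in for memory.zswap.max == "max"


def may_zswap_model(
    zswap_max: Optional[int],
    flush_stats: Callable[[], None],
    read_usage_pages: Callable[[], int],
) -> bool:
    """Model of the per-store limit check for one cgroup.

    flush_stats and read_usage_pages are hypothetical callbacks that
    stand in for the stats flush and the zswap usage read.
    """
    if zswap_max is UNLIMITED:
        return True   # no limit: nothing to check, no flush
    if zswap_max == 0:
        return False  # zswap disabled for this cgroup: no flush either
    flush_stats()     # the costly part, paid on every store
    return read_usage_pages() < zswap_max
```

The point is that only the "finite, non-zero limit" branch pays for the
flush, which is why it only shows up once you start tuning
memory.zswap.max.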

>
> Thanks,
> Hao