Re: [PATCH 2/3] mm/zswap: Implement proactive writeback

From: Nhat Pham

Date: Wed May 13 2026 - 17:14:52 EST

On Wed, May 13, 2026 at 1:04 AM Hao Jia <jiahao.kernel@xxxxxxxxx> wrote:
>
>
>
> On 2026/5/12 23:47, Nhat Pham wrote:
> > On Tue, May 12, 2026 at 2:32 AM Hao Jia <jiahao.kernel@xxxxxxxxx> wrote:
> >>
> >>
> >>
> >> On 2026/5/12 03:57, Yosry Ahmed wrote:
> >>> On Mon, May 11, 2026 at 12:49 PM Nhat Pham <nphamcs@xxxxxxxxx> wrote:
> >>>>
> >>>> On Mon, May 11, 2026 at 3:52 AM Hao Jia <jiahao.kernel@xxxxxxxxx> wrote:
> >>>>>
> >>>>> From: Hao Jia <jiahao1@xxxxxxxxxxx>
> >>>>>
> >>>>> Zswap currently writes back pages to backing swap devices reactively,
> >>>>> triggered either by memory pressure via the shrinker or by the pool
> >>>>> reaching its size limit. This reactive approach offers no precise
> >>>>> control over when writeback happens, which can disturb latency-sensitive
> >>>>> workloads, and it cannot direct writeback at a specific memory cgroup.
> >>>>> However, there are scenarios where users might want to proactively
> >>>>> write back cold pages from zswap to the backing swap device, for
> >>>>> example, to free up memory for other applications or to prepare for
> >>>>> upcoming memory-intensive workloads.
> >>>>>
> >>>>> Therefore, implement a proactive writeback mechanism for zswap by
> >>>>> adding a new cgroup interface file memory.zswap.proactive_writeback
> >>>>> within the memory controller.
> >>>>
> >>
> >> Thanks Nhat, Yosry — let me address both comments together.
> >>
> >>>>
> >>>> We already have memory.reclaim, no? Would that not work to create
> >>>> headroom generally for your use case? Is there a reason why we are
> >>>> treating zswap memory as special here?
> >>>
> >>
> >> Apologies for the lack of detailed explanation in the patch description,
> >> which led to the confusion.
> >>
> >> While we are already utilizing memory.reclaim, it does not fully address
> >> our requirements.
> >>
> >> Our deployment runs a userspace proactive reclaimer that drives
> >> memory.reclaim based on the system's runtime state (memory/CPU/IO
> >> pressure, refault rate, ...) and workload-specific policy. That first
> >> stage compresses cold anon pages into zswap. Entries that then remain
> >> in zswap past a policy-defined age threshold are considered "twice
> >> cold", and the reclaimer wants to write them back to the backing swap
> >> device at a moment of its own choosing, to further reclaim the DRAM
> >> still held by the compressed data.
> >>
> >> This is the "second-level offloading" pattern described in Meta's TMO
> >> paper [1]. zswap proactive writeback is what this series introduces to
> >> address that second-level offloading stage.
> >>
> >> [1] https://www.pdl.cmu.edu/ftp/NVM/tmo_asplos22.pdf
> >
> > Yeah that's what we've been trying to work on as well :) We are
> > working on a couple of improvements to the mechanism side of this path
> > (cc Alex) - hopefully it will help your use case too!
> >
> > Anyway, back to my original inquiry: I understand your use case. It's
> > pretty similar to our goal. What I'm not getting is why is
> > memory.reclaim (which you already use) not sufficient for zswap ->
> > disk swap offloading too?
> >
> > Zswap objects are organized into LRU and exposed to the shrinker
> > interface. Echo-ing to memory.reclaim should also offload some zswap
> > entries, correct? Are there still cold zswap entries that escape this,
> > somehow?
> >
>
> Yes, the memory.reclaim path does drive some zswap writeback, but
> it is not enough for our case.
>
> 1. For a memcg that has reached steady state (a common case being
> when memory.current is below the policy target), the userspace
> reclaimer may not invoke memory.reclaim on it for a long time,
> and so no second-level offloading happens through
> memory.reclaim. In this state we want
> memory.zswap.proactive_writeback to write back entries that
> have sat in zswap past an age threshold, to further reclaim
> the DRAM still held by the compressed data.
>
> 2. Even when memory.reclaim is running, the fraction of zswap
> residency that ends up reaching the backing swap device is
> still very small for many of our workloads, and the userspace
> reclaimer has no way to participate in or control the
> granularity of zswap writeback. So in our deployment we prefer
> to leave the zswap shrinker disabled, decouple LRU -> zswap
> from zswap -> swap, and use a dedicated proactive-writeback
> interface that lifts the writeback policy into userspace where
> it can evolve independently of the kernel.

I see. It's interesting - we've been dealing with the opposite
problem (reclaiming too much from zswap), so it's refreshing to see
the other end of the spectrum :) We should invest more into this to
see why we are not reclaiming enough, but I see the value of adding a
knob to hit zswap exclusively.

Regarding age-based reclaim, I agree with Yosry here. Let us try to
land an interface to do targeted reclaim on compressed memory first. I
do see the value of age information: with it, you can track zswap
entry ages and the distribution of refault ages, and only reclaim
the tail. However, I wonder if you can just build a system that adapts
the reclaim request size based on PSI, refault rate, etc., similar to
how you're adjusting memory.reclaim on uncompressed memory with a
senpai-like system. Something along the lines of: if we are swapping
in too much from disk (or if IO pressure is high), back off, and if
not, steal a bit more from the zswap pool (perhaps with a bigger step
size), etc. Is there a reason why zswap cannot adopt a similar
strategy?
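
To make that a bit more concrete, here is a rough userspace sketch of
the kind of feedback loop I have in mind. It is only an illustration:
the names (Signals, Policy, next_request), the thresholds, and the
grow/shrink factors are all made up, and it deliberately does not touch
any kernel interface. It only decides how many bytes to ask for in the
current interval and what step size to carry into the next one.

```python
from dataclasses import dataclass


@dataclass
class Signals:
    # Both fields are hypothetical inputs the reclaimer would sample itself.
    io_pressure: float   # e.g. an IO PSI average, in percent
    swapin_rate: float   # pages/s swapped in from the backing device


@dataclass
class Policy:
    # Illustrative defaults only; real values would need tuning per workload.
    max_io_pressure: float = 1.0
    max_swapin_rate: float = 50.0
    min_step: int = 1 << 20    # 1 MiB
    max_step: int = 64 << 20   # 64 MiB
    grow: float = 1.5
    shrink: float = 0.5


def next_request(step: int, sig: Signals, pol: Policy) -> tuple[int, int]:
    """Return (bytes to request now, step size for the next interval)."""
    overloaded = (sig.io_pressure > pol.max_io_pressure
                  or sig.swapin_rate > pol.max_swapin_rate)
    if overloaded:
        # Back off: skip this interval and shrink the step, but not below min_step.
        return 0, max(pol.min_step, int(step * pol.shrink))
    # Quiet: request the current step and grow it, capped at max_step.
    return step, min(pol.max_step, int(step * pol.grow))
```

The reclaimer would call next_request() once per interval and pass the
first value to whatever zswap writeback knob we end up landing, skipping
the write when it is 0.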