Re: [PATCH 2/3] mm/zswap: Implement proactive writeback

From: Hao Jia

Date: Wed May 13 2026 - 04:05:13 EST

On 2026/5/12 23:47, Nhat Pham wrote:

On Tue, May 12, 2026 at 2:32 AM Hao Jia <jiahao.kernel@xxxxxxxxx> wrote:

On 2026/5/12 03:57, Yosry Ahmed wrote:

On Mon, May 11, 2026 at 12:49 PM Nhat Pham <nphamcs@xxxxxxxxx> wrote:

On Mon, May 11, 2026 at 3:52 AM Hao Jia <jiahao.kernel@xxxxxxxxx> wrote:

From: Hao Jia <jiahao1@xxxxxxxxxxx>

Zswap currently writes back pages to backing swap devices reactively,
triggered either by memory pressure via the shrinker or by the pool
reaching its size limit. This reactive approach offers no precise
control over when writeback happens, which can disturb latency-sensitive
workloads, and it cannot direct writeback at a specific memory cgroup.
However, there are scenarios where users might want to proactively
write back cold pages from zswap to the backing swap device, for
example, to free up memory for other applications or to prepare for
upcoming memory-intensive workloads.

Therefore, implement a proactive writeback mechanism for zswap by
adding a new cgroup interface file memory.zswap.proactive_writeback
within the memory controller.

Thanks Nhat, Yosry — let me address both comments together.

We already have memory.reclaim, no? Would that not work to create
headroom generally for your use case? Is there a reason why we are
treating zswap memory as special here?

Apologies for the lack of detailed explanation in the patch description,
which led to the confusion.

While we are already utilizing memory.reclaim, it does not fully address
our requirements.

Our deployment runs a userspace proactive reclaimer that drives
memory.reclaim based on the system's runtime state (memory/CPU/IO
pressure, refault rate, ...) and workload-specific policy. That first
stage compresses cold anon pages into zswap. Entries that then remain in
zswap past a policy-defined age threshold are considered "twice cold",
and the reclaimer wants to write them back to the backing swap device at
a moment of its own choosing, to further reclaim the DRAM still held by
the compressed data.
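
As an illustration of the first stage, a userspace reclaimer can request
reclaim by writing a byte count to a cgroup's memory.reclaim file. The
sketch below is a minimal, hypothetical helper (the function name and the
example cgroup path are assumptions, not taken from this series). It
covers only the memory.reclaim write and not the proposed
memory.zswap.proactive_writeback file, whose input format is not
described in this message.

```python
import os


def request_reclaim(cgroup_dir: str, nbytes: int) -> bool:
    """Ask the kernel to reclaim nbytes from the cgroup at cgroup_dir.

    Hypothetical helper: writes the byte count to memory.reclaim.
    Returns False if the write fails (for example, the kernel reports
    EAGAIN when it could not reclaim the full amount).
    """
    path = os.path.join(cgroup_dir, "memory.reclaim")
    try:
        with open(path, "w") as f:
            f.write(str(nbytes))
        return True
    except OSError:
        return False


# Example call (the cgroup path is an assumption for illustration):
# request_reclaim("/sys/fs/cgroup/workload.slice", 64 * 1024 * 1024)
```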

This is the "second-level offloading" pattern described in Meta's TMO
paper [1]. zswap proactive writeback is what this series introduces to
address that second-level offloading stage.

[1] https://www.pdl.cmu.edu/ftp/NVM/tmo_asplos22.pdf

Yeah that's what we've been trying to work on as well :) We are
working on a couple of improvements to the mechanism side of this path
(cc Alex) - hopefully it will help your use case too!

Anyway, back to my original inquiry: I understand your use case. It's
pretty similar to our goal. What I'm not getting is why memory.reclaim
(which you already use) is not sufficient for zswap -> disk swap
offloading too.

Zswap objects are organized into an LRU and exposed to the shrinker
interface. Echo-ing to memory.reclaim should also offload some zswap
entries, correct? Are there still cold zswap entries that escape this,
somehow?

Yes, the memory.reclaim path does drive some zswap writeback, but
it is not enough for our case.

1. For a memcg that has reached steady state (a common case being
when memory.current is below the policy target), the userspace
reclaimer may not invoke memory.reclaim on it for a long time,
and so no second-level offloading happens through
memory.reclaim. In this state we want
memory.zswap.proactive_writeback to write back entries that
have sat in zswap past an age threshold, to further reclaim
the DRAM still held by the compressed data.

2. Even when memory.reclaim is running, the fraction of zswap
residency that ends up reaching the backing swap device is
still very small for many of our workloads, and the userspace
reclaimer has no way to participate in or control the
granularity of zswap writeback. So in our deployment we prefer
to leave the zswap shrinker disabled, decouple LRU -> zswap
from zswap -> swap, and use a dedicated proactive-writeback
interface that lifts the writeback policy into userspace where
it can evolve independently of the kernel.
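
Point 1 can be made concrete with a small sketch of one policy tick.
Everything here is hypothetical (the function, its parameters, and the
idea that the reclaimer tracks the age of the oldest zswap entry are
assumptions for illustration, not part of this series): once
memory.current is at or below the policy target, the amount passed to
memory.reclaim drops to zero, yet entries may still exceed the age
threshold.

```python
from dataclasses import dataclass


@dataclass
class Plan:
    reclaim_bytes: int      # amount the reclaimer would write to memory.reclaim
    wants_writeback: bool   # True if zswap entries have aged past the threshold


def plan_tick(memory_current: int, policy_target: int,
              oldest_entry_age_s: float, age_threshold_s: float) -> Plan:
    """Hypothetical per-cgroup policy step for a userspace reclaimer."""
    # Stage one: only reclaim when usage is above the policy target.
    reclaim_bytes = max(0, memory_current - policy_target)
    # Stage two: entries older than the threshold count as "twice cold".
    wants_writeback = oldest_entry_age_s > age_threshold_s
    return Plan(reclaim_bytes, wants_writeback)


# Steady state: usage is below target, so memory.reclaim is not invoked,
# but the oldest zswap entry is already past the age threshold.
steady = plan_tick(memory_current=900 << 20, policy_target=1 << 30,
                   oldest_entry_age_s=7200, age_threshold_s=3600)
print(steady)
```

In that state nothing in the memory.reclaim path acts on
wants_writeback, which is the gap a dedicated interface is meant to
cover.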

Thanks,
Hao

Furthermore, we already have a way to detect the "twice cold" entries
you mentioned: the referenced bit. This is analogous to the way we
treat uncompressed pages.

+1, why do we need to specifically proactively reclaim the compressed memory?

Also, if we do need to minimize the compressed memory and force higher
writeback rates, we can do so with memory.zswap.max, right?

Here are a few reasons why memory.zswap.max is not enough:

1. Writing memory.zswap.max itself does not trigger any writeback
immediately. For a memcg that has reached steady state (on which the
userspace reclaimer is no longer invoking memory.reclaim), after enough
time has passed, the reclaimer has no good way to trigger proactive
writeback for second-level offloading by lowering memory.zswap.max,
because in steady state nothing drives the zswap_store() ->
shrink_memcg() path. The userspace reclaimer still has no control over
when proactive writeback happens.

2. memory.zswap.max currently triggers zswap writeback via zswap_store()
-> shrink_memcg(), and each over-limit event can write back at most
NR_NODES entries. If zswap residency is far above memory.zswap.max,
converging to the target size requires at least
O(over-limit pages / NR_NODES) zswap_store() events, with no batching,
so proactive writeback through this path has significant latency.

3. memory.zswap.max is a stateful interface. If the userspace reclaimer
crashes for any reason mid-operation, it may leave memory.zswap.max at
whatever value it last wrote, putting the application in a persistently
throttled state.

4. Once the userspace reclaimer has lowered memory.zswap.max, if the
workload is rapidly expanding and triggers memory reclaim via
memory.high / kswapd / etc., the actual amount written back can exceed
what was intended.
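
The bound in point 2 can be worked through numerically. The sketch below
assumes, as stated above, that each over-limit zswap_store() event writes
back at most NR_NODES entries; the function name and the example figures
are made up for illustration.

```python
def min_store_events(over_limit_entries: int, nr_nodes: int) -> int:
    """Lower bound on zswap_store() events needed to converge.

    Assumes each over-limit event writes back at most nr_nodes entries,
    so the bound is ceil(over_limit_entries / nr_nodes).
    """
    if nr_nodes <= 0:
        raise ValueError("nr_nodes must be positive")
    if over_limit_entries <= 0:
        return 0
    # Integer ceiling division without floating point.
    return (over_limit_entries + nr_nodes - 1) // nr_nodes


# Illustrative figures: 262144 entries over the limit on a 2-node machine.
print(min_store_events(262144, 2))  # 131072
```

Since those store events only happen when something is being stored into
zswap, a memcg in steady state may take a long time to generate that
many.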

One more reason: IIRC, when you set memory.zswap.max to a value other
than 0 or max, every zswap store incurs a pretty expensive check
(obj_cgroup_may_zswap), which does a force flush
(__mem_cgroup_flush_stats). That was pretty expensive last time some
of our internal services played with it. So yeah, it's not ideal...

(if you're using this, might wanna profile this as well).

Thanks,
Hao