Re: [PATCH 0/3] mm/zswap: Implement per-cgroup proactive writeback

From: Hao Jia

Date: Tue May 12 2026 - 07:27:32 EST




On 2026/5/11 19:39, Michal Koutný wrote:
> On Mon, May 11, 2026 at 06:51:46PM +0800, Hao Jia <jiahao.kernel@xxxxxxxxx> wrote:
>> From: Hao Jia <jiahao1@xxxxxxxxxxx>
>>
>> Zswap currently writes back pages to backing swap devices reactively,
>> triggered either by memory pressure via the shrinker or by the pool
>> reaching its size limit. However, this reactive approach makes writeback
>> timing indeterminate and can disrupt latency-sensitive workloads when
>> eviction happens to coincide with a critical execution window.
>>
>> Furthermore, in certain scenarios, it is desirable to trigger writeback
>> in advance to free up memory. For example, users may want to prepare for
>> an upcoming memory-intensive workload by flushing cold memory to the
>> backing storage when the system is relatively idle.
>
> I can imagine the zswap writeout can come at the least possible
> moment...
>
>> To address these issues, this patch series introduces a per-cgroup
>> interface that allows users to proactively write back cold compressed
>> pages from zswap to the backing swap device.
>
> ...but I see this series is not only per-cgroup proactive reclaim but
> it's also age-based reclaim.
>
> The per-cg consumption and limits (and regular memory reclaim) are all
> measured in sizes. This age-based invocations don't seem commensurable
> (e.g. how would users in practice determine what is the desired input to
> here).


Thanks Michal, you are right. The series is both per-memcg *and*
age-based.

The interface carries a size budget, like memory.reclaim. The two
parameters play different roles:

"write back up to <max> bytes, chosen from entries whose residency
in zswap is at least <age>"

Size stays the unit of *amount*; age is just how we describe *which*
entries are eligible.
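
To make the two roles concrete, here is a minimal userspace model of
those semantics in Python. It is only a sketch: the Entry layout, the
oldest-first order and the name select_for_writeback are assumptions
made for illustration, not the kernel implementation in this series.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    """Hypothetical stand-in for one compressed zswap entry."""
    stored_at: float   # time the entry was stored in zswap (seconds)
    size: int          # compressed size in bytes


def select_for_writeback(entries, now, min_age, max_bytes):
    """Pick entries with residency >= min_age, up to max_bytes in total.

    Age only filters which entries are eligible; the byte budget alone
    bounds how much is written back. Oldest-first order is an assumption.
    """
    eligible = [e for e in entries if now - e.stored_at >= min_age]
    eligible.sort(key=lambda e: e.stored_at)  # oldest first

    chosen, used = [], 0
    for e in eligible:
        if used + e.size > max_bytes:
            break  # the next entry would exceed the byte budget
        chosen.append(e)
        used += e.size
    return chosen, used
```

With a generous budget the age filter decides the outcome; with a tight
budget the result is cut short even though more entries are old enough.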


> Could you explain more reasoning behind this design?


Context on the use case:

Our deployment runs a userspace proactive reclaimer driven by the
system's runtime state (memory/CPU/IO pressure, refault rate, ...)
and workload-specific policy. It uses memory.reclaim to drive
reclaim, which compresses cold anon pages into zswap as the first
stage. For entries that then remain in zswap past a policy-defined
age threshold, the reclaimer wants to write them back to the backing
swap device at a moment of its own choosing, to further reclaim the
DRAM still held by the compressed data.

Why age is a reasonable selector at this stage:

Pages in zswap have already passed a first-stage coldness judgement
(otherwise they would not have been compressed). For second-level
offloading, the question is which of them are cold *enough*.
Time-in-zswap is a natural proxy for that. A swap-in invalidates the
corresponding zswap entry and resets the clock, so by construction
an entry that has sat in zswap for N seconds has not been faulted in
for at least N seconds. Residency in zswap is therefore a strong
signal that the entry is not about to refault.
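
The "by construction" argument can be checked with a toy model. The
class and method names below are hypothetical; the only behaviour
modelled is the one described above, namely that a store starts the
clock and a swap-in invalidates the entry.

```python
class ZswapModel:
    """Toy model: tracks store time and last swap-in time per page."""

    def __init__(self):
        self.stored_at = {}    # page -> time it entered zswap
        self.last_fault = {}   # page -> time of its most recent swap-in

    def store(self, page, now):
        # Compressing the page into zswap starts its residency clock.
        self.stored_at[page] = now

    def swap_in(self, page, now):
        # Swap-in invalidates the entry, so the residency clock is lost.
        self.stored_at.pop(page, None)
        self.last_fault[page] = now

    def residency(self, page, now):
        """Seconds the page has sat in zswap, or None if not resident."""
        t = self.stored_at.get(page)
        return None if t is None else now - t

    def idle_time(self, page, now):
        """Seconds since the last swap-in (infinite if never faulted)."""
        t = self.last_fault.get(page)
        return float("inf") if t is None else now - t
```

Because a page can only be stored again after its last swap-in,
residency(page, now) never exceeds idle_time(page, now) in this model.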

In our deployment the userspace reclaimer starts from a conservative
threshold (the starting value depends on the workload) and adjusts it
through closed-loop feedback:

- on one side, the age distribution of zswap entries, to see
  whether there is a meaningful population past the threshold;
- on the other side, the post-writeback refault rate and related
  signals, to confirm that entries written back were in fact cold
  enough.

Both <age> and max=<bytes> are tuned against these signals until the
realized writeback volume matches the target. This is the same
control-loop style already used to drive the first-stage
memory.reclaim budget.
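
A single iteration of such a loop might look like the sketch below.
Everything in it is an assumption for illustration: the function name,
the multiplicative step, the refault-rate ceiling and the way the age
distribution is summarized as one number are not taken from the series
or from the deployed reclaimer.

```python
def tune(age, max_bytes, bytes_past_age, refault_rate, written, target,
         refault_ceiling=0.01, step=1.25, min_age=1.0):
    """Return the (age, max_bytes) to use for the next writeback request.

    age            current residency threshold in seconds
    max_bytes      current byte budget
    bytes_past_age compressed bytes currently older than `age`
    refault_rate   fraction of recently written-back pages that refaulted
    written        bytes actually written back in the last round
    target         desired bytes written back per round
    """
    if refault_rate > refault_ceiling:
        # Written-back entries were not cold enough: back off on both knobs.
        return age * step, int(max_bytes / step)
    if written < target:
        if bytes_past_age < target:
            # Too few entries are past the threshold: lower the age.
            return max(min_age, age / step), max_bytes
        # Enough old entries exist, so the budget was the limit: raise it.
        return age, int(max_bytes * step)
    if written > target:
        # Overshoot: shrink the budget and keep the age threshold.
        return age, int(max_bytes / step)
    return age, max_bytes
```

The refault check comes first so that a wrong coldness judgement always
wins over the wish to reach the volume target.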

Thanks,
Hao