Re: [PATCH v5 4/6] mm/zswap: Implement proactive writeback

From: Hao Jia

Date: Mon Jun 29 2026 - 21:49:32 EST

On 2026/6/30 08:15, Yosry Ahmed wrote:

On Mon, Jun 29, 2026 at 07:20:30PM +0800, Hao Jia wrote:

From: Hao Jia <jiahao1@xxxxxxxxxxx>

Zswap currently writes back pages to backing swap reactively, triggered
either by the shrinker or when the pool reaches its size limit. There is
no mechanism to control the amount of writeback for a specific memory
cgroup. However, users may want to proactively write back zswap pages,
e.g., to free up memory for other applications or to prepare for
memory-intensive workloads.

Introduce a "source=" key to the memory.reclaim cgroup interface,
currently accepting the single value "zswap". When set to "zswap", it
bypasses standard memory reclaim and exclusively performs proactive
zswap writeback up to the requested budget. If omitted, the default
reclaim behavior remains unchanged.

Example usage:
# Write back 10MB of compressed data from zswap to the backing swap
echo "10M source=zswap" > memory.reclaim
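
For illustration only, the request format above can be modelled in userspace C. This is a rough sketch, not the kernel parser: struct reclaim_req and parse_reclaim_request() are hypothetical names, the size suffix handling only approximates memparse(), the input is assumed to be already stripped of surrounding whitespace, and the swappiness keys are left out.

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical userspace stand-in for a parsed memory.reclaim request. */
struct reclaim_req {
	uint64_t bytes;   /* requested budget in bytes */
	int zswap_only;   /* 1 if "source=zswap" was given */
};

/* Parse "<size>[K|M|G] [source=zswap]". Returns 0 or -EINVAL. */
static int parse_reclaim_request(const char *buf, struct reclaim_req *req)
{
	char *end;
	uint64_t bytes = strtoull(buf, &end, 0);

	if (end == buf)
		return -EINVAL;	/* no size at the start of the request */

	/* Approximation of memparse(): each suffix step multiplies by 1024. */
	switch (*end) {
	case 'G': case 'g':
		bytes <<= 10;
		/* fall through */
	case 'M': case 'm':
		bytes <<= 10;
		/* fall through */
	case 'K': case 'k':
		bytes <<= 10;
		end++;
		break;
	default:
		break;
	}

	while (*end == ' ')
		end++;

	req->bytes = bytes;
	req->zswap_only = 0;

	if (*end == '\0')
		return 0;	/* no key given: default reclaim behavior */

	/* Only the "source=" key is modelled, and only the value "zswap". */
	if (strncmp(end, "source=", 7) != 0)
		return -EINVAL;
	if (strcmp(end + 7, "zswap") != 0)
		return -EINVAL;

	req->zswap_only = 1;
	return 0;
}
```

With this sketch, "10M source=zswap" yields a budget of 10485760 bytes with zswap_only set, while an unknown source value is rejected with -EINVAL, mirroring the behavior described in the commit message.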

Note that the actual amount of compressed data written back may be less
than requested due to the zswap second-chance algorithm: referenced
entries are rotated on the LRU on the first encounter and only written
back on a second pass. If fewer bytes are written back than requested,
-EAGAIN is returned, matching the existing memory.reclaim semantics.

Internally, extend user_proactive_reclaim() to parse the new "source="
key and invoke the dedicated handler zswap_proactive_writeback() when it
is set to "zswap". This handler walks the target memcg subtree in a
round-robin fashion and drains each memcg's per-node zswap LRUs through
shrink_memcg(), accumulating the compressed bytes written back until the
requested budget is met.
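
As a rough userspace model of the round-robin walk described above (not the kernel code): struct group, shrink_one() and proactive_writeback() are hypothetical names, each group simply holds a byte count, and the loop is assumed to give up after one full pass in which no group made progress. The -EAGAIN return on a short writeback follows the commit message.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model: each group holds some compressed bytes in zswap. */
struct group {
	uint64_t zswap_bytes;
};

/* Write back at most one batch from a group; returns the bytes written. */
static uint64_t shrink_one(struct group *g, uint64_t batch)
{
	uint64_t n = g->zswap_bytes < batch ? g->zswap_bytes : batch;

	g->zswap_bytes -= n;
	return n;
}

/*
 * Visit the groups in round-robin order until the budget is met or a
 * full pass makes no progress (an assumption of this sketch).
 * Returns 0 if the budget was met, -EAGAIN otherwise.
 */
static int proactive_writeback(struct group *groups, size_t n,
			       uint64_t budget, uint64_t batch,
			       uint64_t *written)
{
	size_t i = 0, idle = 0;

	*written = 0;
	while (*written < budget && idle < n) {
		uint64_t got = shrink_one(&groups[i], batch);

		idle = got ? 0 : idle + 1;	/* count consecutive empty visits */
		*written += got;
		i = (i + 1) % n;		/* next group, wrapping around */
	}

	return *written < budget ? -EAGAIN : 0;
}
```

Note that the model can overshoot the budget by up to one batch, because a whole batch is accounted after each visit.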

Suggested-by: Yosry Ahmed <yosry@xxxxxxxxxx>
Suggested-by: Nhat Pham <nphamcs@xxxxxxxxx>
Signed-off-by: Hao Jia <jiahao1@xxxxxxxxxxx>
---

Before going through more versions we need to figure out if this will
pivot to be a proactive demotion interface for swap tiering.

Yes. Should I drop patches 4-6 in the next version and wait for swap tiering to be finalized?
We can try to get the non-memcg parts (patches 1-3) merged upstream first. This would also give them plenty of time to bake and catch any potential regressions. Thoughts?

@@ -7869,9 +7872,12 @@ int user_proactive_reclaim(char *buf,
unsigned int nr_retries = MAX_RECLAIM_RETRIES;
unsigned long nr_to_reclaim, nr_reclaimed = 0;
int swappiness = -1;
+ bool zswap_writeback_only = false;
char *old_buf, *start;
+ char source[16];
substring_t args[MAX_OPT_ARGS];
gfp_t gfp_mask = GFP_KERNEL;
+ u64 nr_bytes;
if (!buf || (!memcg && !pgdat) || (memcg && pgdat))
return -EINVAL;
@@ -7879,7 +7885,8 @@ int user_proactive_reclaim(char *buf,
buf = strstrip(buf);
old_buf = buf;
- nr_to_reclaim = memparse(buf, &buf) / PAGE_SIZE;
+ nr_bytes = memparse(buf, &buf);
+ nr_to_reclaim = nr_bytes / PAGE_SIZE;

Nit: if we keep this as part of memory.reclaim, we probably want to
choose clearer names (e.g. pages_to_reclaim and bytes_to_reclaim).

Will do.

if (buf == old_buf)
return -EINVAL;
@@ -7899,11 +7906,26 @@ int user_proactive_reclaim(char *buf,
case MEMORY_RECLAIM_SWAPPINESS_MAX:
swappiness = SWAPPINESS_ANON_ONLY;
break;
+ case MEMORY_RECLAIM_SOURCE:
+ if (match_strlcpy(source, &args[0], sizeof(source)) >= sizeof(source))
+ return -EINVAL;
+ /* Only zswap is supported as a reclaim source for now. */
+ if (strcmp(source, "zswap"))
+ return -EINVAL;
+ zswap_writeback_only = true;
+ break;
default:
return -EINVAL;
}
}
+ if (zswap_writeback_only) {
+ /* source=zswap and swappiness are mutually exclusive. */
+ if (swappiness != -1)
+ return -EINVAL;
+ return zswap_proactive_writeback(memcg, nr_bytes);
+ }
+
while (nr_reclaimed < nr_to_reclaim) {
/* Will converge on zero, but reclaim enforces a minimum */
unsigned long batch_size = (nr_to_reclaim - nr_reclaimed) / 4;
diff --git a/mm/zswap.c b/mm/zswap.c
index ba01bf0e44e9..9cda96f05508 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -1713,6 +1713,56 @@ int zswap_load(struct folio *folio)
return 0;
}
+int zswap_proactive_writeback(struct mem_cgroup *memcg, u64 bytes_to_writeback)
+{
+ struct zswap_shrink_state s = {};
+ struct mem_cgroup *iter = NULL;
+ u64 bytes_written = 0;
+ int ret = 0;
+
+ if (!memcg)
+ return -EINVAL;

Can this ever happen? It would be a bug in the caller.

IIRC, writing the following to the NUMA node sysfs entry triggers this check:
echo "10M source=zswap" > /sys/devices/system/node/nodeN/reclaim

+ if (!mem_cgroup_zswap_writeback_enabled(memcg))
+ return -EINVAL;
+ if (!bytes_to_writeback)
+ return 0;

Do we need this? I think the loop will just never enter and
mem_cgroup_iter_break() will do nothing.

Will do.

+
+ while (bytes_written < bytes_to_writeback) {
+ long shrunk;
+
+ cond_resched();
+
+ if (signal_pending(current)) {
+ ret = -EINTR;
+ break;
+ }
+
+ /*
+ * Use a local iterator to walk the memcg and its online descendants
+ * in a round-robin manner. Upon exiting the loop, mem_cgroup_iter_break()
+ * must be called to drop the iterator reference.
+ */
+ do {
+ iter = mem_cgroup_iter(memcg, iter, NULL);
+ } while (iter && !mem_cgroup_tryget_online(iter));
+
+ shrunk = zswap_shrink_one_memcg(iter, &s);
+ if (shrunk > 0)
+ bytes_written += shrunk;
+
+ /* drop the extra reference taken by mem_cgroup_tryget_online() */
+ mem_cgroup_put(iter);

Can we just use mem_cgroup_online() instead since mem_cgroup_iter()
already grabs a ref?

Will do.

Thanks,
Hao