Re: [PATCH v3 2/4] mm/zswap: Implement proactive writeback

From: Yosry Ahmed

Date: Fri May 29 2026 - 21:37:54 EST

On Tue, May 26, 2026 at 07:45:59PM +0800, Hao Jia wrote:
> From: Hao Jia <jiahao1@xxxxxxxxxxx>
>
> Zswap currently writes back pages to backing swap reactively, triggered
> either by the shrinker or when the pool reaches its size limit. There is
> no mechanism to control the amount of writeback for a specific memory
> cgroup. However, users may want to proactively write back zswap pages,
> e.g., to free up memory for other applications or to prepare for
> memory-intensive workloads.
>
> Introduce a "zswap_writeback_only" key to the memory.reclaim cgroup
> interface. When specified, this key bypasses standard memory reclaim
> and exclusively performs proactive zswap writeback up to the requested
> budget. If omitted, the default reclaim behavior remains unchanged.
>
> Example usage:
> # Write back 100MB of pages from zswap to the backing swap
> echo "100M zswap_writeback_only" > memory.reclaim
>
> Note that the actual amount written back may be less than requested due
> to the zswap second-chance algorithm: referenced entries are rotated on
> the LRU on the first encounter and only written back on a second pass.
> If fewer bytes are written back than requested, -EAGAIN is returned,
> matching the existing memory.reclaim semantics.
>
> Internally, extend user_proactive_reclaim() to parse the new
> "zswap_writeback_only" token and invoke the dedicated handler. Add
> zswap_proactive_writeback() to walk the target memcg subtree via the
> per-memcg writeback cursor, draining per-node zswap LRUs through
> list_lru_walk_one() with the shrink_memcg_cb() callback.
>
> Suggested-by: Yosry Ahmed <yosry@xxxxxxxxxx>
> Suggested-by: Nhat Pham <nphamcs@xxxxxxxxx>
> Signed-off-by: Hao Jia <jiahao1@xxxxxxxxxxx>
> ---
> Documentation/admin-guide/cgroup-v2.rst | 18 +++-
> Documentation/admin-guide/mm/zswap.rst | 11 +-
> include/linux/zswap.h | 7 ++
> mm/vmscan.c | 14 +++
> mm/zswap.c | 138 ++++++++++++++++++++++++
> 5 files changed, 185 insertions(+), 3 deletions(-)
>
> diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
> index 6efd0095ed99..6564abf0dec5 100644
> --- a/Documentation/admin-guide/cgroup-v2.rst
> +++ b/Documentation/admin-guide/cgroup-v2.rst
> @@ -1425,9 +1425,10 @@ PAGE_SIZE multiple when read back.
>
> The following nested keys are defined.
>
> - ========== ================================
> + ==================== ==================================================
> swappiness Swappiness value to reclaim with
> - ========== ================================
> + zswap_writeback_only Only perform proactive zswap writeback
> + ==================== ==================================================
>
> Specifying a swappiness value instructs the kernel to perform
> the reclaim with that swappiness value. Note that this has the
> @@ -1437,6 +1438,19 @@ The following nested keys are defined.
> The valid range for swappiness is [0-200, max], setting
> swappiness=max exclusively reclaims anonymous memory.
>
> + The zswap_writeback_only key skips ordinary memory reclaim and
> + writes back pages from zswap to the backing swap device until
> + the requested amount has been written or no further candidates
> + are found. This is useful to proactively offload cold pages from
> + the zswap pool to the swap device. It is only available if
> + zswap writeback is enabled. zswap_writeback_only cannot be combined
> + with swappiness; specifying both returns -EINVAL.
> +
> + Example::
> +
> + # Write back up to 100MB of pages from zswap to the backing swap
> + echo "100M zswap_writeback_only" > memory.reclaim

memcg folks need to chime in about the interface here. An alternative
would be a separate interface (e.g. memory.zswap.do_writeback or
memory.zswap.writeback.reclaim or sth).

> diff --git a/mm/zswap.c b/mm/zswap.c
> index 73e64a635690..7bcbf788f634 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -1679,6 +1679,144 @@ int zswap_load(struct folio *folio)
> return 0;
> }
>
> +/*
> + * Maximum LRU scan limit:
> + * number of entries to scan per page of remaining budget.
> + */
> +#define ZSWAP_PROACTIVE_WB_SCAN_RATIO 16UL
> +/*
> + * Batch size for proactive writeback:
> + * - As the per-memcg writeback target in the outer memcg loop.
> + * - As the per-walk budget passed to list_lru_walk_one().
> + */
> +#define ZSWAP_PROACTIVE_WB_BATCH 128UL
> +
> +/*
> + * Walk the per-node LRUs of @memcg to write back up to @nr_to_write pages.
> + * Returns the number of pages written back, or -ENOENT if @memcg is a
> + * zombie or has writeback disabled.
> + */
> +static long zswap_proactive_shrink_memcg(struct mem_cgroup *memcg,
> + unsigned long nr_to_write)
> +{
> + unsigned long nr_written = 0;
> + int nid;
> +
> + if (!mem_cgroup_zswap_writeback_enabled(memcg))
> + return -ENOENT;
> +
> + if (!mem_cgroup_online(memcg))
> + return -ENOENT;
> +
> + for_each_node_state(nid, N_NORMAL_MEMORY) {
> + bool encountered_page_in_swapcache = false;
> + unsigned long nr_to_scan, nr_scanned = 0;
> +
> + /*
> + * Cap by LRU length: bounds rewalks when referenced
> + * entries keep rotating to the tail.
> + */
> + nr_to_scan = list_lru_count_one(&zswap_list_lru, nid, memcg);
> + if (!nr_to_scan)
> + continue;
> +
> + /*
> + * Cap by SCAN_RATIO * remaining budget: bounds scan cost
> + * to the remaining writeback budget.
> + */
> + nr_to_scan = min(nr_to_scan,
> + (nr_to_write - nr_written) * ZSWAP_PROACTIVE_WB_SCAN_RATIO);
> +
> + while (nr_scanned < nr_to_scan) {
> + unsigned long nr_to_walk = min(ZSWAP_PROACTIVE_WB_BATCH,
> + nr_to_scan - nr_scanned);
> +
> + if (signal_pending(current))
> + return nr_written;
> +
> + /*
> + * Account for the committed budget rather than the walker's
> + * actual delta. If the list is emptied concurrently, the
> + * walker visits nothing and nr_scanned would never advance.
> + */
> + nr_scanned += nr_to_walk;
> +
> + nr_written += list_lru_walk_one(&zswap_list_lru, nid, memcg,
> + &shrink_memcg_cb,
> + &encountered_page_in_swapcache,
> + &nr_to_walk);
> +
> + if (nr_written >= nr_to_write)
> + return nr_written;
> + if (encountered_page_in_swapcache)
> + break;
> +
> + cond_resched();
> + }
> + }
> +
> + return nr_written;
> +}
> +
> +int zswap_proactive_writeback(struct mem_cgroup *memcg,
> + unsigned long nr_to_writeback)
> +{
> + struct mem_cgroup *iter_memcg;
> + unsigned long nr_written = 0;
> + int failures = 0, attempts = 0;
> +
> + if (!memcg)
> + return -EINVAL;
> + if (!nr_to_writeback)
> + return 0;
> +
> + /*
> + * Writeback will be aborted with -EAGAIN if we encounter
> + * the following MAX_RECLAIM_RETRIES times:
> + * - No writeback-candidate memcgs found in a subtree walk.
> + * - A writeback-candidate memcg wrote back zero pages.
> + */
> + while (nr_written < nr_to_writeback) {
> + unsigned long batch_size;
> + long shrunk;
> +
> + if (signal_pending(current))
> + return -EINTR;
> +
> + iter_memcg = zswap_mem_cgroup_iter(memcg);
> +
> + if (!iter_memcg) {
> + /*
> + * Continue without incrementing failures if we found
> + * candidate memcgs in the last subtree walk.
> + */
> + if (!attempts && ++failures == MAX_RECLAIM_RETRIES)
> + return -EAGAIN;
> + attempts = 0;
> + continue;
> + }
> +
> + batch_size = min(nr_to_writeback - nr_written,
> + ZSWAP_PROACTIVE_WB_BATCH);
> + shrunk = zswap_proactive_shrink_memcg(iter_memcg, batch_size);
> + mem_cgroup_put(iter_memcg);
> +
> + /* Writeback-disabled or offline: skip without counting. */
> + if (shrunk == -ENOENT)
> + continue;
> +
> + ++attempts;
> + if (shrunk > 0)
> + nr_written += shrunk;
> + else if (++failures == MAX_RECLAIM_RETRIES)
> + return -EAGAIN;
> +
> + cond_resched();
> + }
> +
> + return 0;
> +}
> +

There is a lot of copy+paste from shrink_worker() and shrink_memcg()
here. We really should be able to reuse shrink_memcg().

Is the main difference that we are scanning in batches here? I think we
can have shrink_memcg() do that too. If anything, it might make the
shrinker more efficient. Over-reclaim is ofc a concern, and especially
in the zswap_store() path as the overhead can be noticeable. Maybe we
can parameterize the batch size based on the code path.

Nhat, what do you think?