Re: [PATCH v4 1/5] mm/zswap: Extend shrink_memcg() writeback capability

From: Hao Jia

Date: Wed Jun 24 2026 - 08:01:51 EST




On 2026/6/24 02:17, Yosry Ahmed wrote:
My initial thought was that if cold memory is evenly distributed across
nodes and we are doing a large writeback, it would be better to balance
the zswap entry writeback across all nodes rather than just draining
node 0 first. However, since we currently lack a proper metric to
represent hot/cold memory (such as age-based tracking), doing this
probably doesn't make much sense right now.

Yeah let's start simple and go from there.


So, perhaps we want something like this? Please correct me if I'm wrong.

static long shrink_memcg(struct mem_cgroup *memcg,
unsigned long nr_to_scan)
{
struct zswap_shrink_walk_arg walk_arg = {
.bytes_written = 0,
.encountered_page_in_swapcache = false,
};
unsigned long nr_remaining = nr_to_scan;
bool memcg_list_is_empty = true;
int nid;

if (!mem_cgroup_zswap_writeback_enabled(memcg))
return -ENOENT;

if (memcg && !mem_cgroup_online(memcg))
return -ENOENT;

for_each_node_state(nid, N_NORMAL_MEMORY) {
unsigned long nr_to_walk;

/*
* Cap the per-node scan by the current LRU length. A referenced
* entry is only rotated to the tail (second chance) and may be
* revisited within a single walk; without this cap those rotated
* entries could drain the shared scan budget on one node.
*/

The comment here is a bit misleading. It's not just about draining one
node. One call to shrink_memcg() should only scan entries once. The
caller can then choose to scan the memcg again, or scan a different
one. In this case, the caller should iterate all memcgs first before
retrying memcgs again and reclaiming rotated entries.

I have updated the comment. Please see below.

nr_to_walk = min(nr_remaining,
list_lru_count_one(&zswap_list_lru, nid, memcg));
if (!nr_to_walk)
continue;
memcg_list_is_empty = false;

nr_remaining -= nr_to_walk;
list_lru_walk_one(&zswap_list_lru, nid, memcg,
&shrink_memcg_cb, &walk_arg, &nr_to_walk);
/* Return the unused share of the budget to the pool. */
nr_remaining += nr_to_walk;

/* Bail out once the whole scan budget has been spent. */

The comment is unnecessary.

I'll do this, thanks.

if (!nr_remaining)
break;

cond_resched();

Did you observe a problem here or did you just add this due to an
abundance of caution?

The cond_resched() here was just out of caution. Given that both callers (shrink_worker() and zswap_proactive_writeback()) already have rescheduling checks, I suppose we can remove it from here."

}

if (memcg_list_is_empty)

Do we need memcg_list_is_empty? Can we just check if nr_remaining
matches nr_to_scan?


indeed.
return -ENOENT;

return walk_arg.bytes_written;
}


/*
* Scan up to @nr_to_scan pages across the per-node zswap LRUs of @memcg
* and write back the reclaimable ones.
*
* Since the second-chance algorithm rotates referenced entries to the
* LRU tail, the per-node scan is capped at the current LRU length so
* each entry is scanned at most once per call. It is up to the caller
* to handle retries, deciding whether to scan the next memcg to complete
* the full iteration, or to rescan the current memcg to drain its zswap
* entries.
*
* Return: The number of compressed bytes written back (>= 0), or -ENOENT
* if @memcg has writeback disabled, is a zombie cgroup, or has empty
* zswap LRUs.
*/
static long shrink_memcg(struct mem_cgroup *memcg, unsigned long nr_to_scan)
{
struct zswap_shrink_walk_arg walk_arg = {
.bytes_written = 0,
.encountered_page_in_swapcache = false,
};
unsigned long nr_remaining = nr_to_scan;
int nid;

if (!mem_cgroup_zswap_writeback_enabled(memcg))
return -ENOENT;

/*
* Skip zombies because their LRUs are reparented and we would be
* reclaiming from the parent instead of the dead memcg.
*/
if (memcg && !mem_cgroup_online(memcg))
return -ENOENT;

for_each_node_state(nid, N_NORMAL_MEMORY) {
unsigned long nr_to_walk;

/*
* Cap the walk at the current LRU length to ensure each entry is
* scanned at most once per call. Referenced entries are rotated
* to the tail for a second chance, and this bound prevents them
* from being revisited within a single call. Retries are left to
* the caller, which can choose to rescan the current memcg or
* move on to the next one.
*/
nr_to_walk = min(nr_remaining,
list_lru_count_one(&zswap_list_lru, nid, memcg));
if (!nr_to_walk)
continue;

nr_remaining -= nr_to_walk;
list_lru_walk_one(&zswap_list_lru, nid, memcg, &shrink_memcg_cb,
&walk_arg, &nr_to_walk);
/* Return the unused share of the budget to the pool. */
nr_remaining += nr_to_walk;

if (!nr_remaining)
break;
}

/* Nothing was scanned: every LRU under @memcg was empty. */
if (nr_remaining == nr_to_scan)
return -ENOENT;

return walk_arg.bytes_written;
}


Thanks,
Hao