Re: [PATCH v5 9/9] mm: switch deferred split shrinker to list_lru

From: Kairui Song

Date: Fri May 29 2026 - 14:08:29 EST


On Thu, May 28, 2026 at 4:48 AM Johannes Weiner <hannes@xxxxxxxxxxx> wrote:
>
> The deferred split queue handles cgroups in a suboptimal fashion. The
> queue is per-NUMA node or per-cgroup, not the intersection. That means
> on a cgrouped system, a node-restricted allocation entering reclaim
> can end up splitting large pages on other nodes:
>
>   alloc/unmap
>     deferred_split_folio()
>       list_add_tail(memcg->split_queue)
>       set_shrinker_bit(memcg, node, deferred_shrinker_id)
>
>   for_each_zone_zonelist_nodemask(restricted_nodes)
>     mem_cgroup_iter()
>       shrink_slab(node, memcg)
>         shrink_slab_memcg(node, memcg)
>           if test_shrinker_bit(memcg, node, deferred_shrinker_id)
>             deferred_split_scan()
>               walks memcg->split_queue
>
> The shrinker bit adds an imperfect guard rail. As soon as the cgroup
> has a single large page on the node of interest, all large pages owned
> by that memcg, including those on other nodes, will be split.
>
> list_lru properly sets up per-node, per-cgroup lists. As a bonus, it
> streamlines a lot of the list operations and reclaim walks. It's used
> widely by other major shrinkers already. Convert the deferred split
> queue as well.
>
> The list_lru per-memcg heads are instantiated on demand when the first
> object of interest is allocated for a cgroup, by calling
> folio_memcg_alloc_deferred(). Add calls to where splittable pages are
> created: anon faults, swapin faults, khugepaged collapse.
>
> These calls create all possible node heads for the cgroup at once, so
> the migration code (between nodes) doesn't need any special care.
>
> Reported-by: Mikhail Zaslonko <zaslonko@xxxxxxxxxxxxx>
> Tested-by: Mikhail Zaslonko <zaslonko@xxxxxxxxxxxxx>
> Acked-by: Shakeel Butt <shakeel.butt@xxxxxxxxx>
> Reviewed-by: Lorenzo Stoakes (Oracle) <ljs@xxxxxxxxxx>
> Signed-off-by: Johannes Weiner <hannes@xxxxxxxxxxx>
> ---
>  include/linux/huge_mm.h    |   7 +-
>  include/linux/memcontrol.h |   4 -
>  include/linux/mmzone.h     |  12 --
>  mm/huge_memory.c           | 364 +++++++++++++------------------------
>  mm/internal.h              |   2 +-
>  mm/khugepaged.c            |   5 +
>  mm/memcontrol.c            |  12 +-
>  mm/memory.c                |   4 +
>  mm/mm_init.c               |  15 --
>  mm/swap_state.c            |  10 +
>  10 files changed, 150 insertions(+), 285 deletions(-)

...

> diff --git a/mm/swap_state.c b/mm/swap_state.c
> index 04f5ce992401..9c3a5cf99778 100644
> --- a/mm/swap_state.c
> +++ b/mm/swap_state.c
> @@ -465,6 +465,16 @@ static struct folio *__swap_cache_alloc(struct swap_cluster_info *ci,
>  		return ERR_PTR(-ENOMEM);
>  	}
>
> +	if (order > 1 && folio_memcg_alloc_deferred(folio)) {
> +		spin_lock(&ci->lock);
> +		__swap_cache_do_del_folio(ci, folio, entry, shadow);
> +		spin_unlock(&ci->lock);
> +		folio_unlock(folio);
> +		/* nr_pages refs from swap cache, 1 from allocation */
> +		folio_put_refs(folio, nr_pages + 1);
> +		return ERR_PTR(-ENOMEM);
> +	}
> +

Thanks!

Nit: I think it would be better to move the error handling under a
label shared with the charge failure above, but it's fine this way
and can be simplified later.

Reviewed-by: Kairui Song <kasong@xxxxxxxxxxx>