Re: [PATCH v5 9/9] mm: switch deferred split shrinker to list_lru

From: Usama Arif

Date: Thu May 28 2026 - 09:38:39 EST

On 27/05/2026 21:45, Johannes Weiner wrote:
> The deferred split queue handles cgroups in a suboptimal fashion. The
> queue is per-NUMA node or per-cgroup, not the intersection. That means
> on a cgrouped system, a node-restricted allocation entering reclaim
> can end up splitting large pages on other nodes:
>
>   alloc/unmap
>     deferred_split_folio()
>       list_add_tail(memcg->split_queue)
>       set_shrinker_bit(memcg, node, deferred_shrinker_id)
>
>   for_each_zone_zonelist_nodemask(restricted_nodes)
>     mem_cgroup_iter()
>       shrink_slab(node, memcg)
>         shrink_slab_memcg(node, memcg)
>           if test_shrinker_bit(memcg, node, deferred_shrinker_id)
>             deferred_split_scan()
>               walks memcg->split_queue
>
> The shrinker bit adds an imperfect guard rail. As soon as the cgroup
> has a single large page on the node of interest, all large pages owned
> by that memcg, including those on other nodes, will be split.
>
> list_lru properly sets up per-node, per-cgroup lists. As a bonus, it
> streamlines a lot of the list operations and reclaim walks. It's used
> widely by other major shrinkers already. Convert the deferred split
> queue as well.
>
> The list_lru per-memcg heads are instantiated on demand when the first
> object of interest is allocated for a cgroup, by calling
> folio_memcg_alloc_deferred(). Add calls to where splittable pages are
> created: anon faults, swapin faults, khugepaged collapse.
>
> These calls create all possible node heads for the cgroup at once, so
> the migration code (between nodes) doesn't need any special care.
>
> Reported-by: Mikhail Zaslonko <zaslonko@xxxxxxxxxxxxx>
> Tested-by: Mikhail Zaslonko <zaslonko@xxxxxxxxxxxxx>
> Acked-by: Shakeel Butt <shakeel.butt@xxxxxxxxx>
> Reviewed-by: Lorenzo Stoakes (Oracle) <ljs@xxxxxxxxxx>
> Signed-off-by: Johannes Weiner <hannes@xxxxxxxxxxx>
> ---
> include/linux/huge_mm.h | 7 +-
> include/linux/memcontrol.h | 4 -
> include/linux/mmzone.h | 12 --
> mm/huge_memory.c | 364 +++++++++++++------------------------
> mm/internal.h | 2 +-
> mm/khugepaged.c | 5 +
> mm/memcontrol.c | 12 +-
> mm/memory.c | 4 +
> mm/mm_init.c | 15 --
> mm/swap_state.c | 10 +
> 10 files changed, 150 insertions(+), 285 deletions(-)
>

[...]

> diff --git a/mm/memory.c b/mm/memory.c
> index 135f5c0f57bd..f22e61d8c8de 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -5222,6 +5222,10 @@ static struct folio *alloc_anon_folio(struct vm_fault *vmf)
> folio_put(folio);
> goto next;
> }
> + if (order > 1 && folio_memcg_alloc_deferred(folio)) {
> + folio_put(folio);

Ah sorry, I should have caught this in the previous version. Do we need

count_mthp_stat(order, MTHP_STAT_ANON_FAULT_FALLBACK);

here?

Or maybe we should just goto next instead of goto fallback, and try the
next viable order?

> + goto fallback;
> + }
> folio_throttle_swaprate(folio, gfp);
> /*
> * When a folio is not zeroed during allocation
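
To make the two options for the folio_memcg_alloc_deferred() failure
path concrete, here is a rough userspace toy model. It is not kernel
code, and every name in it (alloc_model, deferred_setup_fails, the
policy enum) is made up for illustration. It assumes the next: label is
where the per-order fallback event gets counted before a lower order is
tried, and that fallback: goes straight to an order-0 page.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy userspace model, NOT kernel code. All names are hypothetical. */

enum policy {
	GOTO_FALLBACK,	/* give up on large folios right away */
	GOTO_NEXT,	/* count a fallback, then try the next lower order */
};

struct result {
	int order;		/* order finally used; 0 means small page */
	int fallback_count;	/* per-order fallback events counted */
};

/* Pretend the deferred-split list setup fails for exactly one order. */
static bool deferred_setup_fails(int order, int failing_order)
{
	return order == failing_order;
}

/* 'orders' lists the candidate orders, highest first. */
struct result alloc_model(const int *orders, size_t n, int failing_order,
			  enum policy p)
{
	struct result r = { 0, 0 };

	assert(orders != NULL);

	for (size_t i = 0; i < n; i++) {
		int order = orders[i];

		if (order > 1 && deferred_setup_fails(order, failing_order)) {
			if (p == GOTO_FALLBACK)
				return r;	/* order 0, nothing counted */
			r.fallback_count++;	/* models the assumed "next:" */
			continue;
		}
		r.order = order;
		return r;
	}
	return r;	/* no order worked: small-page fallback */
}
```

With candidate orders {4, 3, 2} and a setup failure at order 4 only, the
GOTO_FALLBACK policy ends up at order 0 with no fallback counted, while
GOTO_NEXT counts one fallback and settles on order 3.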