Re: [PATCH v5 9/9] mm: switch deferred split shrinker to list_lru
From: Wei Yang
Date: Sun May 31 2026 - 04:02:54 EST
On Wed, May 27, 2026 at 04:45:16PM -0400, Johannes Weiner wrote:
>The deferred split queue handles cgroups in a suboptimal fashion. The
>queue is per-NUMA node or per-cgroup, not the intersection. That means
>on a cgrouped system, a node-restricted allocation entering reclaim
>can end up splitting large pages on other nodes:
>
>	alloc/unmap
>	  deferred_split_folio()
>	    list_add_tail(memcg->split_queue)
>	    set_shrinker_bit(memcg, node, deferred_shrinker_id)
>
>	for_each_zone_zonelist_nodemask(restricted_nodes)
>	  mem_cgroup_iter()
>	    shrink_slab(node, memcg)
>	      shrink_slab_memcg(node, memcg)
>	        if test_shrinker_bit(memcg, node, deferred_shrinker_id)
>	          deferred_split_scan()
>	            walks memcg->split_queue
>
>The shrinker bit adds an imperfect guard rail. As soon as the cgroup
>has a single large page on the node of interest, all large pages owned
>by that memcg, including those on other nodes, will be split.
>
>list_lru properly sets up per-node, per-cgroup lists. As a bonus, it
>streamlines a lot of the list operations and reclaim walks. It's used
>widely by other major shrinkers already. Convert the deferred split
>queue as well.
>
>The list_lru per-memcg heads are instantiated on demand when the first
>object of interest is allocated for a cgroup, by calling
>folio_memcg_alloc_deferred(). Add calls to where splittable pages are
>created: anon faults, swapin faults, khugepaged collapse.
>
>These calls create all possible node heads for the cgroup at once, so
>the migration code (between nodes) doesn't need any special care.
>
>Reported-by: Mikhail Zaslonko <zaslonko@xxxxxxxxxxxxx>
>Tested-by: Mikhail Zaslonko <zaslonko@xxxxxxxxxxxxx>
>Acked-by: Shakeel Butt <shakeel.butt@xxxxxxxxx>
>Reviewed-by: Lorenzo Stoakes (Oracle) <ljs@xxxxxxxxxx>
>Signed-off-by: Johannes Weiner <hannes@xxxxxxxxxxx>
>---
> include/linux/huge_mm.h | 7 +-
> include/linux/memcontrol.h | 4 -
> include/linux/mmzone.h | 12 --
> mm/huge_memory.c | 364 +++++++++++++------------------------
> mm/internal.h | 2 +-
> mm/khugepaged.c | 5 +
> mm/memcontrol.c | 12 +-
> mm/memory.c | 4 +
> mm/mm_init.c | 15 --
> mm/swap_state.c | 10 +
> 10 files changed, 150 insertions(+), 285 deletions(-)
>
[...]
>@@ -1379,6 +1285,14 @@ static struct folio *vma_alloc_anon_folio_pmd(struct vm_area_struct *vma,
> 		count_mthp_stat(order, MTHP_STAT_ANON_FAULT_FALLBACK_CHARGE);
> 		return NULL;
> 	}
>+
>+	if (folio_memcg_alloc_deferred(folio)) {
>+		folio_put(folio);
>+		count_vm_event(THP_FAULT_FALLBACK);
>+		count_mthp_stat(order, MTHP_STAT_ANON_FAULT_FALLBACK);
>+		return NULL;
>+	}
>+
Nit: we now have three possible failure points in this function, with
some duplicated count_vm_event()/count_mthp_stat() calls.
Maybe we can have a follow-up cleanup for it.
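
To make the idea concrete, here is a rough userspace sketch of one way
such a cleanup could be shaped: shared exit labels so the common
fallback accounting is done once. This is not kernel code and not a
proposed patch. The model_*() helpers, the counters[] array and the
fail_* knobs are hypothetical stand-ins, and the first two failure
points (allocation and charge) are assumed from the context lines,
since the quoted hunk only shows the tail of the charge branch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-ins for the vm event / mTHP stat counters. */
enum { FALLBACK, FALLBACK_CHARGE, NR_COUNTERS };
static unsigned long counters[NR_COUNTERS];

/* Minimal placeholder type; the real struct folio is far richer. */
struct folio { int unused; };

/* Knobs that force each of the three failure points in this model. */
static bool fail_alloc, fail_charge, fail_deferred;

static struct folio *model_alloc_folio(void)
{
	return fail_alloc ? NULL : malloc(sizeof(struct folio));
}

static int model_charge(struct folio *folio)
{
	(void)folio;
	return fail_charge ? -1 : 0;
}

static int model_alloc_deferred(struct folio *folio)
{
	(void)folio;
	return fail_deferred ? -1 : 0;
}

static void model_folio_put(struct folio *folio)
{
	assert(folio != NULL);
	free(folio);
}

/*
 * All three failure points funnel into shared labels, so the common
 * fallback counter is bumped in exactly one place. Only the charge
 * failure adds its extra counter before jumping.
 */
static struct folio *model_alloc_anon_folio(void)
{
	struct folio *folio = model_alloc_folio();

	if (!folio)
		goto fallback;

	if (model_charge(folio)) {
		counters[FALLBACK_CHARGE]++;
		goto put_fallback;
	}

	if (model_alloc_deferred(folio))
		goto put_fallback;

	return folio;

put_fallback:
	model_folio_put(folio);
fallback:
	counters[FALLBACK]++;
	return NULL;
}
```

Whether goto labels or a small helper reads better in the real
function is a matter of taste; the sketch only shows that the three
paths can share their accounting.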
Otherwise, looks good. Thanks.
--
Wei Yang
Help you, Help me