Re: [PATCH v5 9/9] mm: switch deferred split shrinker to list_lru

From: Johannes Weiner

Date: Thu May 28 2026 - 10:12:47 EST

On Thu, May 28, 2026 at 02:32:06PM +0100, Usama Arif wrote:
>
>
> On 27/05/2026 21:45, Johannes Weiner wrote:
> > The deferred split queue handles cgroups in a suboptimal fashion. The
> > queue is per-NUMA node or per-cgroup, not the intersection. That means
> > on a cgrouped system, a node-restricted allocation entering reclaim
> > can end up splitting large pages on other nodes:
> >
> >   alloc/unmap
> >     deferred_split_folio()
> >       list_add_tail(memcg->split_queue)
> >       set_shrinker_bit(memcg, node, deferred_shrinker_id)
> >
> >   for_each_zone_zonelist_nodemask(restricted_nodes)
> >     mem_cgroup_iter()
> >       shrink_slab(node, memcg)
> >         shrink_slab_memcg(node, memcg)
> >           if test_shrinker_bit(memcg, node, deferred_shrinker_id)
> >             deferred_split_scan()
> >               walks memcg->split_queue
> >
> > The shrinker bit adds an imperfect guard rail. As soon as the cgroup
> > has a single large page on the node of interest, all large pages owned
> > by that memcg, including those on other nodes, will be split.
> >
> > list_lru properly sets up per-node, per-cgroup lists. As a bonus, it
> > streamlines a lot of the list operations and reclaim walks. It's used
> > widely by other major shrinkers already. Convert the deferred split
> > queue as well.
> >
> > The list_lru per-memcg heads are instantiated on demand when the first
> > object of interest is allocated for a cgroup, by calling
> > folio_memcg_alloc_deferred(). Add calls to where splittable pages are
> > created: anon faults, swapin faults, khugepaged collapse.
> >
> > These calls create all possible node heads for the cgroup at once, so
> > the migration code (between nodes) doesn't need any special care.
> >
> > Reported-by: Mikhail Zaslonko <zaslonko@xxxxxxxxxxxxx>
> > Tested-by: Mikhail Zaslonko <zaslonko@xxxxxxxxxxxxx>
> > Acked-by: Shakeel Butt <shakeel.butt@xxxxxxxxx>
> > Reviewed-by: Lorenzo Stoakes (Oracle) <ljs@xxxxxxxxxx>
> > Signed-off-by: Johannes Weiner <hannes@xxxxxxxxxxx>
> > ---
> > include/linux/huge_mm.h | 7 +-
> > include/linux/memcontrol.h | 4 -
> > include/linux/mmzone.h | 12 --
> > mm/huge_memory.c | 364 +++++++++++++------------------------
> > mm/internal.h | 2 +-
> > mm/khugepaged.c | 5 +
> > mm/memcontrol.c | 12 +-
> > mm/memory.c | 4 +
> > mm/mm_init.c | 15 --
> > mm/swap_state.c | 10 +
> > 10 files changed, 150 insertions(+), 285 deletions(-)
> >
>
> [...]
>
> > diff --git a/mm/memory.c b/mm/memory.c
> > index 135f5c0f57bd..f22e61d8c8de 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -5222,6 +5222,10 @@ static struct folio *alloc_anon_folio(struct vm_fault *vmf)
> > folio_put(folio);
> > goto next;
> > }
> > + if (order > 1 && folio_memcg_alloc_deferred(folio)) {
> > + folio_put(folio);
>
> Ah sorry, should have caught this in the previous version, do we need
>
> count_mthp_stat(order, MTHP_STAT_ANON_FAULT_FALLBACK);
>
> here?

This isn't an allocation we expect to fail with any regularity, so
there is no need to capture it in the event counter. It would warn in
dmesg if it did. But in practice it can't happen at all, since it's a
sub-costly-order slab allocation and the allocator would loop and OOM
kill stuff until it succeeds.

> or maybe we just goto next instead of goto fallback and try the next
> viable order?

Again I don't think it matters, but fallback seems a bit more correct
because the size of the list_lru allocation doesn't change with lower
orders (until we hit 0).

So I think we can just leave it as is.
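
For anyone following along, the node/cgroup mismatch described in the
quoted changelog can be illustrated with a toy userspace model. This is
only a sketch, not kernel code: toy_folio, scan_per_memcg_queue() and
scan_per_node_memcg_list() are made-up names, and a flat array stands in
for both the old per-memcg split queue and the per-node, per-cgroup
lists that list_lru provides.

```c
#include <assert.h>
#include <stddef.h>

#define NR_NODES  2
#define NR_MEMCGS 2

/* Toy stand-in for a large folio; only its location and owner matter. */
struct toy_folio {
	int nid;	/* NUMA node the folio lives on */
	int memcg;	/* index of the owning cgroup */
	int split;	/* set once the toy "shrinker" has split it */
};

/*
 * Old scheme (simplified): one queue per memcg, guarded by a
 * per-(memcg, node) shrinker bit. Once the bit for @nid is set, the
 * scan walks the whole memcg queue with no node filter.
 * Returns the number of folios split.
 */
static int scan_per_memcg_queue(struct toy_folio *folios, size_t n,
				int nid, int memcg)
{
	int bit = 0, nr_split = 0;
	size_t i;

	assert(nid >= 0 && nid < NR_NODES);
	assert(memcg >= 0 && memcg < NR_MEMCGS);

	/* Models the shrinker bit: any queued folio on @nid sets it. */
	for (i = 0; i < n; i++)
		if (folios[i].memcg == memcg && folios[i].nid == nid &&
		    !folios[i].split)
			bit = 1;
	if (!bit)
		return 0;

	/* The walk covers every folio of the memcg, on any node. */
	for (i = 0; i < n; i++) {
		if (folios[i].memcg == memcg && !folios[i].split) {
			folios[i].split = 1;
			nr_split++;
		}
	}
	return nr_split;
}

/*
 * list_lru-style scheme (simplified): one list per (node, memcg)
 * pair, so the scan only sees folios that match both.
 * Returns the number of folios split.
 */
static int scan_per_node_memcg_list(struct toy_folio *folios, size_t n,
				    int nid, int memcg)
{
	int nr_split = 0;
	size_t i;

	assert(nid >= 0 && nid < NR_NODES);
	assert(memcg >= 0 && memcg < NR_MEMCGS);

	for (i = 0; i < n; i++) {
		if (folios[i].memcg == memcg && folios[i].nid == nid &&
		    !folios[i].split) {
			folios[i].split = 1;
			nr_split++;
		}
	}
	return nr_split;
}
```

With one folio of cgroup 0 on node 0 and two more on node 1, reclaim
restricted to node 0 splits all three in the first model and only the
node-0 folio in the second.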