Re: [PATCH v5 9/9] mm: switch deferred split shrinker to list_lru

From: Usama Arif

Date: Thu May 28 2026 - 11:58:15 EST

On 28/05/2026 15:02, Johannes Weiner wrote:
> On Thu, May 28, 2026 at 02:32:06PM +0100, Usama Arif wrote:
>>
>>
>> On 27/05/2026 21:45, Johannes Weiner wrote:
>>> The deferred split queue handles cgroups in a suboptimal fashion. The
>>> queue is per-NUMA node or per-cgroup, not the intersection. That means
>>> on a cgrouped system, a node-restricted allocation entering reclaim
>>> can end up splitting large pages on other nodes:
>>>
>>>   alloc/unmap
>>>     deferred_split_folio()
>>>       list_add_tail(memcg->split_queue)
>>>       set_shrinker_bit(memcg, node, deferred_shrinker_id)
>>>
>>>   for_each_zone_zonelist_nodemask(restricted_nodes)
>>>     mem_cgroup_iter()
>>>       shrink_slab(node, memcg)
>>>         shrink_slab_memcg(node, memcg)
>>>           if test_shrinker_bit(memcg, node, deferred_shrinker_id)
>>>             deferred_split_scan()
>>>               walks memcg->split_queue
>>>
>>> The shrinker bit adds an imperfect guard rail. As soon as the cgroup
>>> has a single large page on the node of interest, all large pages owned
>>> by that memcg, including those on other nodes, will be split.
>>>
>>> list_lru properly sets up per-node, per-cgroup lists. As a bonus, it
>>> streamlines a lot of the list operations and reclaim walks. It's used
>>> widely by other major shrinkers already. Convert the deferred split
>>> queue as well.
>>>
>>> The list_lru per-memcg heads are instantiated on demand when the first
>>> object of interest is allocated for a cgroup, by calling
>>> folio_memcg_alloc_deferred(). Add calls to where splittable pages are
>>> created: anon faults, swapin faults, khugepaged collapse.
>>>
>>> These calls create all possible node heads for the cgroup at once, so
>>> the migration code (between nodes) doesn't need any special care.
>>>
>>> Reported-by: Mikhail Zaslonko <zaslonko@xxxxxxxxxxxxx>
>>> Tested-by: Mikhail Zaslonko <zaslonko@xxxxxxxxxxxxx>
>>> Acked-by: Shakeel Butt <shakeel.butt@xxxxxxxxx>
>>> Reviewed-by: Lorenzo Stoakes (Oracle) <ljs@xxxxxxxxxx>
>>> Signed-off-by: Johannes Weiner <hannes@xxxxxxxxxxx>
>>> ---
>>> include/linux/huge_mm.h | 7 +-
>>> include/linux/memcontrol.h | 4 -
>>> include/linux/mmzone.h | 12 --
>>> mm/huge_memory.c | 364 +++++++++++++------------------------
>>> mm/internal.h | 2 +-
>>> mm/khugepaged.c | 5 +
>>> mm/memcontrol.c | 12 +-
>>> mm/memory.c | 4 +
>>> mm/mm_init.c | 15 --
>>> mm/swap_state.c | 10 +
>>> 10 files changed, 150 insertions(+), 285 deletions(-)
>>>
>>
>> [...]
>>
>>> diff --git a/mm/memory.c b/mm/memory.c
>>> index 135f5c0f57bd..f22e61d8c8de 100644
>>> --- a/mm/memory.c
>>> +++ b/mm/memory.c
>>> @@ -5222,6 +5222,10 @@ static struct folio *alloc_anon_folio(struct vm_fault *vmf)
>>> folio_put(folio);
>>> goto next;
>>> }
>>> + if (order > 1 && folio_memcg_alloc_deferred(folio)) {
>>> + folio_put(folio);
>>
>> Ah sorry, should have caught this in the previous version, do we need
>>
>> count_mthp_stat(order, MTHP_STAT_ANON_FAULT_FALLBACK);
>>
>> here?
>
> This isn't an allocation we expect to fail with any sort of routine
> that we'd need to capture it in the event counter. It would warn in
> dmesg if it did. But in practice it can't happen at all, since it's a
> sub-costly-order slab allocation and the allocator would loop and OOM
> kill stuff until it succeeds.
>
>> or maybe we just goto next instead of goto fallback and try next
>> viable order?
>
> Again I don't think it matters, but fallback seems a bit more correct
> because the size of the list_lru allocation doesn't change with lower
> orders (until we hit 0).
>
> So I think we can just leave it as is.

Ack!
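
For the archive, here is a rough userspace sketch of the two options
discussed above ("goto next" vs "goto fallback"). It is only a model, not
kernel code: struct fault_model, pick_order() and MAX_ORDER_MODEL are
made-up names, and the allocation outcomes are plain flags rather than real
allocator calls. The only thing taken from the patch is the "order > 1"
check in front of the list_lru head allocation.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model; none of these names exist in the kernel. */
enum { MAX_ORDER_MODEL = 9 };

struct fault_model {
	unsigned long viable_orders;  /* bitmask of orders the fault may try */
	unsigned long page_alloc_ok;  /* bitmask of orders whose page alloc succeeds */
	bool lru_alloc_ok;            /* does the list_lru head allocation succeed? */
	bool retry_lower_on_lru_fail; /* true: "goto next", false: "goto fallback" */
};

/*
 * Walk the viable orders from highest to lowest and return the order that
 * ends up being used (0 means the base-page fallback). *lru_attempts counts
 * how often the list_lru head allocation was attempted.
 */
static int pick_order(const struct fault_model *m, int *lru_attempts)
{
	assert(m && lru_attempts);
	*lru_attempts = 0;

	for (int order = MAX_ORDER_MODEL; order > 0; order--) {
		if (!(m->viable_orders & (1UL << order)))
			continue;
		if (!(m->page_alloc_ok & (1UL << order)))
			continue;	/* page allocation failed: try next order */

		if (order > 1) {
			/* Same-sized allocation regardless of the folio order. */
			(*lru_attempts)++;
			if (!m->lru_alloc_ok) {
				if (m->retry_lower_on_lru_fail)
					continue;	/* "goto next" */
				return 0;		/* "goto fallback" */
			}
		}
		return order;
	}
	return 0;
}
```

With orders 9, 4 and 2 viable and the list_lru allocation failing, the
"goto next" variant repeats the same failing allocation three times before
ending up at order 0 anyway, while "goto fallback" gets there after one
attempt. That matches the point that the size of the list_lru allocation
doesn't change with lower orders.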

Acked-by: Usama Arif <usama.arif@xxxxxxxxx>