Re: [PATCH V2 3/4] mm: shrinker: optimize the allocation of shrinker_info when setting cgroup_memory_nokmem
From: Haifeng Xu
Date: Thu Mar 12 2026 - 00:09:50 EST
On 2026/3/12 06:14, Dave Chinner wrote:
> On Tue, Mar 10, 2026 at 11:12:49AM +0800, Haifeng Xu wrote:
>> When kmem is disabled, memcg slab shrink only call non-slab shrinkers,
>> so just allocates shrinker info for non-slab shrinkers to non-root memcgs.
>>
>> Therefore, if memcg_kmem_online is true, all things keep same as before.
>> Otherwise, root memcg allocates id from shrinker_idr to identify each
>> shrinker and non-root memcgs use nonslab_id to identify non-slab shrinkers.
>> The size of shrinkers_info in non-root memcgs can be very low because the
>> number of shrinkers marked as SHRINKER_NONSLAB | SHRINKER_MEMCG_AWARE is
>> few. Also, the time spending in expand_shrinker_info() can reduce a lot.
>>
>> When setting shrinker bit or updating nr_deferred, use nonslab_id for
>> non-root memcgs if the shrinker is marked as SHRINKER_NONSLAB.
>>
>> Signed-off-by: Haifeng Xu <haifeng.xu@xxxxxxxxxx>
>> ---
>> include/linux/memcontrol.h | 8 ++-
>> include/linux/shrinker.h | 3 +
>> mm/shrinker.c | 116 +++++++++++++++++++++++++++++++++----
>> 3 files changed, 114 insertions(+), 13 deletions(-)
>>
>> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
>> index ce7b5101bc02..3edd6211aed2 100644
>> --- a/include/linux/memcontrol.h
>> +++ b/include/linux/memcontrol.h
>> @@ -1804,7 +1804,13 @@ void reparent_shrinker_deferred(struct mem_cgroup *memcg);
>>
>> static inline int shrinker_id(struct mem_cgroup *memcg, struct shrinker *shrinker)
>> {
>> - return shrinker->id;
>> + int id = shrinker->id;
>> +
>> + if (!memcg_kmem_online() && (shrinker->flags & SHRINKER_NONSLAB) &&
>> + memcg != root_mem_cgroup)
>> + id = shrinker->nonslab_id;
>> +
>> + return id;
>> }
>> #else
>> #define mem_cgroup_sockets_enabled 0
>> diff --git a/include/linux/shrinker.h b/include/linux/shrinker.h
>> index 1a00be90d93a..df53008ed8b5 100644
>> --- a/include/linux/shrinker.h
>> +++ b/include/linux/shrinker.h
>> @@ -107,6 +107,9 @@ struct shrinker {
>> #ifdef CONFIG_MEMCG
>> /* ID in shrinker_idr */
>> int id;
>> +
>> + /* ID in shrinker_nonslab_idr */
>> + int nonslab_id;
>> #endif
>> #ifdef CONFIG_SHRINKER_DEBUG
>> int debugfs_id;
>> diff --git a/mm/shrinker.c b/mm/shrinker.c
>> index 61dbb6afae52..68ea2d49495c 100644
>> --- a/mm/shrinker.c
>> +++ b/mm/shrinker.c
>> @@ -12,6 +12,7 @@ DEFINE_MUTEX(shrinker_mutex);
>>
>> #ifdef CONFIG_MEMCG
>> static int shrinker_nr_max;
>> +static int shrinker_nonslab_nr_max;
>>
>> static inline int shrinker_unit_size(int nr_items)
>> {
>> @@ -78,15 +79,25 @@ int alloc_shrinker_info(struct mem_cgroup *memcg)
>> {
>> int nid, ret = 0;
>> int array_size = 0;
>> + int alloc_nr_max;
>> +
>> + if (memcg_kmem_online()) {
>> + alloc_nr_max = shrinker_nr_max;
>> + } else {
>> + if (memcg == root_mem_cgroup)
>> + alloc_nr_max = shrinker_nr_max;
>> + else
>> + alloc_nr_max = shrinker_nonslab_nr_max;
>> + }
>
> What does this do and why does it exist? Why do we need two
> different indexes and tracking structures when memcg is disabled?
>
> If I look at this code outside of this commit context, I have -zero-
> idea of what all this ... complexity does or is needed for.
>
> AFAICT, the code is trying to reduce memcg-aware shrinker
> registration overhead, yes?
>
> If so, please explain where all the overhead is in the first place -
> if there's a time saving of hundreds of seconds in your workload,
> then whatever is causing the overhead is going to show up in CPU
> profiles. What, exactly, is causing all the registration overhead?
>
> i.e. there are lots of workloads that create large numbers of
> containers when memcg is actually enabled, so if registration is
> costly then the right thing to do here is fix the registration
> overhead problem.
>
> Hacking custom logic into the code to avoid the overhead in your
> specific special case so you can ignore the problem is not the way
> we solve problems. We need to solve problems like this in a way that
> benefits -everyone- regardless of whether they are using memcgs or
> not.
>
> So, please identify where all the overhead in memcg shrinker
> registration is, and then we can take steps to improve the
> registration code -for everyone-.
>
> -Dave.
When creating containers, we found many threads stuck waiting on the shrinker
lock on a machine with kmem disabled. The lock was held for a long time while
expanding the shrinker info. As the number of containers increases, the lock
hold time grows from a few milliseconds to over one hundred milliseconds.
The call stack below is from stable kernel 6.6.102:
PID: 4462 TASK: ffff8eff5ca0b500 CPU: 79 COMMAND: "runc:[2:INIT]"
#0 [ffffc9005b213b10] __schedule at ffffffffa3ad84c0
#1 [ffffc9005b213bb8] schedule at ffffffffa3ad8988
#2 [ffffc9005b213bd8] schedule_preempt_disabled at ffffffffa3ad8bae
#3 [ffffc9005b213be8] rwsem_down_write_slowpath at ffffffffa3adcc5e
#4 [ffffc9005b213ca8] down_write at ffffffffa3adcf3c
#5 [ffffc9005b213cc0] __prealloc_shrinker at ffffffffa2db3bf0
#6 [ffffc9005b213d08] prealloc_shrinker at ffffffffa2db9e0e
#7 [ffffc9005b213d18] alloc_super at ffffffffa2ebec49
#8 [ffffc9005b213d48] sget_fc at ffffffffa2ebff48
#9 [ffffc9005b213d88] get_tree_nodev at ffffffffa2ec0578
#10 [ffffc9005b213dc0] shmem_get_tree at ffffffffa2dbf275
#11 [ffffc9005b213dd0] vfs_get_tree at ffffffffa2ebe6a7
#12 [ffffc9005b213df8] do_new_mount at ffffffffa2ef2250
#13 [ffffc9005b213e50] path_mount at ffffffffa2ef2eb0
#14 [ffffc9005b213eb8] __x64_sys_mount at ffffffffa2ef3617
#15 [ffffc9005b213f08] x64_sys_call at ffffffffa2a07488
#16 [ffffc9005b213f18] do_syscall_64 at ffffffffa3ac9e36
RIP: 00007f726fe48eee RSP: 000000c00019b3e8 RFLAGS: 00000206
RAX: ffffffffffffffda RBX: 000000c00020a8e6 RCX: 00007f726fe48eee
RDX: 000000c00020a8f0 RSI: 000000c000216ed0 RDI: 000000c00020a8e6
RBP: 000000c00019b428 R8: 0000000000000000 R9: 0000000000000000
R10: 0000000000000001 R11: 0000000000000206 R12: 000000c000216ed0
R13: 00000000000000aa R14: 000000c000002380 R15: 0000000000000000
ORIG_RAX: 00000000000000a5 CS: 0033 SS: 002b
We used the perf tool to record the CPU consumption; the result is attached.
From the flame graph, we can see that clear_page_erms() and memcpy() in
expand_one_shrinker_info() are the main sources of overhead.
Therefore, the more shrinkers and memcgs exist, the longer expanding the
shrinker info takes. This is because expanding the shrinker info traverses
all memcgs and records all shrinkers for each of them.
However, with kmem disabled, memcg slab shrink calls only the non-slab
shrinkers; that is to say, we only need to record non-slab shrinkers for
non-root memcgs. For the root memcg we still need to record all shrinkers,
because global reclaim calls all of them.
To reduce the allocation size and the lock hold time, introduce a new IDR
that allocates a nonslab_id for non-slab shrinkers. The nonslab_id is used
only in non-root memcgs to record non-slab shrinkers. Currently only one
shrinker (deferred_split_shrinker) is marked
SHRINKER_NONSLAB | SHRINKER_MEMCG_AWARE, so registering any other shrinker
won't trigger an expansion of the shrinker info in non-root memcgs. Only the
root memcg needs to allocate an id from shrinker_idr and perform the
expansion check for all shrinkers. The number of memcgs that need to expand
their shrinker info drops from n (all memcgs) to one (the root memcg).
Attachment:
perf.svg
Description: image/svg