Re: [PATCH v2] cgroup: avoid css_set_lock in cgroup_css_set_fork()

From: Mateusz Guzik

Date: Tue Feb 10 2026 - 06:19:51 EST


On Tue, Feb 10, 2026 at 11:43 AM Michal Koutný <mkoutny@xxxxxxxx> wrote:
>
> Hello Mateusz.
>

Ouch, terribly sorry for the "hurry up and wait". Real life suddenly got
in the way and I have not looked into this since.

> On Thu, Jan 29, 2026 at 02:22:32PM +0100, Michal Koutný <mkoutny@xxxxxxxx> wrote:
> > And I'm wondering whether removal only in cgroup_css_set_fork() improves
> > parallelism because the tasks (before patching) are queued on the first
> > css_set_lock, serialized through the first critical section and when
> > they arrive to the second critical section in cgroup_post_fork() their
> > arrival rate is already reduced because they had to pass through the
> > first critical section. Hence the 2nd pass through the critical section
> > should be less contended (w/out waiting).
>

It improves parallelism because the total lock hold time goes down.

First, there is a little less work to do under the lock in the first
place, even absent any contention.

Second, there is less total overhead in terms of bouncing the lock and
the cachelines used by the code protected by it. Note that any
contention means the bouncing is already happening.

You can see the second effect in my patch, which does not reduce the
amount of work per se, but merely avoids a case where someone halfway
through alloc_pid has to wait.

Ignoring some single-threaded overhead from the atomics in the rwlock,
I very much expect scalability to be about the same as with the
seqlock, but only because of the bottlenecks elsewhere.

While I don't understand why you would go for an rwlock here, I'm not
going to protest -- it still moves the css_set lock out of the picture.

> I was still curious about this, so I tried own measurement.
> I ran your clone'ing will-it-scale testcase [1].
> Basically it was
> clone_processes -s 1000 -t 40
> on a 40 CPUs/80 SMTs machine.
> I watched for the `total:` iteration counts reported by wis
> periodically.
>
> 6.18.8-0-default (baseline := stable + pidmap patches [2][3])
> 2.9383e+05 ± 1135.5
>
> 6.18.8-1.g886f4c4-default (baseline + rwlock impl (previous message))
> 2.9363e+05 ± 1219.8
>
> 6.18.8-1.gb21e8f8-default (baseline + seqcount impl (your patch))
> 2.9147e+05 ± 1125.6
>
> So I could not reproduce any non-random change with this css_set_lock
> split (I consider even the apparent difference between implementations
> rather random).

This is going to depend on the scale you test at. I was testing at
south of 32 CPUs. But I also only got a minuscule win from removing the
css_set lock, as that was not the problem for me; instead everything
shifted to the tasklist lock.

Per my other e-mail, the tasklist lock retains the terrible
three-acquisitions-per-fork pattern, and rather expensive work is done
while holding it. It is plausible that it happens to be at the top at
that scale, but that's only an argument for fixing it. Even if you
don't see the css_set lock at the top at the moment, it will be there
once someone(tm) sorts out the tasklist problem.

>
> At this point, I should look into profiles whether the bottleneck is
> really css_set_lock in cgroup_post_fork() but I'm sharing what I have,
> glad for your possible insights.
>
> Regards,
> Michal
>
> [1] Only clone_process variant, clone_threads randomly hung.
> will-it-scale/glibc (2.42-3.1) likely doesn't work well with the
> cancellation/(no) join (but I got hangs even with pthread cleanup
> handlers that joined the child thread)
>
> #0 futex_wait (futex_word=0x7ffff7ffd840 <_rtld_local+2112>, expected=2, private=0) at ../sysdeps/nptl/futex-internal.h:146
> #1 __GI___lll_lock_wait_private (futex=0x7ffff7ffd840 <_rtld_local+2112>) at lowlevellock.c:34
> #2 0x00007ffff7c98d69 in __GI___nptl_deallocate_stack (pd=0x7ffff7ab16c0) at nptl-stack.c:113
> ...
> #5 0x00000000004029ca in kill_tasks () at main.c:151
>
> [2] https://lore.kernel.org/linux-mm/20251206131955.780557-1-mjguzik@xxxxxxxxx/
> [3] Those patches improved the metric by some 10% (but I haven't
> measured this difference so thoroughly).