Re: [PATCH v2] cgroup: avoid css_set_lock in cgroup_css_set_fork()
From: Mateusz Guzik
Date: Wed Mar 11 2026 - 10:42:19 EST
So I booted up a vm with 80 hw threads and the cgroup lock is still
top of the profile for me when rolling with ./threadspawn1_processes
-t 80.

While I prefer my patch on the grounds it reduces overhead to begin
with (fewer locking trips), I won't argue against yours. My primary
goal here is to get cgroups out of the way.

Or to put it differently, can you either ack my patch or push yours?
On Tue, Feb 10, 2026 at 6:33 PM Mateusz Guzik <mjguzik@xxxxxxxxx> wrote:
>
> On Tue, Feb 10, 2026 at 5:55 PM Michal Koutný <mkoutny@xxxxxxxx> wrote:
> >
> > On Tue, Feb 10, 2026 at 12:19:27PM +0100, Mateusz Guzik <mjguzik@xxxxxxxxx> wrote:
> > > This is going to depend on the scale you test on. I was testing on
> > > south of 32. But I also got a minuscule win from removing css set lock
> > > as the problem for me, instead everything shifted to tasklist.
> >
> > To be on the same page -- that means you have nr_cpus >= 32?
> >
>
> south means less
>
> > > Per my other e-mail tasklist lock retains the terrible 3-times locking
> > > and it is doing rather expensive work while holding it. It is
> > > plausible it happens to be at the top at that scale, but that's only
> > > an argument for fixing it. Even if you don't see the css thing at the
> > > top at the moment, it will be there once someone(tm) sorts out the
> > > tasklist problem.
> >
> > I did a quick test (with 6.18.8-1.g886f4c4-default), first `perf top`
> > while will-it-scale was running:
>
> I don't know what this hash corresponds to.
>
> >
> > 74.23% [kernel] [k] native_queued_spin_lock_slowpath
> > 6.91% [kernel] [k] intel_idle_irq
> > 0.87% [kernel] [k] update_sd_lb_stats.constprop.0
> > 0.68% [kernel] [k] _raw_spin_lock
> > 0.63% [kernel] [k] clear_page_erms
> > 0.56% [kernel] [k] sched_balance_find_dst_group
> > 0.40% [kernel] [k] alloc_vmap_area
> >
> > and then bpftrace for the waiters:
> > $ bpftrace -e 'kprobe:native_queued_spin_lock_slowpath {@[arg0]=count();}
> > END {for($kv : @) {printf("%s\t%d\n", ksym($kv.0), (int64)$kv.1);} clear(@); }'\
> > >bpftrace.out
> > $ sort -k2 -r -n bpftrace.out | head | column -t
> > pidmap_lock 10482583
> > nft_pcpu_tun_ctx 3693517
> > css_set_lock 1511164
> > input_pool 976252
> > tasklist_lock 798578
> > nft_pcpu_tun_ctx 481962
> > 0xffff8abc3ffd55b0 95371
> > 0xffff8a6d3ffd65b0 93686
> > 0xffff8a5e218f0840 29501
> > 0xffff8a5e451dca40 29421
> >
> > or measured by cumulative waiting time:
> > $ bpftrace -e 'kprobe:native_queued_spin_lock_slowpath {@[cpu]=arg0; @st[cpu]=nsecs;}
> > kretprobe:native_queued_spin_lock_slowpath /@[cpu]/ {$lat=nsecs-@st[cpu]; @lats[@[cpu]]=sum($lat);}
> > END {for($kv : @lats) {printf("%s\t%d\n", ksym($kv.0), (int64)$kv.1);} clear(@lats); clear(@st); clear(@) }'\
> > >bpftrace2.out
> >
> > $ sort -k2 -r -n bpftrace2.out | head -n15 | column -t
> > pidmap_lock 1931209805
> > rcu_state 1823286316
> > rcu_state 1581455156
> > rcu_state 1328804835
> > rcu_state 1299517157
> > rcu_state 1134101627
> > nft_pcpu_tun_ctx 1027837665
> > 0xffff8abc3ffd55b0 861441978
> > 0xffff8a6d3ffd65b0 850732998
> > css_set_lock 520009479
> > input_pool 316598763
> > tasklist_lock 127161061
> > 0xffff8aac40023200 32380418
> > 0xffff8a5e002ab600 30194951
> > rcu_state 18334578
> >
>
> If the only thing you applied is the patchset over at
> https://lore.kernel.org/linux-mm/20251206131955.780557-1-mjguzik@xxxxxxxxx/
> , then this lines up with my own measurements, where I said the pidmap
> lock remains dominant.
>
> That thing gets unclogged with a patch by Christian to move pidmap
> handling out, which can be found here:
> https://lore.kernel.org/all/20260120-work-pidfs-rhashtable-v2-1-d593c4d0f576@xxxxxxxxxx/
>
> Afterwards it is css_set_lock at the top of the profile.
>
> > Hm, interesting, that is suggestive of why I saw no big change with
> > css_set_lock in my setup.
> >
>
> Regardless of the above, I noted sorting out this lock does not
> meaningfully improve performance, it merely shifts contention to
> tasklist afterwards.
>
> >
> > Michal