Re: rq lock contention due to commit af7f588d8f73

From: Aaron Lu
Date: Wed Mar 29 2023 - 03:46:13 EST


On Tue, Mar 28, 2023 at 08:39:41AM -0400, Mathieu Desnoyers wrote:
> On 2023-03-28 02:58, Aaron Lu wrote:
> > On Mon, Mar 27, 2023 at 03:57:43PM -0400, Mathieu Desnoyers wrote:
> > > I've just resuscitated my per-runqueue concurrency ID cache patch from an older
> > > patchset, and posted it as RFC. So far it passed one round of rseq selftests. Can
> > > you test it in your environment to see if I'm on the right track ?
> > >
> > > https://lore.kernel.org/lkml/20230327195318.137094-1-mathieu.desnoyers@xxxxxxxxxxxx/
> >
> > There are improvements with this patch.
> >
> > When running the client side sysbench with nr_thread=56, the lock contention
> > is gone; with nr_thread=224(=nr_cpu of this machine), the lock contention
> > dropped from 75% to 27%.
>
> This is a good start!
>
> Can you compare this with Peter's approach to modify init/Kconfig, make
> SCHED_MM_CID a bool, and set it =n in the kernel config ?
>
> I just want to see what baseline we should compare against.
>
> Another test we would want to try here: there is an arbitrary choice for the
> runqueue cache array size in my own patch:
>
> kernel/sched/sched.h:
> # define RQ_CID_CACHE_SIZE 8
>
> Can you try changing this value for 16 or 32 instead and see if it helps?

I tried 32. The short answer is: for the nr_thread=224 case, using a larger
value doesn't make an obvious difference.

Here is more detailed info.

During a 5-minute run, I captured a 5s perf profile every 30 seconds. Since
this machine has 224 cpus, I limited perf record to 4 cpus of each node to
keep the recorded data manageable. Here are the results:

Your RFC patch that adds the mm_cid rq cache (RQ_CID_CACHE_SIZE 8):
node0_1.profile: 26.07% 26.06% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
node0_2.profile: 28.38% 28.37% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
node0_3.profile: 25.44% 25.44% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
node0_4.profile: 16.14% 16.13% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
node0_5.profile: 15.17% 15.16% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
node0_6.profile: 5.23% 5.23% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
node0_7.profile: 2.64% 2.64% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
node0_8.profile: 2.87% 2.87% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
node0_9.profile: 2.73% 2.73% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
node1_1.profile: 23.78% 23.77% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
node1_2.profile: 25.11% 25.10% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
node1_3.profile: 21.97% 21.95% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
node1_4.profile: 19.37% 19.35% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
node1_5.profile: 18.85% 18.84% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
node1_6.profile: 11.22% 11.20% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
node1_7.profile: 1.65% 1.64% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
node1_8.profile: 1.68% 1.67% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
node1_9.profile: 1.57% 1.56% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath

Changing RQ_CID_CACHE_SIZE to 32:
node0_1.profile: 29.25% 29.24% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
node0_2.profile: 26.87% 26.87% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
node0_3.profile: 24.23% 24.23% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
node0_4.profile: 17.31% 17.30% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
node0_5.profile: 3.61% 3.60% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
node0_6.profile: 2.60% 2.59% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
node0_7.profile: 1.77% 1.77% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
node0_8.profile: 2.14% 2.13% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
node0_9.profile: 2.20% 2.20% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
node1_1.profile: 27.25% 27.24% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
node1_2.profile: 25.12% 25.11% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
node1_3.profile: 25.27% 25.26% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
node1_4.profile: 19.48% 19.47% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
node1_5.profile: 10.21% 10.20% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
node1_6.profile: 3.01% 3.00% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
node1_7.profile: 1.47% 1.47% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
node1_8.profile: 1.52% 1.51% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
node1_9.profile: 1.58% 1.56% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath

This workload has more wakeups and task migrations during its initial
~2 minutes, which probably explains why lock contention dropped in the
later profiles.
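
To put a number on the early vs. late difference, here is a minimal Python
sketch that parses result lines in the format shown above and averages the
first percentage column for the early and late samples. It only uses the
node0 lines of the RFC patch run. The cut-off at sample 4 is my assumption
(with one profile every 30 seconds, samples 1-4 should roughly cover the
first 2 minutes), and the names REPORT, LINE_RE, parse and EARLY_SAMPLES are
made up for this illustration.

```python
import re
from statistics import mean

# node0 result lines of the RFC patch run, copied from above.
REPORT = """\
node0_1.profile: 26.07% 26.06% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
node0_2.profile: 28.38% 28.37% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
node0_3.profile: 25.44% 25.44% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
node0_4.profile: 16.14% 16.13% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
node0_5.profile: 15.17% 15.16% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
node0_6.profile: 5.23% 5.23% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
node0_7.profile: 2.64% 2.64% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
node0_8.profile: 2.87% 2.87% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
node0_9.profile: 2.73% 2.73% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
"""

# Layout: <node>_<sample>.profile: <pct>% <pct>% <object> [k] <symbol>
LINE_RE = re.compile(
    r"^(?P<node>node\d+)_(?P<sample>\d+)\.profile:\s+"
    r"(?P<first>[\d.]+)%\s+(?P<second>[\d.]+)%\s+\S+\s+\[k\]\s+(?P<symbol>\S+)$"
)


def parse(report):
    """Return a list of (sample_index, first_column_percent) tuples."""
    rows = []
    for line in report.splitlines():
        m = LINE_RE.match(line.strip())
        if m:
            rows.append((int(m["sample"]), float(m["first"])))
    return rows


# Assumption: samples 1-4 fall inside the initial ~2 minutes of the run.
EARLY_SAMPLES = 4

rows = parse(REPORT)
early = mean(pct for idx, pct in rows if idx <= EARLY_SAMPLES)
late = mean(pct for idx, pct in rows if idx > EARLY_SAMPLES)
print(f"early mean: {early:.2f}%  late mean: {late:.2f}%")
```

These are unweighted means of per-profile percentages, so they are only a
rough indication and not a precise measure of time spent in the slowpath.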

For comparison, vanilla v6.3-rc4:
node0_1.profile: 71.27% 71.26% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
node0_2.profile: 72.14% 72.13% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
node0_3.profile: 72.68% 72.67% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
node0_4.profile: 73.30% 73.29% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
node0_5.profile: 77.54% 77.53% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
node0_6.profile: 76.05% 76.04% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
node0_7.profile: 75.08% 75.07% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
node0_8.profile: 75.78% 75.77% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
node0_9.profile: 75.30% 75.30% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
node1_1.profile: 68.40% 68.40% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
node1_2.profile: 69.19% 69.18% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
node1_3.profile: 68.74% 68.74% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
node1_4.profile: 59.99% 59.98% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
node1_5.profile: 56.81% 56.80% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
node1_6.profile: 53.46% 53.45% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
node1_7.profile: 28.90% 28.88% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
node1_8.profile: 27.70% 27.67% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
node1_9.profile: 27.17% 27.14% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath

And when CONFIG_SCHED_MM_CID is off on top of v6.3-rc4:
node0_1.profile: 0.09% 0.08% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
node0_2.profile: 0.08% 0.08% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
node0_3.profile: 0.09% 0.09% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
node0_4.profile: 0.10% 0.10% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
node0_5.profile: 0.07% 0.07% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
node0_6.profile: 0.09% 0.09% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
node0_7.profile: 0.15% 0.15% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
node0_8.profile: 0.08% 0.08% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
node0_9.profile: 0.08% 0.08% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
node1_1.profile: 0.23% 0.22% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
node1_2.profile: 0.28% 0.28% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
node1_3.profile: 2.80% 2.80% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
node1_4.profile: 4.29% 4.29% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
node1_5.profile: 4.05% 4.05% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
node1_6.profile: 2.93% 2.92% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
node1_7.profile: 0.07% 0.07% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
node1_8.profile: 0.07% 0.07% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
node1_9.profile: 0.07% 0.06% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath

As for the few profiles on node1 where lock contention is more than
0.3%, I've checked that those come from pkg_thermal_notify(), which
should be a separate issue.
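
To compare the four kernels against each other, a rough Python sketch like
the following condenses the tables above into one mean and one peak value
per configuration. The numbers are the first percentage column of each
profile listed above (node0 samples followed by node1 samples); the names
RESULTS and summarize are made up for this illustration, and the unweighted
mean over profiles is my simplification, not something perf reports.

```python
from statistics import mean

# First percentage column of every profile above: node0_1..9, then node1_1..9.
RESULTS = {
    "RFC patch, RQ_CID_CACHE_SIZE 8": [
        26.07, 28.38, 25.44, 16.14, 15.17, 5.23, 2.64, 2.87, 2.73,
        23.78, 25.11, 21.97, 19.37, 18.85, 11.22, 1.65, 1.68, 1.57,
    ],
    "RFC patch, RQ_CID_CACHE_SIZE 32": [
        29.25, 26.87, 24.23, 17.31, 3.61, 2.60, 1.77, 2.14, 2.20,
        27.25, 25.12, 25.27, 19.48, 10.21, 3.01, 1.47, 1.52, 1.58,
    ],
    "vanilla v6.3-rc4": [
        71.27, 72.14, 72.68, 73.30, 77.54, 76.05, 75.08, 75.78, 75.30,
        68.40, 69.19, 68.74, 59.99, 56.81, 53.46, 28.90, 27.70, 27.17,
    ],
    "v6.3-rc4, CONFIG_SCHED_MM_CID off": [
        0.09, 0.08, 0.09, 0.10, 0.07, 0.09, 0.15, 0.08, 0.08,
        0.23, 0.28, 2.80, 4.29, 4.05, 2.93, 0.07, 0.07, 0.07,
    ],
}


def summarize(results):
    """Return {config: (mean_percent, peak_percent)} over all of its profiles."""
    return {name: (mean(vals), max(vals)) for name, vals in results.items()}


summary = summarize(RESULTS)

# Print the configurations from least to most contended (by mean).
for name, (avg, peak) in sorted(summary.items(), key=lambda kv: kv[1][0]):
    print(f"{name:36s} mean {avg:6.2f}%  peak {peak:6.2f}%")
```

Since each profile is a 5s sample from a handful of cpus, these summary
values should only be read as a coarse ranking of the four configurations.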

Thanks,
Aaron