Re: [RFC PATCH 0/7] Optimization to reduce the cost of newidle balance

From: K Prateek Nayak
Date: Thu Jul 18 2024 - 05:28:55 EST


Hello Peter,

On 7/17/2024 5:47 PM, Peter Zijlstra wrote:
> On Thu, Jul 27, 2023 at 10:33:58PM +0800, Chen Yu wrote:
> > Hi,
> >
> > This is the second version of the newidle balance optimization[1].
> > It aims to reduce the cost of newidle balance which is found to
> > occupy noticeable CPU cycles on some high-core count systems.
> >
> > For example, when running sqlite on Intel Sapphire Rapids, which has
> > 2 x 56C/112T = 224 CPUs:
> >
> > 6.69% 0.09% sqlite3 [kernel.kallsyms] [k] newidle_balance
> > 5.39% 4.71% sqlite3 [kernel.kallsyms] [k] update_sd_lb_stats
> >
> > To mitigate this cost, the optimization is inspired by the question
> > raised by Tim:
> > Do we always have to find the busiest group and pull from it? Would
> > a relatively busy group be enough?

> So doesn't this basically boil down to recognising that new-idle might
> not be the same as regular load-balancing -- we need any task, fast,
> rather than we need to make equal load.
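
For reference, the distinction boils down to something like the standalone
sketch below (simplified types, an arbitrary busy_enough threshold, and
made-up helper names -- not the kernel code and not Chenyu's actual patch):
a regular balance scans every group looking for the busiest one, while a
newidle-style pull only needs *a* task quickly and can stop at the first
group that is busy enough to steal from.

    #include <stddef.h>
    #include <stdio.h>

    struct group_stats {
            unsigned long load;             /* aggregate load of the group */
            unsigned int nr_running;        /* runnable tasks in the group */
    };

    /* Regular balance: full scan of the domain for the busiest group. */
    static struct group_stats *find_busiest(struct group_stats *g, size_t nr)
    {
            struct group_stats *busiest = NULL;
            size_t i;

            for (i = 0; i < nr; i++) {
                    if (g[i].nr_running > 1 &&
                        (!busiest || g[i].load > busiest->load))
                            busiest = &g[i];
            }
            return busiest;
    }

    /* Newidle-style pull: bail out at the first "relatively busy" group. */
    static struct group_stats *find_pullable(struct group_stats *g, size_t nr,
                                             unsigned long busy_enough)
    {
            size_t i;

            for (i = 0; i < nr; i++) {
                    if (g[i].nr_running > 1 && g[i].load >= busy_enough)
                            return &g[i];
            }
            return NULL;
    }

    int main(void)
    {
            struct group_stats groups[] = {
                    { .load =  512, .nr_running = 2 },
                    { .load = 2048, .nr_running = 4 },
                    { .load = 4096, .nr_running = 8 },
            };

            printf("busiest:  load %lu\n", find_busiest(groups, 3)->load);
            printf("pullable: load %lu\n", find_pullable(groups, 3, 1024)->load);
            return 0;
    }

The early exit is the whole point: it avoids touching the stats of the
remaining groups, which is where update_sd_lb_stats() burns its cycles in
the profile above.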

> David's shared runqueue patches did the same, they re-imagined this very
> path.
>
> Now, David's thing went side-ways because of some regression that wasn't
> further investigated.

In case of SHARED_RUNQ, I suspected that the frequent wakeup-sleep pattern
of hackbench at lower utilization was raising contention somewhere, but a
perf profile with IBS showed nothing specific and I left it there.

I revisited this today and found this interesting data for perf bench
sched messaging running with one group pinned to one LLC domain on my
system:

- NO_SHARED_RUNQ

$ time ./perf bench sched messaging -p -t -l 100000 -g 1
# Running 'sched/messaging' benchmark:
# 20 sender and receiver threads per group
# 1 groups == 40 threads run
Total time: 3.972 [sec] (*)
real 0m3.985s
user 0m6.203s (*)
sys 1m20.087s (*)

$ sudo perf record -C 0-7,128-135 --off-cpu -- taskset -c 0-7,128-135 perf bench sched messaging -p -t -l 100000 -g 1
$ sudo perf report --no-children

Samples: 128 of event 'offcpu-time', Event count (approx.): 96,216,883,498 (*)
Overhead Command Shared Object Symbol
+ 51.43% sched-messaging libc.so.6 [.] read
+ 44.94% sched-messaging libc.so.6 [.] __GI___libc_write
+ 3.60% sched-messaging libc.so.6 [.] __GI___futex_abstimed_wait_cancelable64
0.03% sched-messaging libc.so.6 [.] __poll
0.00% sched-messaging perf [.] sender


- SHARED_RUNQ

$ time taskset -c 0-7,128-135 perf bench sched messaging -p -t -l 100000 -g 1
# Running 'sched/messaging' benchmark:
# 20 sender and receiver threads per group
# 1 groups == 40 threads run
Total time: 48.171 [sec] (*)
real 0m48.186s
user 0m5.409s (*)
sys 0m41.185s (*)

$ sudo perf record -C 0-7,128-135 --off-cpu -- taskset -c 0-7,128-135 perf bench sched messaging -p -t -l 100000 -g 1
$ sudo perf report --no-children

Samples: 157 of event 'offcpu-time', Event count (approx.): 5,882,929,338,882 (*)
Overhead Command Shared Object Symbol
+ 47.49% sched-messaging libc.so.6 [.] read
+ 46.33% sched-messaging libc.so.6 [.] __GI___libc_write
+ 2.40% sched-messaging libc.so.6 [.] __GI___futex_abstimed_wait_cancelable64
+ 1.08% snapd snapd [.] 0x000000000006caa3
+ 1.02% cron libc.so.6 [.] clock_nanosleep@GLIBC_2.2.5
+ 0.86% containerd containerd [.] runtime.futex.abi0
+ 0.82% containerd containerd [.] runtime/internal/syscall.Syscall6


(*) The runtime has bloated massively but both "user" and "sys" time
are down and the "offcpu-time" count goes up with SHARED_RUNQ.

There seems to be a corner case that is not accounted for, but I'm not
sure where it lies currently. P.S. I tested this on a v6.8-rc4 kernel
since that is what I initially tested the series on, but I can see the
same behavior when I rebase the changes onto the current v6.10-rc5-based
tip:sched/core.


> But it occurs to me this might be the same thing that Prateek chased
> down here:
>
>   https://lkml.kernel.org/r/20240710090210.41856-1-kprateek.nayak@xxxxxxx
>
> Hmm ?

Without the nohz_csd_func fix and the SM_IDLE fast-path (Patches 1 and 2),
the scheduler currently depends on newidle_balance() to pull tasks to an
idle CPU. Vincent had pointed this out on the first RFC that tried to
tackle the problem, which did what SM_IDLE does but for the fair class
alone:

https://lore.kernel.org/all/CAKfTPtC446Lo9CATPp7PExdkLhHQFoBuY-JMGC7agOHY4hs-Pw@xxxxxxxxxxxxxx/

It shouldn't be too frequent, but it could be the reason why
newidle_balance() jumps up in traces, especially if it decides to scan a
domain with a large number of CPUs (NUMA1/NUMA2 in Matt's case, perhaps
the PKG/NUMA domain in the case Chenyu was investigating initially).
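
As a back-of-envelope illustration of why that scan hurts on large
machines, here is a standalone sketch (not the kernel's
update_sd_lb_stats(); the per-level CPU counts are assumptions loosely
based on the 2 x 56C/112T Sapphire Rapids quoted above): the stats pass
of a newidle balance reads per-CPU state for every CPU spanned by the
domain being balanced, so the cost of a single entry grows with the
domain span.

    #include <stdio.h>

    int main(void)
    {
            /* Hypothetical domain spans for a 2 x 56C/112T machine. */
            struct { const char *level; unsigned int span; } d[] = {
                    { "SMT",            2 },
                    { "MC (one LLC)", 112 },
                    { "PKG/NUMA",     224 },
            };
            unsigned int i;

            for (i = 0; i < 3; i++)
                    printf("%-14s ~%3u runqueues scanned per newidle entry\n",
                           d[i].level, d[i].span);
            return 0;
    }

The absolute numbers do not matter; the point is that one unlucky newidle
entry at the PKG/NUMA level does roughly two orders of magnitude more work
than one that stops at SMT.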


> Supposing that is indeed the case, I think it makes more sense to
> proceed with that approach. That is, completely redo the sub-numa new
> idle balance.



--
Thanks and Regards,
Prateek