Re: [RFC PATCH 0/7] Optimization to reduce the cost of newidle balance

From: Chen Yu
Date: Thu Jul 18 2024 - 13:01:53 EST


Hi Prateek,

On 2024-07-18 at 14:58:30 +0530, K Prateek Nayak wrote:
> Hello Peter,
>
> On 7/17/2024 5:47 PM, Peter Zijlstra wrote:
> > On Thu, Jul 27, 2023 at 10:33:58PM +0800, Chen Yu wrote:
> > > Hi,
> > >
> > > This is the second version of the newidle balance optimization[1].
> > > It aims to reduce the cost of newidle balance, which was found to
> > > consume noticeable CPU cycles on some high-core-count systems.
> > >
> > > For example, when running sqlite on Intel Sapphire Rapids, which has
> > > 2 x 56C/112T = 224 CPUs:
> > >
> > > 6.69% 0.09% sqlite3 [kernel.kallsyms] [k] newidle_balance
> > > 5.39% 4.71% sqlite3 [kernel.kallsyms] [k] update_sd_lb_stats
> > >
> > > To mitigate this cost, the optimization is inspired by the question
> > > raised by Tim:
> > > Do we always have to find the busiest group and pull from it? Would
> > > a relatively busy group be enough?
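> > >
> > > As a minimal sketch of the idea (hypothetical code, not the actual
> > > patch; group_is_busy_enough() is a made-up helper): when balancing
> > > for CPU_NEWLY_IDLE, settle for the first group that has load to
> > > spare instead of scanning every group for the busiest one:
> > >
> > > static struct sched_group *find_relatively_busy_group(struct lb_env *env)
> > > {
> > > 	struct sched_group *group = env->sd->groups;
> > >
> > > 	do {
> > > 		/* First group busy enough to pull from wins. */
> > > 		if (group_is_busy_enough(group))
> > > 			return group;
> > > 		group = group->next;
> > > 	} while (group != env->sd->groups);
> > >
> > > 	return NULL;
> > > }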
> >
> > So doesn't this basically boil down to recognising that new-idle might
> > not be the same as regular load-balancing -- we need any task, fast,
> > rather than we need to make equal load.
> >
> > David's shared runqueue patches did the same, they re-imagined this very
> > path.
> >
> > Now, David's thing went sideways because of some regression that wasn't
> > further investigated.
>
> In the case of SHARED_RUNQ, I suspected the frequent wakeup-sleep pattern
> of hackbench at lower utilization was raising contention somewhere, but a
> perf profile with IBS showed nothing specific and I left it there.
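>
> For reference, an IBS profile on these AMD parts can be collected with
> something along these lines (the exact options I used may have
> differed):
>
> $ sudo perf record -e ibs_op// -C 0-7,128-135 -- sleep 10
> $ sudo perf report --no-children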
>
> I revisited this again today and found this interesting data for perf
> bench sched messaging running with one group pinned to one LLC domain on
> my system:
>
> - NO_SHARED_RUNQ
>
> $ time taskset -c 0-7,128-135 ./perf bench sched messaging -p -t -l 100000 -g 1
> # Running 'sched/messaging' benchmark:
> # 20 sender and receiver threads per group
> # 1 groups == 40 threads run
> Total time: 3.972 [sec] (*)
> real 0m3.985s
> user 0m6.203s (*)
> sys 1m20.087s (*)
>
> $ sudo perf record -C 0-7,128-135 --off-cpu -- taskset -c 0-7,128-135 perf bench sched messaging -p -t -l 100000 -g 1
> $ sudo perf report --no-children
>
> Samples: 128 of event 'offcpu-time', Event count (approx.): 96,216,883,498 (*)
> Overhead Command Shared Object Symbol
> + 51.43% sched-messaging libc.so.6 [.] read
> + 44.94% sched-messaging libc.so.6 [.] __GI___libc_write
> + 3.60% sched-messaging libc.so.6 [.] __GI___futex_abstimed_wait_cancelable64
> 0.03% sched-messaging libc.so.6 [.] __poll
> 0.00% sched-messaging perf [.] sender
>
>
> - SHARED_RUNQ
>
> $ time taskset -c 0-7,128-135 perf bench sched messaging -p -t -l 100000 -g 1
> # Running 'sched/messaging' benchmark:
> # 20 sender and receiver threads per group
> # 1 groups == 40 threads run
> Total time: 48.171 [sec] (*)
> real 0m48.186s
> user 0m5.409s (*)
> sys 0m41.185s (*)
>
> $ sudo perf record -C 0-7,128-135 --off-cpu -- taskset -c 0-7,128-135 perf bench sched messaging -p -t -l 100000 -g 1
> $ sudo perf report --no-children
>
> Samples: 157 of event 'offcpu-time', Event count (approx.): 5,882,929,338,882 (*)
> Overhead Command Shared Object Symbol
> + 47.49% sched-messaging libc.so.6 [.] read
> + 46.33% sched-messaging libc.so.6 [.] __GI___libc_write
> + 2.40% sched-messaging libc.so.6 [.] __GI___futex_abstimed_wait_cancelable64
> + 1.08% snapd snapd [.] 0x000000000006caa3
> + 1.02% cron libc.so.6 [.] clock_nanosleep@GLIBC_2.2.5
> + 0.86% containerd containerd [.] runtime.futex.abi0
> + 0.82% containerd containerd [.] runtime/internal/syscall.Syscall6
>
>
> (*) The wall-clock runtime bloats massively with SHARED_RUNQ, yet both
> "user" and "sys" times are down and the "offcpu-time" event count goes up.
>
> There seems to be a corner case that is not accounted for, but I'm not
> sure where it lies currently. P.S. I tested this on a v6.8-rc4 kernel
> since that is what I initially tested the series on, but I can see the
> same behavior when I rebased the changes on the current v6.10-rc5-based
> tip:sched/core.
>
> >
> > But it occurs to me this might be the same thing that Prateek chased
> > down here:
> >
> > https://lkml.kernel.org/r/20240710090210.41856-1-kprateek.nayak@xxxxxxx
> >
> > Hmm ?
>
> Without the nohz_csd_func fix and the SM_IDLE fast path (Patch 1 and 2),
> the scheduler currently depends on newidle_balance() to pull tasks to an
> idle CPU. Vincent had pointed this out on the first RFC that tackled the
> problem by trying to do what SM_IDLE does, but for the fair class alone:
>
> https://lore.kernel.org/all/CAKfTPtC446Lo9CATPp7PExdkLhHQFoBuY-JMGC7agOHY4hs-Pw@xxxxxxxxxxxxxx/
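>
> Roughly, the SM_IDLE fast path in __schedule() boils down to something
> like the below (a sketch of the idea, not the exact diff from Patch 2):
>
> 	if (sched_mode == SM_IDLE) {
> 		/*
> 		 * The idle task is rescheduling: if nothing has become
> 		 * runnable in the meantime, skip the full class walk in
> 		 * pick_next_task() and keep running the idle task.
> 		 */
> 		if (!rq->nr_running) {
> 			next = prev;
> 			goto picked;
> 		}
> 	}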
>
> This path shouldn't be hit too frequently, but it could be the reason
> why newidle_balance() might jump up in traces, especially if it decides
> to scan a domain with a large number of CPUs (NUMA1/NUMA2 in Matt's
> case, perhaps the PKG/NUMA domains in the case Chenyu was investigating
> initially).
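>
> For reference, the cost comes from the domain walk in newidle_balance()
> (simplified from kernel/sched/fair.c), where every level with
> SD_BALANCE_NEWIDLE set ends up in load_balance() ->
> find_busiest_group() -> update_sd_lb_stats(), iterating over all the
> groups at that level:
>
> 	for_each_domain(this_cpu, sd) {
> 		...
> 		if (sd->flags & SD_BALANCE_NEWIDLE) {
> 			/* Scans every sched_group at this level. */
> 			pulled_task = load_balance(this_cpu, this_rq, sd,
> 						   CPU_NEWLY_IDLE,
> 						   &continue_balancing);
> 		}
> 		...
> 		if (pulled_task || this_rq->nr_running > 0)
> 			break;
> 	}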
>

Yes, this is my understanding too. I'll apply your patches and re-test.

thanks,
Chenyu