Re: [RFC PATCH 0/7] Optimization to reduce the cost of newidle balance

From: Chen Yu
Date: Thu Jul 18 2024 - 12:57:46 EST


Hi Peter,

On 2024-07-17 at 14:17:45 +0200, Peter Zijlstra wrote:
> On Thu, Jul 27, 2023 at 10:33:58PM +0800, Chen Yu wrote:
> > Hi,
> >
> > This is the second version of the newidle balance optimization[1].
> > It aims to reduce the cost of newidle balance which is found to
> > occupy noticeable CPU cycles on some high-core count systems.
> >
> > For example, when running sqlite on Intel Sapphire Rapids, which has
> > 2 x 56C/112T = 224 CPUs:
> >
> > 6.69% 0.09% sqlite3 [kernel.kallsyms] [k] newidle_balance
> > 5.39% 4.71% sqlite3 [kernel.kallsyms] [k] update_sd_lb_stats
> >
> > To mitigate this cost, the optimization is inspired by the question
> > raised by Tim:
> > Do we always have to find the busiest group and pull from it? Would
> > a relatively busy group be enough?
>
> So doesn't this basically boil down to recognising that new-idle might
> not be the same as regular load-balancing -- we need any task, fast,
> rather than we need to make equal load.
>

Yes, exactly.

> David's shared runqueue patches did the same, they re-imagined this very
> path.
>
> Now, David's thing went side-ways because of some regression that wasn't
> further investigated.
>
> But it occurs to me this might be the same thing that Prateek chased
> down here:
>
> https://lkml.kernel.org/r/20240710090210.41856-1-kprateek.nayak@xxxxxxx
>
> Hmm ?
>

Thanks for the patch link. I took a look and, if I understand correctly,
Prateek's patch set fixes three issues related to TIF_POLLING_NRFLAG.
Two of those issues might cause aggressive newidle balance:

1. Normal idle load balance does not get a chance to be triggered
when exiting the idle loop. Since normal idle load balance does not
work, we have to count on newidle balance to do more work.

2. Newly idle load balance is incorrectly triggered when exiting from
idle due to send_ipi(), even though no task is about to sleep.

Issue 2 will increase how often newly idle balance is invoked, but
issue 1 would not. Issue 1 mainly affects the success ratio of each
newidle balance and might not increase how often a newidle balance is
triggered; that frequency should mainly depend on task runtime
duration. Please correct me if I'm wrong.
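
To make the distinction concrete, here is a toy model of what I mean.
It is only a sketch and not kernel code: the struct, its fields and both
helpers are hypothetical names made up for illustration, and it assumes
the two effects can simply be treated as independent.

```c
/* Toy model only; none of these names exist in the kernel. */
struct newidle_model {
	unsigned long idle_entries; /* times the CPU ran out of tasks (task runtime behavior) */
	unsigned long spurious;     /* extra invocations with no task going to sleep (issue 2) */
	unsigned int success_pct;   /* percent of attempts that pull a task (affected by issue 1) */
};

/* How often newidle balance runs: issue 2 adds to this, issue 1 does not. */
static unsigned long newidle_attempts(const struct newidle_model *m)
{
	return m->idle_entries + m->spurious;
}

/* How many of those attempts actually pull a task. */
static unsigned long newidle_pulls(const struct newidle_model *m)
{
	return newidle_attempts(m) * m->success_pct / 100;
}
```

In this model, changing success_pct (issue 1) leaves newidle_attempts()
untouched, while a non-zero spurious count (issue 2) raises it directly.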

All three of Prateek's patches fix existing newidle balance issues, so
I'll apply his patch set and re-test.

> Supposing that is indeed the case, I think it makes more sense to
> proceed with that approach. That is, completely redo the sub-numa new
> idle balance.
>

I did not quite follow this. Prateek's patch set does not redo the
sub-numa new idle balance, I suppose? Or do you mean further work based
on Prateek's patch set?

thanks,
Chenyu