Re: [PATCH RFC] sched/fair: Avoid unnecessary IPIs for ILB

From: Joel Fernandes
Date: Sun Oct 08 2023 - 12:39:23 EST


Hi Peter,

On Fri, Oct 06, 2023 at 10:01:29PM +0200, Peter Zijlstra wrote:
> On Thu, Oct 05, 2023 at 04:17:26PM +0000, Joel Fernandes (Google) wrote:
> > From: Vineeth Pillai <vineethrp@xxxxxxxxxx>
> >
> > Whenever a CPU stops its tick, it now requires another idle CPU to handle the
> > balancing for it because it can't perform its own periodic load balancing.
> > This means it might need to update 'nohz.next_balance' to 'rq->next_balance' if
> > the upcoming nohz-idle load balancing is too distant in the future. This update
> > process is done by triggering an ILB, as the general ILB handler
> > (_nohz_idle_balance) that manages regular nohz balancing also refreshes
> > 'nohz.next_balance' by looking at the 'rq->next_balance' of all other idle CPUs
> > and selecting the smallest value.
> >
> > Triggering this ILB can be achieved by setting the NOHZ_NEXT_KICK flag. This
> > primarily results in the ILB handler updating 'nohz.next_balance' while
> > possibly not doing any load balancing at all. However, sending an IPI merely to
> > refresh 'nohz.next_balance' seems excessive, and there ought to be a more
> > efficient method to update 'nohz.next_balance' from the local CPU.
> >
> > Fortunately, there already exists a mechanism to directly invoke the ILB
> > handler (_nohz_idle_balance) without initiating an IPI. It's accomplished by
> > setting the NOHZ_NEWILB_KICK flag. This flag is set during regular "newly idle"
> > balancing and solely exists to update a CPU's blocked load if it couldn't pull
> > more tasks during regular "newly idle balancing" - and it does so without
> > having to send any IPIs. Once the flag is set, the ILB handler is called
> > directly from do_idle()-> nohz_run_idle_balance(). While its goal is to update
> > the blocked load without an IPI, in our situation, we aim to refresh
> > 'nohz.next_balance' without an IPI but we can piggy back on this.
> >
> > So in this patch, we reuse this mechanism by also setting the NOHZ_NEXT_KICK to
> > indicate nohz.next_balance needs an update via this direct call shortcut. Note
> > that we set this flag without knowledge that the tick is about to be stopped,
> > because at the point we do it, we have no way of knowing that. However we do
> > know that the CPU is about to enter idle. In our testing, the reduction in IPIs
> > is well worth updating nohz.next_balance a few more times.
> >
> > Also just to note, without this patch we observe the following pattern:
> >
> > 1. A CPU is about to stop its tick.
> > 2. It sets nohz.needs_update to 1.
> > 3. It then stops its tick and goes idle.
> > 4. The scheduler tick on another CPU checks this flag and decides an ILB kick is needed.
> > 5. The ILB CPU ends up being the one that just stopped its tick!
> > 6. This results in an IPI to the tick-stopped CPU which ends up waking it up
> > and disturbing it!
> >
> > Testing shows a considerable reduction in IPIs when doing this:
> >
> > Running "cyclictest -i 100 -d 100 --latency=1000 -t -m" on a 4vcpu VM
> > the IPI call count profiled over 10s period is as follows:
> > without fix: ~10500
> > with fix: ~1000
> >
> > Fixes: 7fd7a9e0caba ("sched/fair: Trigger nohz.next_balance updates when a CPU goes NOHZ-idle")
>
> Hurm.. does this really warrant a Fixes tag? Afaict nothing is currently
> broken -- this is a pure optimization question, no?

IMHO it is a breakage as it breaks NOHZ -- a lot of times the ILB kicks back
the CPU stopping the tick out of idle (effectively breaking NOHZ). The large
number of IPIs also wrecks power and it happens only on 6.1 and after. Having
the fixes tag means it will also goto all stable kernels >= 6.1. Hope that
sounds reasonable and thank you for taking a look!

thanks,

- Joel