Re: [PATCH RFC] sched/fair: Avoid unnecessary IPIs for ILB

From: Joel Fernandes
Date: Tue Oct 10 2023 - 15:32:13 EST




On Tue, Oct 10, 2023 at 3:15 AM Vincent Guittot <vincent.guittot@xxxxxxxxxx> wrote:
>
> On Sun, 8 Oct 2023 at 19:35, Joel Fernandes <joel@xxxxxxxxxxxxxxxxx> wrote:
[...]
> > One thing I am confused about in the original code is:
> >
> > tick_nohz_idle_stop_tick() is what sets the nohz.idle_cpus_mask.
> > However, nohz_run_idle_balance() is called before that can happen, in
> > do_idle(). So it is possible that NOHZ_NEWILB_KICK is set for a CPU but it is
> > not yet in the mask.
>
> 2 things:
> - update of nohz.idle_cpus_mask is not strictly aligned with cpu
> entering/exiting idle state. It is set when entering but only cleared
> during the next tick on the cpu because nohz.idle_cpus_mask is a
> bottleneck when all CPUs enter/exit idle at high rate (usec). This
> implies that a cpu entering idle can already be set in
> nohz.idle_cpus_mask
> - NOHZ_NEWILB_KICK is more about updating blocked load of others
> already idle CPUs than the one entering idle which has already updated
> its blocked load in newidle_balance()
>
> The goal of nohz_run_idle_balance() is to run ILB only for updating
> the blocked load of already idle CPUs without waking up one of those
> idle CPUs and outside the preempt disable / irq off phase of the local
> cpu about to enter idle because it can take a long time.

This makes complete sense, thank you for the background on this!

Vineeth was just telling me in a 1:1 that he also tried doing the removal of
the CPU from the idle mask in the restart-tick path. The result was that even
though the mask modification is not as often as when doing it during the CPU
coming out of idle, it is still higher than just doing it from the next busy
tick, like in current mainline.

As next steps we are looking into:
1. Monitor how often we set NEXT_KICK -- we think we can reduce the frequency
of these even more and keep the overhead low.

2. Look more into the parallelism of nohz.next_balance updates (due to our
next NEXT_KICK setting) and handle any race conditions. We are at the moment
looking into if nohz.next_balance does not get set to the earliest value
because of a race, and if so retry the operation.

Something like (untested):

if (likely(update_next_balance)) {
do {
WRITE_ONCE(nohz.next_balance, next_balance);
if (likely(READ_ONCE(nohz.next_balance) <= next_balance)) {
break;
}
cpu_relax();
}
}

Or a try_cmpxchg loop.

I think after these items and a bit more testing, we should be good to send v2.

Thanks.