Re: [PATCH] sched/fair: Fix nohz.next_balance update
From: Vincent Guittot
Date: Tue May 05 2020 - 10:27:19 EST
Le mardi 05 mai 2020 à 21:40:56 (+0800), Peng Liu a écrit :
> On Mon, May 04, 2020 at 05:17:11PM +0200, Vincent Guittot wrote:
> > On Sun, 3 May 2020 at 10:34, Peng Liu <iwtbavbm@xxxxxxxxx> wrote:
> > >
> > > commit c5afb6a87f23 ("sched/fair: Fix nohz.next_balance update")
> > > During idle load balance, this_cpu(ilb) do load balance for the other
> > > idle CPUs, also gather the earliest (nohz.)next_balance.
> > >
> > > Since commit:
> > > 'b7031a02ec75 ("sched/fair: Add NOHZ_STATS_KICK")'
> > >
> > > We update nohz.next_balance like this:
> > >
> > > _nohz_idle_balance() {
> > > for_each_cpu(nohz.idle_cpus_mask) {
> > > rebalance_domains() {
> > > update nohz.next_balance <-- compare and update
> > > }
> > > }
> > > rebalance_domains(this_cpu) {
> > > update nohz.next_balance <-- compare and update
> > > }
> > > update nohz.next_balance <-- unconditionally update
> > > }
> > >
> > > For instance, nohz.idle_cpus_mask spans {cpu2,3,5,8}, and this_cpu is
> > > cpu5. After the above loop we could gather the earliest *next_balance*
> > > among {cpu2,3,8}, then rebalance_domains(this_cpu) update
> > > nohz.next_balance with this_rq->next_balance, but finally overwrite
> > > nohz.next_balance with the earliest *next_balance* among {cpu2,3,8},
> > > we may end up with not getting the earliest next_balance.
> > >
> > > Since we can gather all the updated rq->next_balance, including this_cpu,
> > > in _nohz_idle_balance(), it's safe to remove the extra lines in
> > > rebalance_domains() which are originally intended for this_cpu. And
> > > finally the updating only happen in _nohz_idle_balance().
> >
> > I'm not sure that's always true. Nothing prevents nohz_idle_balance()
> > to return false . Then run_rebalance_domains() calls
> > rebalance_domains(this_rq ,SCHED_IDLE) outside _nohz_idle_balance().
> > In this case we must keep the code in rebalance_domains().
> >
> > For example when the tick is not stopped when entering idle. Or when
> > need_resched() returns true.
> >
> > So instead of removing the code from rebalance_domains, you should
> > move the one in _nohz_idle_balance() to make sure that the "if
> > (likely(update_next_balance)) ..." is called before calling
> > rebalance_domains for the local cpu
> >
>
> Yes, you're right. When need_resched() returns true, things become out
> of expectation. We haven't really got the earliest next_balance, abort
> the update immediately and let the successor to help. Doubtless this
> will incur some overhead due to the repeating work.
There should not be some repeating works because CPUs and sched_domain, which
have already been balanced, will not be rebalanced until the next load balance
interval.
Futhermore, there is in fact still work to do bcause not all the idle CPUs got
a chance to pull work
>
>
> About the "tick is not stopped when entering idle" case, defer the
> update to nohz_balance_enter_idle() would be a choice too.
>
>
> Of course, only update nohz.next_balance in rebalance_domains() is the
> simpliest way, but as @Valentin put, too many write to it may incur
> unnecessary overhead. If we can gather the earliest next_balance in
This is not really possible because we have to move it to the next interval.
> advance, then a single write is considered to be better.
>
> By the way, remove the redundant check in nohz_idle_balance().
>
> FWIW, how about the below?
Your proposal below looks quite complex. IMO, one solution would be to move the
update of nohz.next_balance before calling rebalance_domains(this_rq, CPU_IDLE)
so you are back to the previous behavior.
The only difference is that in case of an break because of need_resched, it
doesn't update nohz.next_balance. But on the other hand, we haven't yet
finished run rebalance_domains for all CPUs and some load_balance are still
pending. In fact, this will be done during next tick by an idle CPU.
So I would be in favor of something as simple as :
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 04098d678f3b..e028bc1c4744 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10457,6 +10457,14 @@ static bool _nohz_idle_balance(struct rq *this_rq, unsigned int flags,
}
}
+ /*
+ * next_balance will be updated only when there is a need.
+ * When the CPU is attached to null domain for ex, it will not be
+ * updated.
+ */
+ if (likely(update_next_balance))
+ nohz.next_balance = next_balance;
+
/* Newly idle CPU doesn't need an update */
if (idle != CPU_NEWLY_IDLE) {
update_blocked_averages(this_cpu);
@@ -10477,14 +10485,6 @@ static bool _nohz_idle_balance(struct rq *this_rq, unsigned int flags,
if (has_blocked_load)
WRITE_ONCE(nohz.has_blocked, 1);
- /*
- * next_balance will be updated only when there is a need.
- * When the CPU is attached to null domain for ex, it will not be
- * updated.
- */
- if (likely(update_next_balance))
- nohz.next_balance = next_balance;
-
return ret;
}
> ***********************************************
> * Below code is !!!ENTIRELY UNTESTED!!!, just *
> * a draft to see whehter it's sensible! *
> ***********************************************
> -------------------<-----------------------
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 02f323b85b6d..a7d63ea706ac 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -9943,22 +9943,8 @@ static void rebalance_domains(struct rq *rq, enum cpu_idle_type idle)
> * When the cpu is attached to null domain for ex, it will not be
> * updated.
> */
> - if (likely(update_next_balance)) {
> + if (likely(update_next_balance))
> rq->next_balance = next_balance;
> -
> -#ifdef CONFIG_NO_HZ_COMMON
> - /*
> - * If this CPU has been elected to perform the nohz idle
> - * balance. Other idle CPUs have already rebalanced with
> - * nohz_idle_balance() and nohz.next_balance has been
> - * updated accordingly. This CPU is now running the idle load
> - * balance for itself and we need to update the
> - * nohz.next_balance accordingly.
> - */
> - if ((idle == CPU_IDLE) && time_after(nohz.next_balance, rq->next_balance))
> - nohz.next_balance = rq->next_balance;
> -#endif
> - }
> }
>
> static inline int on_null_domain(struct rq *rq)
> @@ -10218,6 +10204,9 @@ void nohz_balance_enter_idle(int cpu)
>
> rq->nohz_tick_stopped = 1;
>
> + if (time_after(nohz.next_balance, rq->next_balance))
> + nohz.next_balance = rq->next_balance;
> +
> cpumask_set_cpu(cpu, nohz.idle_cpus_mask);
> atomic_inc(&nohz.nr_cpus);
>
> @@ -10287,6 +10276,7 @@ static bool _nohz_idle_balance(struct rq *this_rq, unsigned int flags,
> */
> if (need_resched()) {
> has_blocked_load = true;
> + update_next_balance = 0;
> goto abort;
> }
>
> @@ -10321,9 +10311,15 @@ static bool _nohz_idle_balance(struct rq *this_rq, unsigned int flags,
> has_blocked_load |= this_rq->has_blocked_load;
> }
>
> - if (flags & NOHZ_BALANCE_KICK)
> + if (flags & NOHZ_BALANCE_KICK) {
> rebalance_domains(this_rq, CPU_IDLE);
>
> + if (time_after(next_balance, this_rq->next_balance)) {
> + next_balance = this_rq->next_balance;
> + update_next_balance = 1;
> + }
> + }
> +
> WRITE_ONCE(nohz.next_blocked,
> now + msecs_to_jiffies(LOAD_AVG_PERIOD));
>
> @@ -10354,9 +10350,7 @@ static bool nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle)
> {
> int this_cpu = this_rq->cpu;
> unsigned int flags;
> -
> - if (!(atomic_read(nohz_flags(this_cpu)) & NOHZ_KICK_MASK))
> - return false;
why did you remove this ?
> + bool done;
>
> if (idle != CPU_IDLE) {
> atomic_andnot(NOHZ_KICK_MASK, nohz_flags(this_cpu));
> @@ -10368,9 +10362,16 @@ static bool nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle)
> if (!(flags & NOHZ_KICK_MASK))
> return false;
>
> - _nohz_idle_balance(this_rq, flags, idle);
> + /*
> + * If idle load balance terinated due to this CPU become busy,
> + * pretend it has successfully pulled some loads, and abort
> + * the following load balance.
> + */
> + done = _nohz_idle_balance(this_rq, flags, idle);
> + if (done == false && need_resched())
> + return true;
>
> - return true;
> + return done;
> }
>
> static void nohz_newidle_balance(struct rq *this_rq)