Re: [PATCH v3 3/3] sched: update blocked load when newly idle

From: Peter Zijlstra
Date: Mon Feb 12 2018 - 10:38:29 EST


On Mon, Feb 12, 2018 at 03:34:44PM +0100, Vincent Guittot wrote:
> On Monday 12 Feb 2018 at 13:04:11 (+0100), Peter Zijlstra wrote:
> > On Mon, Feb 12, 2018 at 09:07:54AM +0100, Vincent Guittot wrote:

> > So I really hate this one, also I suspect it's broken, because we do this
> > check before dropping rq->lock and _nohz_idle_balance() will take
> > rq->lock.
>
> Yes, it will take both the newly idle rq lock and the idle rq lock.

Right, can't do that, there's ordering rules for multiple RQ locks etc..
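
For illustration only, a minimal userspace sketch of the ordering rule, assuming two locks are always taken in ascending address order (the idea behind double_rq_lock()). The names toy_rq, toy_double_lock() and toy_double_unlock() are hypothetical and not kernel API; the spinlock is a plain C11 atomic_flag.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical stand-in for a runqueue; not the kernel's struct rq. */
struct toy_rq {
	atomic_flag lock;
};

static void toy_lock(struct toy_rq *rq)
{
	/* Spin until the flag was previously clear, i.e. we own the lock. */
	while (atomic_flag_test_and_set_explicit(&rq->lock, memory_order_acquire))
		;
}

static void toy_unlock(struct toy_rq *rq)
{
	atomic_flag_clear_explicit(&rq->lock, memory_order_release);
}

/*
 * Take both locks in ascending address order. If every caller follows
 * the same order, two CPUs locking the same pair cannot deadlock.
 */
static void toy_double_lock(struct toy_rq *a, struct toy_rq *b)
{
	if (a == b) {
		toy_lock(a);
		return;
	}
	if ((uintptr_t)a > (uintptr_t)b) {
		struct toy_rq *tmp = a;

		a = b;
		b = tmp;
	}
	toy_lock(a);
	toy_lock(b);
}

static void toy_double_unlock(struct toy_rq *a, struct toy_rq *b)
{
	toy_unlock(a);
	if (a != b)
		toy_unlock(b);
}
```

Calling something that takes another rq's lock while this_rq->lock is still held bypasses that ordering, which is why the lock has to be dropped first.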

>
> >
> >
> > Aside from the above being an unreadable mess, I dislike that it breaks
> > the various isolation crud, we should not touch CPUs outside of our
> > domain.
> >
> >
> > Maybe something like the below? (unfinished)
> >
>
> Good catch. I completely missed the isolation stuff.
> But isn't that already the case when kicking the ilb? I mean that an idle CPU
> touches all idle CPUs, and some can be outside its domain during ilb.

> Shouldn't we test housekeeping_cpu(cpu, HK_FLAG_SCHED) instead if we want to
> make sure that an isolated/full nohz CPU will not be used for updating blocked
> load of CPUs outside its domain ?

I _thought_ we had some 'housekeeping' crud in the ilb selection logic,
but now I can't find it. Frederic?

> Is something below more readable:
>
> /*
> + * This CPU doesn't want to be disturbed by scheduler
> + * housekeeping
> */
> + if (!housekeeping_cpu(cpu, HK_FLAG_SCHED))
> + goto out;
> +
> + /* Will wake up very soon. No time for doing anything else */
> + if (this_rq->avg_idle < sysctl_sched_migration_cost)
> + goto out;
> +
> + /* Don't need to update blocked load of idle CPUs */
> + if (!has_blocked || time_after_eq(jiffies, next_blocked))
> + goto out;
> +
> + raw_spin_unlock(&this_rq->lock);
> + /*
> + * This CPU is going to be idle and blocked load of idle CPUs
> + * need to be updated. Run the ilb locally as it is a good
> + * candidate for ilb instead of waking up another idle CPU.
> + * Kick a normal ilb if we failed to do the update.
> + */
> + if (!_nohz_idle_balance(this_rq, NOHZ_STATS_KICK, CPU_NEWLY_IDLE))
> kick_ilb(NOHZ_STATS_KICK);
> + raw_spin_lock(&this_rq->lock);
>
> goto out;

It is, but I think you're still doing that avg_idle thing twice now,
right?

> > @@ -7850,7 +7850,7 @@ static bool update_nohz_stats(struct rq
> > if (!cpumask_test_cpu(cpu, nohz.idle_cpus_mask))
> > return false;
> >
> > - if (!time_after(jiffies, rq->last_blocked_load_update_tick))
> > + if (!force && !time_after(jiffies, rq->last_blocked_load_update_tick))
>
> This fixes the concern raised on the other thread, doesn't it?

Yes.
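
As a rough sketch of what the extra 'force' argument changes: without it, a second update within the same jiffy is skipped; with it, the update always runs. The helpers toy_time_after() and toy_should_update() are hypothetical userspace stand-ins; the comparison follows the wrap-safe signed-difference idea used by time_after().

```c
#include <stdbool.h>

/* Wrap-safe "a is after b": the difference is interpreted as signed. */
static bool toy_time_after(unsigned long a, unsigned long b)
{
	return (long)(b - a) < 0;
}

/*
 * Hypothetical helper mirroring the patched condition:
 * skip the update only when not forced AND no tick has passed.
 */
static bool toy_should_update(unsigned long now, unsigned long last_tick,
			      bool force)
{
	if (!force && !toy_time_after(now, last_tick))
		return false;	/* already updated during this tick */

	return true;
}
```

So a caller that passes force == true gets fresh stats even when last_blocked_load_update_tick equals the current jiffy.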

> > +static int nohz_age(struct sched_domain *sd)
> > +{
> > + struct cpumask *cpus = this_cpu_cpumask_var_ptr(load_balance_mask);
> > + bool has_blocked_load;
> > +
> > + WRITE_ONCE(nohz.has_blocked, 0);
> > +
> > + smp_mb();
> > +
> > + cpumask_and(cpus, sched_domain_span(sd), nohz.idle_cpus_mask);
> > +
> > + has_blocked_load = cpumask_subset(nohz.idle_cpus_mask, sched_domain_span(sd));
> > +
> > + for_each_cpu(cpu, cpus) {
> > + struct rq *rq = cpu_rq(cpu);
> > +
> > + has_blocked_load |= update_nohz_stats(rq, true);
> > + }
> > +
> > + if (has_blocked_load)
> > + WRITE_ONCE(nohz.has_blocked, 1);
> > +}
> > +
>
> we duplicate what is done in nohz_idle_balance

In parts yes.. I was too lazy to combine :-)

> > @@ -8919,9 +8955,13 @@ static int idle_balance(struct rq *this_
> > if (sd->flags & SD_BALANCE_NEWIDLE) {
> > t0 = sched_clock_cpu(this_cpu);
> >
> > - pulled_task = load_balance(this_cpu, this_rq,
> > - sd, CPU_NEWLY_IDLE,
> > - &continue_balancing);
> > + if (nohz_blocked) {
> > + nohz_age(sd);
>
> Do we really need to loop over all sched_domains of the newly idle CPU and
> call nohz_age for each level?
> Can't we only call nohz_age with the widest/last sched_domain level?

Yeah, dunno. I went back and forth on that a bit. The largest is
rq->rd->span. The reason I settled on this variant in the end is that it
keeps locality. When short idle, it will only scan nearby CPUs instead
of reaching half-way across the machine.
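
To make the locality argument concrete, here is a small sketch that assumes at most 64 CPUs and represents masks as plain 64-bit words instead of struct cpumask. toy_scan_domain() is a hypothetical helper, not the nohz_age() from the patch; it only shows that intersecting the idle mask with one domain's span bounds the scan to that domain.

```c
#include <stdint.h>

/*
 * Hypothetical per-domain scan: visit only CPUs that are both idle and
 * inside this domain's span (the cpumask_and() step), and return how
 * many were visited. One bit per CPU, up to 64 CPUs.
 */
static int toy_scan_domain(uint64_t domain_span, uint64_t idle_mask,
			   uint64_t *visited)
{
	uint64_t cpus = domain_span & idle_mask;
	int cpu, n = 0;

	for (cpu = 0; cpu < 64; cpu++) {
		if (cpus & (UINT64_C(1) << cpu)) {
			*visited |= UINT64_C(1) << cpu;
			n++;
		}
	}
	return n;
}
```

With a narrow span such as 0x00ff and an idle mask of 0xf0f0, only the four idle CPUs in 0x00f0 are touched; the idle CPUs in 0xf000 are left for a wider level.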

> Furthermore, we use sd->max_newidle_lb_cost to decide to abort the loop.
> But this is updated with full load balancing which is longer than just
> updating blocked load.
> This will increase the chance to abort before reaching the last level.

Yes.. I figured we'd take that hit :/
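
A sketch of that hit, assuming the abort test has the shape avg_idle < curr_cost + max_newidle_lb_cost. The struct toy_domain, its scan_cost field and toy_walk_domains() are hypothetical and only model the cost accounting, not the real idle_balance() loop.

```c
#include <stdint.h>

/* Hypothetical per-level costs, in arbitrary time units. */
struct toy_domain {
	uint64_t max_newidle_lb_cost;	/* recorded from full load balancing */
	uint64_t scan_cost;		/* what this pass actually costs */
};

/*
 * Walk the levels from narrowest to widest and stop as soon as the
 * expected idle time cannot cover the recorded worst-case cost.
 * Returns the number of levels actually visited.
 */
static int toy_walk_domains(const struct toy_domain *sd, int nr,
			    uint64_t avg_idle)
{
	uint64_t curr_cost = 0;
	int level;

	for (level = 0; level < nr; level++) {
		if (avg_idle < curr_cost + sd[level].max_newidle_lb_cost)
			break;
		curr_cost += sd[level].scan_cost;
	}
	return level;
}
```

Because max_newidle_lb_cost reflects a full balance rather than the cheaper blocked-load scan, the walk can stop at a level the scan alone would have had time for.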

> > + } else {
> > + pulled_task = load_balance(this_cpu, this_rq,
> > + sd, CPU_NEWLY_IDLE,
> > + &continue_balancing);
> > + }
> >
> > domain_cost = sched_clock_cpu(this_cpu) - t0;
> > if (domain_cost > sd->max_newidle_lb_cost)
>
> We have to kick an ilb if we must abort before looping over all levels and
> all idle CPUs, otherwise we can end up in a situation where the load of some
> idle CPUs stays blocked.

Yes, like said, was unfinished, I gave up before I got to that.