Re: [PATCH v3 3/3] sched: update blocked load when newly idle

From: Vincent Guittot
Date: Mon Feb 12 2018 - 11:07:00 EST


On 12 February 2018 at 16:38, Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
> On Mon, Feb 12, 2018 at 03:34:44PM +0100, Vincent Guittot wrote:
>> On Monday 12 Feb 2018 at 13:04:11 (+0100), Peter Zijlstra wrote:
>> > On Mon, Feb 12, 2018 at 09:07:54AM +0100, Vincent Guittot wrote:
>
>> > So I really hate this one, also I suspect it's broken, because we do this
>> > check before dropping rq->lock and _nohz_idle_balance() will take
>> > rq->lock.
>>
>> yes. it will take both newly idle rq and idle rq lock
>
> Right, can't do that, there's ordering rules for multiple RQ locks etc..
>
>>
>> >
>> >
>> > Aside from the above being an unreadable mess, I dislike that it breaks
>> > the various isolation crud, we should not touch CPUs outside of our
>> > domain.
>> >
>> >
>> > Maybe something like the below? (unfinished)
>> >
>>
>> good catch. I completely missed the isolation stuff.
>> But isn't that already the case when kicking ilb? I mean that an idle CPU touches
>> all idle CPUs, and some can be outside its domain during ilb.
>
>> Shouldn't we test housekeeping_cpu(cpu, HK_FLAG_SCHED) instead if we want to
>> make sure that an isolated/full nohz CPU will not be used for updating blocked
>> load of CPUs outside its domain ?
>
> I _thought_ we had some 'housekeeping' crud in the ilb selection logic,
> but now I can't find it. Frederic?
>
>> Is something below more readable:
>>
>> /*
>> + * This CPU doesn't want to be disturbed by scheduler
>> + * housekeeping
>> */
>> + if (!housekeeping_cpu(cpu, HK_FLAG_SCHED))
>> + goto out;
>> +
>> + /* Will wake up very soon. No time for doing anything else*/
>> + if (this_rq->avg_idle < sysctl_sched_migration_cost)
>> + goto out;
>> +
>> + /* Don't need to update blocked load of idle CPUs*/
>> + if (!has_blocked || time_after_eq(jiffies, next_blocked))
>> + goto out;
>> +
>> + raw_spin_unlock(&this_rq->lock);
>> + /*
>> + * This CPU is going to be idle and blocked load of idle CPUs
>> + * need to be updated. Run the ilb locally as it is a good
>> + * candidate for ilb instead of waking up another idle CPU.
>> + * Kick a normal ilb if we failed to do the update.
>> + */
>> + if (!_nohz_idle_balance(this_rq, NOHZ_STATS_KICK, CPU_NEWLY_IDLE))
>> kick_ilb(NOHZ_STATS_KICK);
>> + raw_spin_lock(&this_rq->lock);
>>
>> goto out;
>
> It is, but I think you're still doing that avg_idle thing twice now,
> right?

Yes, the goal was to avoid exceeding the idle time, but I wonder whether
it is really needed: the need_resched() check in the
"for_each_cpu(balance_cpu, nohz.idle_cpus_mask) {" loop will abort the
loop if something is scheduled on this_cpu, just like for a normal ilb.
So I think that we can remove this avg_idle test.
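
To illustrate the argument, here is a minimal userspace sketch of a scan
loop that bails out as soon as a reschedule is pending. It is not the
kernel code: resched_pending, need_resched_stub(), struct scan_result,
scan_idle_cpus() and the wake_at parameter are all hypothetical stand-ins
for need_resched() and the walk over nohz.idle_cpus_mask.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the per-CPU "need resched" flag. */
static bool resched_pending;

static bool need_resched_stub(void)
{
	return resched_pending;
}

/* How far the scan got, and whether it covered every idle CPU. */
struct scan_result {
	size_t visited;
	bool completed;
};

/*
 * Walk the idle CPUs and stop as soon as a reschedule is pending.
 * wake_at is an assumption for the demo only: visiting that CPU id
 * simulates a task becoming runnable on the scanning CPU.
 */
static struct scan_result scan_idle_cpus(const int *idle_cpus, size_t n,
					 int wake_at)
{
	struct scan_result r = { 0, true };

	assert(idle_cpus != NULL || n == 0);

	for (size_t i = 0; i < n; i++) {
		if (need_resched_stub()) {
			/* Work showed up: give the CPU back immediately. */
			r.completed = false;
			break;
		}

		/* The blocked load of idle_cpus[i] would be updated here. */
		r.visited++;

		if (idle_cpus[i] == wake_at)
			resched_pending = true;
	}

	return r;
}
```

With the check at the top of every iteration, the scan overruns the idle
period by at most one per-CPU update, which is why a separate avg_idle
test in front of the loop looks redundant to me.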

>
>> > @@ -7850,7 +7850,7 @@ static bool update_nohz_stats(struct rq
>> > if (!cpumask_test_cpu(cpu, nohz.idle_cpus_mask))
>> > return false;
>> >
>> > - if (!time_after(jiffies, rq->last_blocked_load_update_tick))
>> > + if (!force && !time_after(jiffies, rq->last_blocked_load_update_tick))
>>
>> This fixes the concern raised on the other thread, doesn't it?
>
> Yes.
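
For reference, a small userspace sketch of what the force flag changes in
that check. tick_after() mirrors the wraparound-safe comparison idea behind
the kernel's time_after(); should_update() and its parameters are
hypothetical names that only model the condition in the hunk above, not
update_nohz_stats() itself.

```c
#include <stdbool.h>

/*
 * Wraparound-safe "a is after b" on a free-running tick counter,
 * the same idea as the kernel's time_after(a, b).
 */
static bool tick_after(unsigned long a, unsigned long b)
{
	return (long)(b - a) < 0;
}

/*
 * Models the patched condition: without force, skip the update when
 * the stats were already refreshed during the current tick; with
 * force, always update.
 */
static bool should_update(unsigned long now, unsigned long last_tick,
			  bool force)
{
	if (!force && !tick_after(now, last_tick))
		return false;

	return true;
}
```

So a forced caller is never rate limited to one update per tick, while
the unforced path behaves as before.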
>
>> > +static int nohz_age(struct sched_domain *sd)
>> > +{
>> > + struct cpumask *cpus = this_cpu_cpumask_var_ptr(load_balance_mask);
>> > + bool has_blocked_load;
>> > +
>> > + WRITE_ONCE(nohz.has_blocked, 0);
>> > +
>> > + smp_mb();
>> > +
>> > + cpumask_and(cpus, sched_domain_span(sd), nohz.idle_cpus_mask);
>> > +
>> > + has_blocked_load = cpumask_subset(nohz.idle_cpus_mask, sched_domain_span(sd));
>> > +
>> > + for_each_cpu(cpu, cpus) {
>> > + struct rq *rq = cpu_rq(cpu);
>> > +
>> > + has_blocked_load |= update_nohz_stats(rq, true);
>> > + }
>> > +
>> > + if (has_blocked_load)
>> > + WRITE_ONCE(nohz.has_blocked, 1);
>> > +}
>> > +
>>
>> we duplicate what is done in nohz_idle_balance
>
> In parts yes.. I was too lazy to combine :-)
>
>> > @@ -8919,9 +8955,13 @@ static int idle_balance(struct rq *this_
>> > if (sd->flags & SD_BALANCE_NEWIDLE) {
>> > t0 = sched_clock_cpu(this_cpu);
>> >
>> > - pulled_task = load_balance(this_cpu, this_rq,
>> > - sd, CPU_NEWLY_IDLE,
>> > - &continue_balancing);
>> > + if (nohz_blocked) {
>> > + nohz_age(sd);
>>
>> Do we really need to loop all sched_domain of newly idle CPU and call
>> nohz_age for each level ?
>> Can't we only call nohz_age with the widest/last sched_domain level ?
>
> Yeah, dunno. I went back and forth on that a bit. The largest is
> rq->rd->span. The reason I settled on this variant in the end is that it
> keeps locality. When short idle, it will only scan nearby CPUs instead
> of reaching half-way across the machine.
>
>> Furthermore, we use sd->max_newidle_lb_cost to decide to abort the loop.
>> But this is updated with full load balancing which is longer than just
>> updating blocked load.
>> This will increase the chance to abort before reaching the last level.
>
> Yes.. I figured we'd take that hit :/
>
>> > + } else {
>> > + pulled_task = load_balance(this_cpu, this_rq,
>> > + sd, CPU_NEWLY_IDLE,
>> > + &continue_balancing);
>> > + }
>> >
>> > domain_cost = sched_clock_cpu(this_cpu) - t0;
>> > if (domain_cost > sd->max_newidle_lb_cost)
>>
>> We have to kick an ilb if we must abort before looping over all levels and all
>> idle CPUs; otherwise we can have a situation where the load of some idle CPUs
>> could stay blocked
>
> Yes, like said, was unfinished, I gave up before I got to that.
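
A rough sketch of the missing fallback, under stated assumptions: walk
the domain levels from the narrowest to the widest, stop once the next
level no longer fits in the expected idle time, and tell the caller that
a normal ilb kick is still needed. struct level, age_levels() and the
budget parameter are hypothetical; in the real code the per-level cost
would come from sd->max_newidle_lb_cost and the budget from
this_rq->avg_idle.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical cost of aging the blocked load of one domain level. */
struct level {
	unsigned long cost;
};

/*
 * Age levels in order (narrowest first) while they fit in the budget.
 * Stores the number of levels handled in *done and returns true when
 * the walk stopped early, i.e. when a fallback ilb kick is needed so
 * that the remaining idle CPUs do not keep stale blocked load.
 */
static bool age_levels(const struct level *lv, size_t n,
		       unsigned long budget, size_t *done)
{
	unsigned long spent = 0;
	size_t i;

	assert(done != NULL);

	for (i = 0; i < n; i++) {
		if (spent + lv[i].cost > budget)
			break;

		/* The per-domain aging step would run here. */
		spent += lv[i].cost;
	}

	*done = i;
	return i < n;
}
```

The caller would then do something like "if (age_levels(...))
kick_ilb(NOHZ_STATS_KICK);" after the loop, which keeps the locality of
the per-level scan while still covering the CPUs we did not reach.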