Re: [PATCH v5 1/3] sched: Stop nohz stats when decayed

From: Valentin Schneider
Date: Wed Feb 21 2018 - 08:14:01 EST


On 02/16/2018 01:44 PM, Vincent Guittot wrote:
> On 16 February 2018 at 13:13, Valentin Schneider
> <valentin.schneider@xxxxxxx> wrote:
>> On 02/14/2018 03:26 PM, Vincent Guittot wrote:
>>> Stop the periodic update of blocked load when all idle CPUs have fully
>>> decayed. We introduce a new nohz.has_blocked that reflects whether some idle
>>> CPUs have blocked load that has to be periodically updated. nohz.has_blocked
>>> is set every time an idle CPU can have blocked load and it is then cleared
>>> when no more blocked load has been detected during an update. We don't need
>>> atomic operations but only to make sure of the right ordering when updating
>>> nohz.idle_cpus_mask and nohz.has_blocked.
>>>
>>> Suggested-by: Peter Zijlstra (Intel) <peterz@xxxxxxxxxxxxx>
>>> Signed-off-by: Vincent Guittot <vincent.guittot@xxxxxxxxxx>
>>> ---
>>> kernel/sched/fair.c | 122 ++++++++++++++++++++++++++++++++++++++++++---------
>>> kernel/sched/sched.h | 1 +
>>> 2 files changed, 102 insertions(+), 21 deletions(-)
>>>
>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>> index 7af1fa9..5a6835e 100644
>>> --- a/kernel/sched/fair.c
>>> +++ b/kernel/sched/fair.c
>>>
>>> [...]
>>> @@ -9383,11 +9452,16 @@ static bool nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle)
>>> 		 * work being done for other cpus. Next load
>>> 		 * balancing owner will pick it up.
>>> 		 */
>>> -		if (need_resched())
>>> -			break;
>>> +		if (need_resched()) {
>>> +			has_blocked_load = true;
>>> +			goto abort;
>>> +		}
>>>
>>> 		rq = cpu_rq(balance_cpu);
>>>
>>
>> I'd say it's safe to do the following here. The flag is raised in
>> nohz_balance_enter_idle() before the smp_mb(), so we won't skip a CPU
>> that just got added to nohz.idle_cpus_mask.
>
> rq->has_blocked_load will be set before the barrier only if
> nohz_tick_stopped is not already set.
> Otherwise, we skip the cpumask update and the barrier in
> nohz_balance_enter_idle().
>

Right, I forgot about that bit. I think it's still fine because:
- nohz_balance_enter_idle() can't be called before the last running task is
  dequeued, which requires the rq lock.
- update_blocked_averages() takes the rq lock and clears rq->has_blocked_load
  with the lock held.

So even in the very unlikely scenario where a CPU quickly goes out of and back
into idle after nohz.idle_cpus_mask has been read, the blocked load itself is
safe (it is only touched under the rq lock), so rq->has_blocked_load will end
up being set correctly.
(Took me a while to see it that way)
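
To make that argument concrete, here is a minimal userspace sketch, not kernel
code. struct toy_rq, the toy_* helpers and the "halve the load" decay are all
made up for illustration; only the rule "blocked load is added and decayed
under the rq lock, and the flag is only cleared once the load is gone" is taken
from the reasoning above. It is a sequential model, so it only shows that the
flag ends up right whichever of the two steps runs first.

```c
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical stand-in for a runqueue; not the kernel's struct rq. */
struct toy_rq {
	pthread_mutex_t lock;
	unsigned long blocked_load;	/* load left behind by dequeued tasks */
	bool has_blocked_load;
};

/* The last task leaves: its load becomes blocked load, under the rq lock. */
static void toy_dequeue_last_task(struct toy_rq *rq, unsigned long load)
{
	pthread_mutex_lock(&rq->lock);
	rq->blocked_load += load;
	pthread_mutex_unlock(&rq->lock);
}

/* Models the flag being raised on idle entry, which can only follow the dequeue. */
static void toy_enter_idle(struct toy_rq *rq)
{
	rq->has_blocked_load = true;
}

/*
 * Models the periodic update: decay under the lock (made-up halving) and
 * clear the flag only when nothing is left. It never sets the flag itself.
 */
static void toy_update_blocked(struct toy_rq *rq)
{
	pthread_mutex_lock(&rq->lock);
	rq->blocked_load /= 2;
	if (rq->blocked_load == 0)
		rq->has_blocked_load = false;
	pthread_mutex_unlock(&rq->lock);
}
```

If the update runs between the dequeue and the idle entry, it sees a non-zero
load and leaves the flag alone, and the later idle entry raises it. If it runs
after the idle entry, it keeps the flag set until the load has decayed to zero.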


BTW, with the current set on Peter's sched/testing, update_nohz_stats()
is called here, which doesn't do the update if !rq->has_blocked_load.
(Although that check is done without a lock or barrier, so maybe we could miss
a CPU that just went idle?)

I have one more question on that bit:


	has_blocked_load |= update_nohz_stats(rq, true);

	/*
	 * If time for next balance is due,
	 * do the balance.
	 */
	if (time_after_eq(jiffies, rq->next_balance)) {
		struct rq_flags rf;

		rq_lock_irqsave(rq, &rf);
		update_rq_clock(rq);
		cpu_load_update_idle(rq);
		rq_unlock_irqrestore(rq, &rf);

		if (flags & NOHZ_BALANCE_KICK)
			rebalance_domains(rq, CPU_IDLE);
	}

	if (time_after(next_balance, rq->next_balance)) {
		next_balance = rq->next_balance;
		update_next_balance = 1;
	}


Now that I think about it, shouldn't we always have a 'continue' after
the blocked load update when (flags & NOHZ_KICK_MASK) == NOHZ_STATS_KICK?
AFAICT we don't want to push the next_balance forward in that case, only the
next_blocked. That would also take care of not doing the load balance.
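
A rough userspace model of what I mean, purely as a sketch: struct model_rq,
struct loop_result, the model_* helpers, the flag bit values, the halving decay
and the "+ 10" rebalance interval are all invented here. Only the flag names
and the overall shape of the loop come from the snippet above.

```c
#include <stdbool.h>

/* Bit values are assumptions for this sketch; only the names are from the code above. */
#define NOHZ_BALANCE_KICK	0x1u
#define NOHZ_STATS_KICK		0x2u
#define NOHZ_KICK_MASK		(NOHZ_BALANCE_KICK | NOHZ_STATS_KICK)

/* Hypothetical per-CPU state; not the kernel's struct rq. */
struct model_rq {
	unsigned long next_balance;
	unsigned long blocked_load;
	bool balanced;
};

struct loop_result {
	bool has_blocked_load;
	bool update_next_balance;
	unsigned long next_balance;
};

/* Made-up decay: halve the blocked load, report whether any remains. */
static bool model_update_nohz_stats(struct model_rq *rq)
{
	rq->blocked_load /= 2;
	return rq->blocked_load != 0;
}

static struct loop_result model_nohz_loop(struct model_rq *rqs, int nr,
					  unsigned int flags, unsigned long now,
					  unsigned long next_balance)
{
	struct loop_result res = { false, false, next_balance };

	for (int i = 0; i < nr; i++) {
		struct model_rq *rq = &rqs[i];

		res.has_blocked_load |= model_update_nohz_stats(rq);

		/* Proposed: a stats-only kick stops here and leaves next_balance alone. */
		if ((flags & NOHZ_KICK_MASK) == NOHZ_STATS_KICK)
			continue;

		/* Wraparound-safe "now >= rq->next_balance". */
		if ((long)(now - rq->next_balance) >= 0) {
			if (flags & NOHZ_BALANCE_KICK) {
				rq->balanced = true;
				rq->next_balance = now + 10;	/* hypothetical interval */
			}
		}

		/* Wraparound-safe "res.next_balance > rq->next_balance". */
		if ((long)(rq->next_balance - res.next_balance) < 0) {
			res.next_balance = rq->next_balance;
			res.update_next_balance = true;
		}
	}
	return res;
}
```

With flags == NOHZ_STATS_KICK the loop only decays blocked load and reports
whether any is left; no rq gets balanced and next_balance is returned
unchanged. With NOHZ_BALANCE_KICK also set, the rest of the loop body runs as
before.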
>>
>> 	/*
>> 	 * This cpu doesn't have any remaining blocked load, skip it.
>> 	 * It's sane to do this because this flag is raised in
>> 	 * nohz_balance_enter_idle()
>> 	 */
>> 	if ((flags & NOHZ_KICK_MASK) == NOHZ_STATS_KICK &&
>> 	    !rq->has_blocked_load)
>> 		continue;
>>
>>> + update_blocked_averages(rq->cpu);
>>> + has_blocked_load |= rq->has_blocked_load;
>>> +