Re: [RFC PATCH 2/5] sched: Add NOHZ_STATS_KICK

From: Vincent Guittot
Date: Tue Jan 30 2018 - 03:32:56 EST


On 29 January 2018 at 20:31, Valentin Schneider
<valentin.schneider@xxxxxxx> wrote:
> Hi Vincent, Peter,
>
> I've been running some tests on your patches (Peter's base + the 2 from
> Vincent). The results themselves are hosted at [1].
> The base of those tests is the same: a task ("accumulator") is ran for 5
> seconds (arbitrary value) to accumulate some load, then goes to sleep for .5
> seconds.
>
> I've set up 3 test scenarios:
>
> Update by nohz_balance_kick()
> -----------------------------
> Right before the "accumulator" task goes to sleep, a CPU-hogging task (100%
> utilization) is spawned on another CPU. It won't go idle so the only way to
> update the blocked load generated by "accumulator" is to kick an ILB
> (NOHZ_STATS_KICK).
>
> The test shows that this is behaving nicely - we keep kicking an ILB every
> ~36ms (see next test for comments on that) until there is no more blocked
> load. I did however notice some interesting scenarios: after the load has
> been fully decayed, a tiny background task can spawn and end in less than a
> scheduling period. However, it still goes through nohz_balance_enter_idle(),
> and thus sets nohz.stats_state, which will later cause an ILB kick.
>
> This makes me wonder if it's worth kicking ILBs for such tiny load values -
> perhaps it could be worth having a margin to set rq->has_blocked_load ?

So it's difficult to know what will be the load/utilization on the
cfs_rq once the cpu wakes up. Even if it's for a really short time,
that's doesn't mean that the load/utilization is small because it can
be the migration of a big task that just have a very short wakes up
this time.
That's why I don't make any assumption on the utilization/load value
when a cpu goes to sleep

>
> Furthermore, this tiny task will cause the ILB to iterate over all of the
> idle CPUs, although only one has stale load. For load update via NEWLY_IDLE
> load_balance() we use:
>
> static bool update_nohz_stats(struct rq *rq)
> {
> if (!rq->has_blocked_load)
> return false;
> [...]
> }
>
> But for load update via _nohz_idle_balance(), we iterate through all of the
> nohz CPUS and unconditionally call update_blocked_averages(). This could be
> avoided by remembering which CPUs have stale load before going idle.
> Initially I thought that was what nohz.stats_state was for, but it isn't.
> With Vincent's patches it's only ever set to either 0 or 1, but we could use
> it as a CPU mask, and use it to skip nohz CPUs that don't have stale load in
> _nohz_idle_balance() (when NOHZ_STATS_KICK).

I have studied a way to keep track of how many cpus still have blocked
load to try to minimize the number of useless ilb kick but this add
more atomic operations which can impact the system throughput with
heavy load and lot of very small wake up. that's why i have propose
this solution which is more simple. But it's probably just a matter of
where we want to "waste" time. Either we accept to spent a bit more
time to check the state of idle CPUs or we accept to kick ilb from
time to time for no good reason.

>
> Update by idle_balance()
> ------------------------
> Right before the "accumulator" task goes to sleep, a tiny periodic
> (period=32ms) task is spawned on another CPU. It's expected that it will
> update the blocked load in idle_balance(), either by running
> _nohz_idle_balance() locally or kicking an ILB (The overload flag shouldn't
> be set in this test case, so we shouldn't go through the NEWLY_IDLE
> load_balance()).
>
> This also seems to be working fine, but I'm noticing a delay between load
> updates that is closer to 64ms than 32ms. After digging into it I found out
> that the time checks done in idle_balance() and nohz_balancer_kick() are
> time_after(jiffies, next_stats), but IMHO they should be
> time_after_eq(jiffies, next_stats) to have 32ms-based updates. This also
> explains the 36ms periodicity of the updates in the test above.

I have use the 32ms as a minimum value between update. We must use the
time_after() if we want to have at least 32ms between each update. We
will have a 36ms period if the previous update was triggered by the
tick (just after in fact) but there will be only 32ms if the last
update was done during an idle_balance that happens just before the
tick. With time_after_eq, the update period will between 28 and
32ms.

Then, I mention a possible optimization by using time_after_eq in the
idle_balance() so a newly_idle cpu will have more chance (between 0
and 4ms for hz250) to do the update before a ilb is kicked

Thanks,
Vincent

>
>
> No update (idle system)
> -----------------------
> Nothing special here, just making sure nothing happens when the system is
> fully idle. On a sidenote, that's relatively hard to achieve - I had to
> switch over to Juno because my HiKey960 gets interrupts every 16ms. The Juno
> still gets woken up every now and then but it's a bit quieter.
>
>
> [1]: https://gist.github.com/valschneider/a8da7bb8e11fb1ec63a419710f56c0a0
>
>
>
> On 01/24/2018 08:25 AM, Vincent Guittot wrote:
>>
>> Hi,
>>
>> Le Thursday 18 Jan 2018 Ã 10:38:07 (+0000), Morten Rasmussen a Ãcrit :
>>>
>>> On Mon, Jan 15, 2018 at 09:26:09AM +0100, Vincent Guittot wrote:
>>>>
>>>> Le Wednesday 03 Jan 2018 Ã 10:16:00 (+0100), Vincent Guittot a Ãcrit :
>>>>>
>>>>> Hi Peter,
>>>>>
>>>>> On 22 December 2017 at 21:42, Peter Zijlstra <peterz@xxxxxxxxxxxxx>
>>>>> wrote:
>>>>>>
>>>>>> On Fri, Dec 22, 2017 at 07:56:29PM +0100, Peter Zijlstra wrote:
>>>>>>>
>>>>>>> Right; but I figured we'd try and do it 'right' and see how horrible
>>>>>>> it
>>>>>>> is before we try and do funny things.
>>>>>>
>>>>>> So now it should have a 32ms tick for up to .5s when the system goes
>>>>>> completely idle.
>>>>>>
>>>>>> No idea how bad that is..
>>>>>
>>>>> I have tested your branch but the timer doesn't seem to fire correctly
>>>>> because i can still see blocked load in the use case i have run.
>>>>> I haven't found the reason yet
>>>>
>>>> Hi Peter,
>>>>
>>>> With the patch below on top of your branch, the blocked loads are
>>>> updated and
>>>> decayed regularly. The main differences are:
>>>> - It doesn't use a timer to trig ilb but the tick and when a cpu becomes
>>>> idle.
>>>> The main drawback of this solution is that the load is blocked when
>>>> the
>>>> system is fully idle with the advantage of not waking up a fully idle
>>>> system. We have to wait for the next tick or newly idle event for
>>>> updating
>>>> blocked load when the system leaves idle stat which can be up to a
>>>> tick long.
>>>> If this is too long, we can check for kicking ilb when task wakes up
>>>> so the
>>>> blocked load will be updated as soon as the system leaves idle state.
>>>> The main advantage is that we don't wake up a fully idle system every
>>>> 32ms to
>>>> update blocked load that will be not used.
>>>> - I'm working on one more improvement to use nohz_idle_balance in the
>>>> newly
>>>> idle case when the system is not overloaded and
>>>> (this_rq->avg_idle > sysctl_sched_migration_cost). In this case, we
>>>> can try to
>>>> use nohz_idle_balance with NOHZ_STATS_KICK and abort as soon as it
>>>> exceed
>>>> this_rq->avg_idle. This will remove some calls to kick_ilb and some
>>>> wake up
>>>> of an idle cpus.
>>>
>>> This sound like what I meant in my other reply :-)
>>>
>>> It seems pointless to have a timer to update PELT if the system is
>>> completely idle, and when it isn't we can piggy back other events to
>>> make the updates happen.
>>
>> The patch below implements what has been described above. It calls part of
>> nohz_idle_balance when a cpu becomes idle and kick a ilb if it takes too
>> much
>> time. This removes part of ilb that are kicked on an idle cpu for updating
>> the blocked load but the ratio really depends on when the tick happens
>> compared
>> to a cpu becoming idle and the 32ms boundary. I have an additionnal patch
>> that
>> enables to update the blocked loads when a cpu becomes idle 1 period
>> before
>> kicking an ilb and there is far less ilb because we give more chance to
>> the
>> newly idle case (time_after is replaced by time_after_eq in
>> idle_balance()).
>>
>> The patch also uses a function cfs_rq_has_blocked, which only checks the
>> util/load_avg, instead of the cfs_rq_is_decayed which check *_sum too.
>> This
>> reduce significantly the number of update of blocked load. the *_avg will
>> be
>> fully decayed in around 300~400ms but it's far longer for the *_sum which
>> have
>> a higher resolution and we can easily reach almost seconds. But only the
>> *_avg
>> are used to make decision so keeping some blocked *_sum is acceptable.
>>
>> ---
>> kernel/sched/fair.c | 121
>> +++++++++++++++++++++++++++++++++++++++-------------
>> 1 file changed, 92 insertions(+), 29 deletions(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 898785d..ed90303 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -7356,6 +7356,17 @@ static inline bool cfs_rq_is_decayed(struct cfs_rq
>> *cfs_rq)
>> return true;
>> }
>> +static inline bool cfs_rq_has_blocked(struct cfs_rq *cfs_rq)
>> +{
>> + if (cfs_rq->avg.load_avg)
>> + return true;
>> +
>> + if (cfs_rq->avg.util_avg)
>> + return true;
>> +
>> + return false;
>> +}
>> +
>> #ifdef CONFIG_FAIR_GROUP_SCHED
>> static void update_blocked_averages(int cpu)
>> @@ -7393,7 +7404,9 @@ static void update_blocked_averages(int cpu)
>> */
>> if (cfs_rq_is_decayed(cfs_rq))
>> list_del_leaf_cfs_rq(cfs_rq);
>> - else
>> +
>> + /* Don't need periodic decay once load/util_avg are null
>> */
>> + if (cfs_rq_has_blocked(cfs_rq))
>> done = false;
>> }
>> @@ -7463,7 +7476,7 @@ static inline void update_blocked_averages(int
>> cpu)
>> update_cfs_rq_load_avg(cfs_rq_clock_task(cfs_rq), cfs_rq);
>> #ifdef CONFIG_NO_HZ_COMMON
>> rq->last_blocked_load_update_tick = jiffies;
>> - if (cfs_rq_is_decayed(cfs_rq))
>> + if (cfs_rq_has_blocked(cfs_rq))
>> rq->has_blocked_load = 0;
>> #endif
>> rq_unlock_irqrestore(rq, &rf);
>> @@ -8818,6 +8831,7 @@ update_next_balance(struct sched_domain *sd,
>> unsigned long *next_balance)
>> *next_balance = next;
>> }
>> +static bool _nohz_idle_balance(struct rq *this_rq, unsigned int flags,
>> enum cpu_idle_type idle);
>> static void kick_ilb(unsigned int flags);
>> /*
>> @@ -8861,7 +8875,14 @@ static int idle_balance(struct rq *this_rq, struct
>> rq_flags *rf)
>> update_next_balance(sd, &next_balance);
>> rcu_read_unlock();
>> - if (time_after(jiffies, next) &&
>> atomic_read(&nohz.stats_state))
>> + /*
>> + * Update blocked idle load if it has not been done for a
>> + * while. Try to do it locally before entering idle but
>> kick a
>> + * ilb if it takes too much time and might delay next
>> local
>> + * wake up
>> + */
>> + if (time_after(jiffies, next) &&
>> atomic_read(&nohz.stats_state) &&
>> + !_nohz_idle_balance(this_rq,
>> NOHZ_STATS_KICK, CPU_NEWLY_IDLE))
>> kick_ilb(NOHZ_STATS_KICK);
>> goto out;
>> @@ -9237,6 +9258,7 @@ void nohz_balance_enter_idle(int cpu)
>> if (!housekeeping_cpu(cpu, HK_FLAG_SCHED))
>> return;
>> + rq->has_blocked_load = 1;
>> if (rq->nohz_tick_stopped)
>> return;
>> @@ -9247,7 +9269,6 @@ void nohz_balance_enter_idle(int cpu)
>> return;
>> rq->nohz_tick_stopped = 1;
>> - rq->has_blocked_load = 1;
>> cpumask_set_cpu(cpu, nohz.idle_cpus_mask);
>> atomic_inc(&nohz.nr_cpus);
>> @@ -9259,7 +9280,6 @@ void nohz_balance_enter_idle(int cpu)
>> * enable the periodic update of the load of idle cpus
>> */
>> atomic_set(&nohz.stats_state, 1);
>> -
>> }
>> #else
>> static inline void nohz_balancer_kick(struct rq *rq) { }
>> @@ -9385,10 +9405,13 @@ static void rebalance_domains(struct rq *rq, enum
>> cpu_idle_type idle)
>> #ifdef CONFIG_NO_HZ_COMMON
>> /*
>> - * In CONFIG_NO_HZ_COMMON case, the idle balance kickee will do the
>> - * rebalancing for all the cpus for whom scheduler ticks are stopped.
>> + * Internal function that runs load balance for all idle cpus. The load
>> balance
>> + * can be a simple update of blocked load or a complete load balance with
>> + * tasks movement depending of flags.
>> + * For newly idle mode, we abort the loop if it takes too much time and
>> return
>> + * false to notify that the loop has not be completed and a ilb shoud be
>> kick.
>> */
>> -static bool nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type
>> idle)
>> +static bool _nohz_idle_balance(struct rq *this_rq, unsigned int flags,
>> enum cpu_idle_type idle)
>> {
>> /* Earliest time when we have to do rebalance again */
>> unsigned long now = jiffies;
>> @@ -9396,24 +9419,10 @@ static bool nohz_idle_balance(struct rq *this_rq,
>> enum cpu_idle_type idle)
>> bool has_blocked_load = false;
>> int update_next_balance = 0;
>> int this_cpu = this_rq->cpu;
>> - unsigned int flags;
>> int balance_cpu;
>> + int ret = false;
>> struct rq *rq;
>> -
>> - if (!(atomic_read(nohz_flags(this_cpu)) & NOHZ_KICK_MASK))
>> - return false;
>> -
>> - if (idle != CPU_IDLE) {
>> - atomic_andnot(NOHZ_KICK_MASK, nohz_flags(this_cpu));
>> - return false;
>> - }
>> -
>> - /*
>> - * barrier, pairs with nohz_balance_enter_idle(), ensures ...
>> - */
>> - flags = atomic_fetch_andnot(NOHZ_KICK_MASK, nohz_flags(this_cpu));
>> - if (!(flags & NOHZ_KICK_MASK))
>> - return false;
>> + u64 curr_cost = 0;
>> SCHED_WARN_ON((flags & NOHZ_KICK_MASK) == NOHZ_BALANCE_KICK);
>> @@ -9428,6 +9437,10 @@ static bool nohz_idle_balance(struct rq *this_rq,
>> enum cpu_idle_type idle)
>> atomic_set(&nohz.stats_state, 0);
>> for_each_cpu(balance_cpu, nohz.idle_cpus_mask) {
>> + u64 t0, domain_cost;
>> +
>> + t0 = sched_clock_cpu(this_cpu);
>> +
>> if (balance_cpu == this_cpu || !idle_cpu(balance_cpu))
>> continue;
>> @@ -9438,7 +9451,17 @@ static bool nohz_idle_balance(struct rq *this_rq,
>> enum cpu_idle_type idle)
>> */
>> if (need_resched()) {
>> has_blocked_load = true;
>> - break;
>> + goto abort;
>> + }
>> +
>> + /*
>> + * If the update is done while CPU becomes idle, we abort
>> + * the update when its cost is higher than the average
>> idle
>> + * time in orde to not delay a possible wake up.
>> + */
>> + if (idle == CPU_NEWLY_IDLE && this_rq->avg_idle <
>> curr_cost) {
>> + has_blocked_load = true;
>> + goto abort;
>> }
>> rq = cpu_rq(balance_cpu);
>> @@ -9453,10 +9476,10 @@ static bool nohz_idle_balance(struct rq *this_rq,
>> enum cpu_idle_type idle)
>> if (time_after_eq(jiffies, rq->next_balance)) {
>> struct rq_flags rf;
>> - rq_lock_irq(rq, &rf);
>> + rq_lock_irqsave(rq, &rf);
>> update_rq_clock(rq);
>> cpu_load_update_idle(rq);
>> - rq_unlock_irq(rq, &rf);
>> + rq_unlock_irqrestore(rq, &rf);
>> if (flags & NOHZ_BALANCE_KICK)
>> rebalance_domains(rq, CPU_IDLE);
>> @@ -9466,10 +9489,17 @@ static bool nohz_idle_balance(struct rq *this_rq,
>> enum cpu_idle_type idle)
>> next_balance = rq->next_balance;
>> update_next_balance = 1;
>> }
>> +
>> + domain_cost = sched_clock_cpu(this_cpu) - t0;
>> + curr_cost += domain_cost;
>> +
>> }
>> - update_blocked_averages(this_cpu);
>> - has_blocked_load |= this_rq->has_blocked_load;
>> + /* Newly idle CPU doesn't need an update */
>> + if (idle != CPU_NEWLY_IDLE) {
>> + update_blocked_averages(this_cpu);
>> + has_blocked_load |= this_rq->has_blocked_load;
>> + }
>> if (flags & NOHZ_BALANCE_KICK)
>> rebalance_domains(this_rq, CPU_IDLE);
>> @@ -9477,6 +9507,10 @@ static bool nohz_idle_balance(struct rq *this_rq,
>> enum cpu_idle_type idle)
>> WRITE_ONCE(nohz.next_stats,
>> now + msecs_to_jiffies(LOAD_AVG_PERIOD));
>> + /* The full idle balance loop has been done */
>> + ret = true;
>> +
>> +abort:
>> /* There is still blocked load, enable periodic update */
>> if (has_blocked_load)
>> atomic_set(&nohz.stats_state, 1);
>> @@ -9489,6 +9523,35 @@ static bool nohz_idle_balance(struct rq *this_rq,
>> enum cpu_idle_type idle)
>> if (likely(update_next_balance))
>> nohz.next_balance = next_balance;
>> + return ret;
>> +}
>> +
>> +/*
>> + * In CONFIG_NO_HZ_COMMON case, the idle balance kickee will do the
>> + * rebalancing for all the cpus for whom scheduler ticks are stopped.
>> + */
>> +static bool nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type
>> idle)
>> +{
>> + int this_cpu = this_rq->cpu;
>> + unsigned int flags;
>> +
>> + if (!(atomic_read(nohz_flags(this_cpu)) & NOHZ_KICK_MASK))
>> + return false;
>> +
>> + if (idle != CPU_IDLE) {
>> + atomic_andnot(NOHZ_KICK_MASK, nohz_flags(this_cpu));
>> + return false;
>> + }
>> +
>> + /*
>> + * barrier, pairs with nohz_balance_enter_idle(), ensures ...
>> + */
>> + flags = atomic_fetch_andnot(NOHZ_KICK_MASK, nohz_flags(this_cpu));
>> + if (!(flags & NOHZ_KICK_MASK))
>> + return false;
>> +
>> + _nohz_idle_balance(this_rq, flags, idle);
>> +
>> return true;
>> }
>> #else
>
>