Re: [RFC PATCH 2/5] sched: Add NOHZ_STATS_KICK
From: Vincent Guittot
Date: Tue Jan 30 2018 - 03:01:01 EST
On 29 January 2018 at 19:43, Dietmar Eggemann <dietmar.eggemann@xxxxxxx> wrote:
> On 01/24/2018 09:25 AM, Vincent Guittot wrote:
>>
>> Hi,
>>
>> Le Thursday 18 Jan 2018 Ã 10:38:07 (+0000), Morten Rasmussen a Ãcrit :
>>>
>>> On Mon, Jan 15, 2018 at 09:26:09AM +0100, Vincent Guittot wrote:
>>>>
>>>> Le Wednesday 03 Jan 2018 Ã 10:16:00 (+0100), Vincent Guittot a Ãcrit :
>
>
> [...]
>
>
>>>>
>>>> Hi Peter,
>>>>
>>>> With the patch below on top of your branch, the blocked loads are
>>>> updated and
>>>> decayed regularly. The main differences are:
>>>> - It doesn't use a timer to trig ilb but the tick and when a cpu becomes
>>>> idle.
>>>> The main drawback of this solution is that the load is blocked when
>>>> the
>>>> system is fully idle with the advantage of not waking up a fully idle
>>>> system. We have to wait for the next tick or newly idle event for
>>>> updating
>>>> blocked load when the system leaves idle stat which can be up to a
>>>> tick long.
>>>> If this is too long, we can check for kicking ilb when task wakes up
>>>> so the
>>>> blocked load will be updated as soon as the system leaves idle state.
>>>> The main advantage is that we don't wake up a fully idle system every
>>>> 32ms to
>>>> update blocked load that will be not used.
>>>> - I'm working on one more improvement to use nohz_idle_balance in the
>>>> newly
>>>> idle case when the system is not overloaded and
>>>> (this_rq->avg_idle > sysctl_sched_migration_cost). In this case, we
>>>> can try to
>>>> use nohz_idle_balance with NOHZ_STATS_KICK and abort as soon as it
>>>> exceed
>>>> this_rq->avg_idle. This will remove some calls to kick_ilb and some
>>>> wake up
>>>> of an idle cpus.
>>>
>>>
>>> This sound like what I meant in my other reply :-)
>>>
>>> It seems pointless to have a timer to update PELT if the system is
>>> completely idle, and when it isn't we can piggy back other events to
>>> make the updates happen.
>>
>>
>> The patch below implements what has been described above. It calls part of
>> nohz_idle_balance when a cpu becomes idle and kick a ilb if it takes too
>> much
>> time. This removes part of ilb that are kicked on an idle cpu for updating
>> the blocked load but the ratio really depends on when the tick happens
>> compared
>> to a cpu becoming idle and the 32ms boundary. I have an additionnal patch
>> that
>> enables to update the blocked loads when a cpu becomes idle 1 period
>> before
>> kicking an ilb and there is far less ilb because we give more chance to
>> the
>> newly idle case (time_after is replaced by time_after_eq in
>> idle_balance()).
>>
>> The patch also uses a function cfs_rq_has_blocked, which only checks the
>> util/load_avg, instead of the cfs_rq_is_decayed which check *_sum too.
>> This
>> reduce significantly the number of update of blocked load. the *_avg will
>> be
>> fully decayed in around 300~400ms but it's far longer for the *_sum which
>> have
>> a higher resolution and we can easily reach almost seconds. But only the
>> *_avg
>> are used to make decision so keeping some blocked *_sum is acceptable.
>>
>> ---
>> kernel/sched/fair.c | 121
>> +++++++++++++++++++++++++++++++++++++++-------------
>> 1 file changed, 92 insertions(+), 29 deletions(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 898785d..ed90303 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -7356,6 +7356,17 @@ static inline bool cfs_rq_is_decayed(struct cfs_rq
>> *cfs_rq)
>> return true;
>> }
>> +static inline bool cfs_rq_has_blocked(struct cfs_rq *cfs_rq)
>> +{
>> + if (cfs_rq->avg.load_avg)
>> + return true;
>> +
>> + if (cfs_rq->avg.util_avg)
>> + return true;
>> +
>> + return false;
>> +}
>> +
>
>
> Can we not change cfs_rq_is_decayed() to use avg.foo_avg instead of
> avg.foo_sum ?
I don't think so because the *_sum are used to keep coherency bewteen
cfs_rq and cgroups when task migrates and are enqueued/dequeued so wwe
can't remove it until the *_sum are null otherwise the all cfs_rq and
group will be out of sync
>
>> #ifdef CONFIG_FAIR_GROUP_SCHED
>> static void update_blocked_averages(int cpu)
>> @@ -7393,7 +7404,9 @@ static void update_blocked_averages(int cpu)
>> */
>> if (cfs_rq_is_decayed(cfs_rq))
>> list_del_leaf_cfs_rq(cfs_rq);
>> - else
>> +
>> + /* Don't need periodic decay once load/util_avg are null
>> */
>> + if (cfs_rq_has_blocked(cfs_rq))
>> done = false;
>> }
>> @@ -7463,7 +7476,7 @@ static inline void update_blocked_averages(int
>> cpu)
>> update_cfs_rq_load_avg(cfs_rq_clock_task(cfs_rq), cfs_rq);
>> #ifdef CONFIG_NO_HZ_COMMON
>> rq->last_blocked_load_update_tick = jiffies;
>> - if (cfs_rq_is_decayed(cfs_rq))
>> + if (cfs_rq_has_blocked(cfs_rq))
>
>
> Schouldn't this be !cfs_rq_has_blocked(cfs_rq) ?
yes. I copy/pasted too quickly from sched_group_fair to not sched_group_fair
>
>> rq->has_blocked_load = 0;
>> #endif
>> rq_unlock_irqrestore(rq, &rf);
>
>
> [...]
>
>
>> @@ -9438,7 +9451,17 @@ static bool nohz_idle_balance(struct rq *this_rq,
>> enum cpu_idle_type idle)
>> */
>> if (need_resched()) {
>> has_blocked_load = true;
>> - break;
>> + goto abort;
>> + }
>> +
>> + /*
>> + * If the update is done while CPU becomes idle, we abort
>> + * the update when its cost is higher than the average
>> idle
>> + * time in orde to not delay a possible wake up.
>> + */
>> + if (idle == CPU_NEWLY_IDLE && this_rq->avg_idle <
>> curr_cost) {
>> + has_blocked_load = true;
>> + goto abort;
>> }
>> rq = cpu_rq(balance_cpu);
>> @@ -9453,10 +9476,10 @@ static bool nohz_idle_balance(struct rq *this_rq,
>> enum cpu_idle_type idle)
>> if (time_after_eq(jiffies, rq->next_balance)) {
>> struct rq_flags rf;
>> - rq_lock_irq(rq, &rf);
>> + rq_lock_irqsave(rq, &rf);
>> update_rq_clock(rq);
>> cpu_load_update_idle(rq);
>> - rq_unlock_irq(rq, &rf);
>> + rq_unlock_irqrestore(rq, &rf);
>> if (flags & NOHZ_BALANCE_KICK)
>> rebalance_domains(rq, CPU_IDLE);
>> @@ -9466,10 +9489,17 @@ static bool nohz_idle_balance(struct rq *this_rq,
>> enum cpu_idle_type idle)
>> next_balance = rq->next_balance;
>> update_next_balance = 1;
>> }
>
>
> Why do you do this cpu_load_update_idle(rq) even this was called with
> CPU_NEWLY_IDLE? Wouldn't it be sufficient to jump to the curr_cost
> calculation in this case?
just to keep thing similar to what happen with kick_ilb and that's an
occasion to also update the cpu_load
>
>> +
>> + domain_cost = sched_clock_cpu(this_cpu) - t0;
>> + curr_cost += domain_cost;
>> +
>> }
>> - update_blocked_averages(this_cpu);
>> - has_blocked_load |= this_rq->has_blocked_load;
>> + /* Newly idle CPU doesn't need an update */
>> + if (idle != CPU_NEWLY_IDLE) {
>> + update_blocked_averages(this_cpu);
>> + has_blocked_load |= this_rq->has_blocked_load;
>> + }
>> if (flags & NOHZ_BALANCE_KICK)
>> rebalance_domains(this_rq, CPU_IDLE);
>
>
> [...]