Re: [PATCH 3/6] sched/fair: Make utilization tracking cpu scale-invariant
From: Vincent Guittot
Date: Fri Sep 04 2015 - 03:53:15 EST
On 15 August 2015 at 01:04, Dietmar Eggemann <dietmar.eggemann@xxxxxxx> wrote:
> On 14/08/15 17:23, Morten Rasmussen wrote:
>> From: Dietmar Eggemann <dietmar.eggemann@xxxxxxx>
>
> [...]
>
>> @@ -2596,7 +2597,7 @@ __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
>> }
>> }
>> if (running)
>> - sa->util_sum += scaled_delta_w;
>> + sa->util_sum = scale(scaled_delta_w, scale_cpu);
>
>
> There is a small issue (using = instead of +=) with fatal consequences
> for the utilization signal.
>
> -- >8 --
>
> Subject: [PATCH] sched/fair: Make utilization tracking cpu scale-invariant
>
> Besides the existing frequency scale-invariance correction factor, apply
> cpu scale-invariance correction factor to utilization tracking to
> compensate for any differences in compute capacity. This could be due to
> micro-architectural differences (i.e. instructions per seconds) between
> cpus in HMP systems (e.g. big.LITTLE), and/or differences in the current
> maximum frequency supported by individual cpus in SMP systems. In the
> existing implementation utilization isn't comparable between cpus as it
> is relative to the capacity of each individual cpu.
>
> Each segment of the sched_avg.util_sum geometric series is now scaled
> by the cpu performance factor too so the sched_avg.util_avg of each
> sched entity will be invariant from the particular cpu of the HMP/SMP
> system on which the sched entity is scheduled.
>
> With this patch, the utilization of a cpu stays relative to the max cpu
> performance of the fastest cpu in the system.
>
> In contrast to utilization (sched_avg.util_sum), load
> (sched_avg.load_sum) should not be scaled by compute capacity. The
> utilization metric is based on running time which only makes sense when
> cpus are _not_ fully utilized (utilization cannot go beyond 100% even if
> more tasks are added), where load is runnable time which isn't limited
> by the capacity of the cpu and therefore is a better metric for
> overloaded scenarios. If we run two nice-0 busy loops on two cpus with
> different compute capacity their load should be similar since their
> compute demands are the same. We have to assume that the compute demand
> of any task running on a fully utilized cpu (no spare cycles = 100%
> utilization) is high and the same no matter of the compute capacity of
> its current cpu, hence we shouldn't scale load by cpu capacity.
>
> Cc: Ingo Molnar <mingo@xxxxxxxxxx>
> Cc: Peter Zijlstra <peterz@xxxxxxxxxxxxx>
> Signed-off-by: Dietmar Eggemann <dietmar.eggemann@xxxxxxx>
> Signed-off-by: Morten Rasmussen <morten.rasmussen@xxxxxxx>
> ---
> include/linux/sched.h | 2 +-
> kernel/sched/fair.c | 7 ++++---
> kernel/sched/sched.h | 2 +-
> 3 files changed, 6 insertions(+), 5 deletions(-)
>
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index a15305117ace..78a93d716fcb 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1180,7 +1180,7 @@ struct load_weight {
> * 1) load_avg factors frequency scaling into the amount of time that a
> * sched_entity is runnable on a rq into its weight. For cfs_rq, it is the
> * aggregated such weights of all runnable and blocked sched_entities.
> - * 2) util_avg factors frequency scaling into the amount of time
> + * 2) util_avg factors frequency and cpu scaling into the amount of time
> * that a sched_entity is running on a CPU, in the range [0..SCHED_LOAD_SCALE].
> * For cfs_rq, it is the aggregated such times of all runnable and
> * blocked sched_entities.
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index c72223a299a8..3321eb13e422 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -2553,6 +2553,7 @@ __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
> u32 contrib;
> int delta_w, scaled_delta_w, decayed = 0;
> unsigned long scale_freq = arch_scale_freq_capacity(NULL, cpu);
> + unsigned long scale_cpu = arch_scale_cpu_capacity(NULL, cpu);
>
> delta = now - sa->last_update_time;
> /*
> @@ -2596,7 +2597,7 @@ __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
> }
> }
> if (running)
> - sa->util_sum += scaled_delta_w;
> + sa->util_sum += scale(scaled_delta_w, scale_cpu);
>
> delta -= delta_w;
>
> @@ -2620,7 +2621,7 @@ __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
> cfs_rq->runnable_load_sum += weight * contrib;
> }
> if (running)
> - sa->util_sum += contrib;
> + sa->util_sum += scale(contrib, scale_cpu);
> }
>
> /* Remainder of delta accrued against u_0` */
> @@ -2631,7 +2632,7 @@ __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
> cfs_rq->runnable_load_sum += weight * scaled_delta;
> }
> if (running)
> - sa->util_sum += scaled_delta;
> + sa->util_sum += scale(scaled_delta, scale_cpu);
>
> sa->period_contrib += delta;
>
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 7e6f2506a402..50836a9301f9 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -1406,7 +1406,7 @@ unsigned long arch_scale_freq_capacity(struct sched_domain *sd, int cpu)
> static __always_inline
> unsigned long arch_scale_cpu_capacity(struct sched_domain *sd, int cpu)
> {
> - if ((sd->flags & SD_SHARE_CPUCAPACITY) && (sd->span_weight > 1))
> + if (sd && (sd->flags & SD_SHARE_CPUCAPACITY) && (sd->span_weight > 1))
> return sd->smt_gain / sd->span_weight;
>
> return SCHED_CAPACITY_SCALE;
FWIW, you can add my Acked-by: Vincent Guittot
<vincent.guittot@xxxxxxxxxx> on this corrected version
> --
> 1.9.1
>
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/