Re: [PATCH] sched/fair: update scale invariance of PELT

From: Vincent Guittot
Date: Wed Mar 29 2017 - 04:06:04 EST


On 28 March 2017 at 17:35, Vincent Guittot <vincent.guittot@xxxxxxxxxx> wrote:
> The current implementation of load tracking invariance scales the contribution
> with current frequency and uarch performance (only for utilization) of the
> CPU. One main result of this formula is that the figures are capped by current
> capacity of CPU. Another one is that the load_avg is not invariant because not
> scaled with uarch.
>
> The util_avg of a periodic task that runs r time slots every p time slots
> varies in the range :
>
> U * (1-y^r)/(1-y^p) * y^i < Utilization < U * (1-y^r)/(1-y^p)
>
> with U is the max util_avg value = SCHED_CAPACITY_SCALE
>
> At a lower capacity, the range becomes:
>
> U * C * (1-y^r')/(1-y^p) * y^i' < Utilization < U * C * (1-y^r')/(1-y^p)
>
> with C reflecting the compute capacity ratio between current capacity and
> max capacity.
>
> so C tries to compensate changes in (1-y^r') but it can't be accurate.
>
> Instead of scaling the contribution value of PELT algo, we should scale the
> running time. The PELT signal aims to track the amount of computation of tasks
> and/or rq so it seems more correct to scale the running time to reflect the
> effective amount of computation done since the last update.
>
> In order to be fully invariant, we need to apply the same amount of running
> time and idle time whatever the current capacity. Because running at lower
> capacity implies that the task will run longer, we have to track the amount of
> "stolen" idle time and to apply it when task becomes idle.
>
> But once we have reached the maximum utilization value (SCHED_CAPACITY_SCALE),
> it means that the task is seen as an always-running task whatever the capacity of
> the cpu (even at max compute capacity). In this case, we can discard the "stolen"
> idle times which becomes meaningless. In order to cope with rounding effect of
> PELT algo we take a margin and consider task with utilization greater than 1000
> (vs 1024 max) as an always-running task.
>
> Then, we can use the same algorithm for both utilization and load and
> simplify __update_load_avg now that the load of a task doesn't have to be
> capped by CPU uarch.
>
> The responsivness of PELT is improved when CPU is not running at max
> capacity with this new algorithm. I have put below some examples of
> duration to reach some typical load values according to the capacity of the
> CPU with current implementation and with this patch.
>
> Util (%) max capacity half capacity(mainline) half capacity(w/ patch)
> 972 (95%) 138ms not reachable 276ms
> 486 (47.5%) 30ms 138ms 60ms
> 256 (25%) 13ms 32ms 26ms
>
> On my hikey (octo ARM platform) with schedutil governor, the time to reach
> max OPP when starting from a null utilization, decreases from 223ms with
> current scale invariance down to 121ms with the new algorithm. For this
> test, i have enable arch_scale_freq for arm64.
>
> Signed-off-by: Vincent Guittot <vincent.guittot@xxxxxxxxxx>
> ---

I just noticed that i forgot to mentioned that this patch was a v2 and
the discussion
about v1 can be found here:
https://lkml.org/lkml/2015/11/24/432

Change since v1:
* track stolen idle time to make PELT fully invariant

Vincent

> include/linux/sched.h | 1 +
> kernel/sched/fair.c | 49 ++++++++++++++++++++++++++++++++++---------------
> 2 files changed, 35 insertions(+), 15 deletions(-)
>
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index d67eee8..ca9d00f 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -313,6 +313,7 @@ struct load_weight {
> */
> struct sched_avg {
> u64 last_update_time;
> + u64 stolen_idle_time;
> u64 load_sum;
> u32 util_sum;
> u32 period_contrib;
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 31453d5..d1514cb 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -735,6 +735,7 @@ void init_entity_runnable_average(struct sched_entity *se)
> struct sched_avg *sa = &se->avg;
>
> sa->last_update_time = 0;
> + sa->stolen_idle_time = 0;
> /*
> * sched_avg's period_contrib should be strictly less then 1024, so
> * we give it 1023 to make sure it is almost a period (1024us), and
> @@ -2852,10 +2853,9 @@ static __always_inline int
> __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
> unsigned long weight, int running, struct cfs_rq *cfs_rq)
> {
> - u64 delta, scaled_delta, periods;
> + u64 delta, periods;
> u32 contrib;
> - unsigned int delta_w, scaled_delta_w, decayed = 0;
> - unsigned long scale_freq, scale_cpu;
> + unsigned int delta_w, decayed = 0;
>
> delta = now - sa->last_update_time;
> /*
> @@ -2876,8 +2876,30 @@ __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
> return 0;
> sa->last_update_time = now;
>
> - scale_freq = arch_scale_freq_capacity(NULL, cpu);
> - scale_cpu = arch_scale_cpu_capacity(NULL, cpu);
> + if (running) {
> + sa->stolen_idle_time += delta;
> + /*
> + * scale the elapsed time to reflect the real amount of
> + * computation
> + */
> + delta = cap_scale(delta, arch_scale_freq_capacity(NULL, cpu));
> + delta = cap_scale(delta, arch_scale_cpu_capacity(NULL, cpu));
> +
> + /*
> + * Track the amount of stolen idle time due to running at
> + * lower capacity
> + */
> + sa->stolen_idle_time -= delta;
> + } else if (!weight) {
> + if (sa->util_sum < (LOAD_AVG_MAX * 1000)) {
> + /*
> + * Add the idle time stolen by running at lower compute
> + * capacity
> + */
> + delta += sa->stolen_idle_time;
> + }
> + sa->stolen_idle_time = 0;
> + }
>
> /* delta_w is the amount already accumulated against our next period */
> delta_w = sa->period_contrib;
> @@ -2893,16 +2915,15 @@ __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
> * period and accrue it.
> */
> delta_w = 1024 - delta_w;
> - scaled_delta_w = cap_scale(delta_w, scale_freq);
> if (weight) {
> - sa->load_sum += weight * scaled_delta_w;
> + sa->load_sum += weight * delta_w;
> if (cfs_rq) {
> cfs_rq->runnable_load_sum +=
> - weight * scaled_delta_w;
> + weight * delta_w;
> }
> }
> if (running)
> - sa->util_sum += scaled_delta_w * scale_cpu;
> + sa->util_sum += delta_w << SCHED_CAPACITY_SHIFT;
>
> delta -= delta_w;
>
> @@ -2919,25 +2940,23 @@ __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
>
> /* Efficiently calculate \sum (1..n_period) 1024*y^i */
> contrib = __compute_runnable_contrib(periods);
> - contrib = cap_scale(contrib, scale_freq);
> if (weight) {
> sa->load_sum += weight * contrib;
> if (cfs_rq)
> cfs_rq->runnable_load_sum += weight * contrib;
> }
> if (running)
> - sa->util_sum += contrib * scale_cpu;
> + sa->util_sum += contrib << SCHED_CAPACITY_SHIFT;
> }
>
> /* Remainder of delta accrued against u_0` */
> - scaled_delta = cap_scale(delta, scale_freq);
> if (weight) {
> - sa->load_sum += weight * scaled_delta;
> + sa->load_sum += weight * delta;
> if (cfs_rq)
> - cfs_rq->runnable_load_sum += weight * scaled_delta;
> + cfs_rq->runnable_load_sum += weight * delta;
> }
> if (running)
> - sa->util_sum += scaled_delta * scale_cpu;
> + sa->util_sum += delta << SCHED_CAPACITY_SHIFT;
>
> sa->period_contrib += delta;
>
> --
> 2.7.4
>