Re: [PATCH v3] sched/fair: update scale invariance of PELT

From: Peter Zijlstra
Date: Fri May 18 2018 - 04:40:30 EST



Replying to the latest version available; given the current interest I
figured I'd re-read some of the old threads and look at this stuff again.

On Fri, Apr 28, 2017 at 04:23:55PM +0200, Vincent Guittot wrote:

> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 0978fb7..f8dde36 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -313,6 +313,7 @@ struct load_weight {
> */
> struct sched_avg {
> u64 last_update_time;
> + u64 stolen_idle_time;
> u64 load_sum;
> u32 util_sum;
> u32 period_contrib;

Right, so sadly Patrick stole that space with the util_est bits.

Also, given the comment here:

https://marc.info/?l=linux-kernel&m=149373232422941&w=2

this should be a u32, right? That might make it slightly easier to find
a hole for.

> /*
> + * Scale the time to reflect the effective amount of computation done during
> + * this delta time.

I would much appreciate a more extended comment here. One that includes
pictures of the moving window edges, as in:

https://marc.info/?l=linux-kernel&m=149200866116792&w=2
https://marc.info/?l=linux-kernel&m=149201190517985&w=2

> + */
> +static __always_inline u64
> +scale_time(u64 delta, int cpu, struct sched_avg *sa,
> + unsigned long weight, int running)
> +{
> + if (running) {
> + /*
> + * When an entity runs at a lower compute capacity, it will
> + * need more time to do the same amount of work than at max
> + * capacity. In order to be invariant, we scale the delta to
> + * reflect how much work has been really done.
> + * Running at lower capacity also means running longer to do
> + * the same amount of work and this results in stealing some
> + * idle time that will disturb the load signal compared to
> + * max capacity; We also track this amount of stolen time to
> + * reflect it when the entity will go back to sleep.
> + *
> + * stolen time = (current run time) - (effective time at max
> + * capacity)
> + */
> + sa->stolen_idle_time += delta;
> +
> + /*
> + * scale the elapsed time to reflect the real amount of
> + * computation
> + */
> + delta = cap_scale(delta, arch_scale_freq_capacity(NULL, cpu));
> + delta = cap_scale(delta, arch_scale_cpu_capacity(NULL, cpu));
> +
> + /*
> + * Track the amount of stolen idle time due to running at
> + * lower capacity
> + */
> + sa->stolen_idle_time -= delta;
> + } else if (!weight) {
> + /*
> + * Entity is sleeping so both utilization and load will decay
> + * and we can safely add the stolen time. Reflecting some
> + * stolen time makes sense only if this idle phase would be
> + * present at max capacity. As soon as the utilization of an
> + * entity has reached the maximum value, it is considered as
> + * an always running entity without idle time to steal.
> + */
> + if (sa->util_avg < (SCHED_CAPACITY_SCALE - 1)) {
> + /*
> + * Add the idle time stolen by running at lower compute
> + * capacity
> + */
> + delta += sa->stolen_idle_time;
> + }
> + sa->stolen_idle_time = 0;
> + }

What happened to the proposed changes here:

https://marc.info/?l=linux-kernel&m=149383148721909&w=2

to deal with the load scaling issues?

> +
> + return delta;
> +}
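
For reference while reading the quoted hunk, here is a minimal userspace
sketch of the stolen idle time bookkeeping. It is not the kernel code:
struct sketch_avg and sketch_scale_time() are hypothetical names, and the
frequency and CPU capacities are passed in as plain parameters instead of
being read through arch_scale_freq_capacity() and arch_scale_cpu_capacity().
cap_scale() is assumed to be the usual multiply-and-shift by
SCHED_CAPACITY_SHIFT.

```c
#include <stdint.h>

#define SCHED_CAPACITY_SHIFT	10
#define SCHED_CAPACITY_SCALE	(1UL << SCHED_CAPACITY_SHIFT)

/* Assumed shape of cap_scale(): v * s / 1024. */
static inline uint64_t cap_scale(uint64_t v, unsigned long s)
{
	return (v * s) >> SCHED_CAPACITY_SHIFT;
}

/* Hypothetical stand-in for struct sched_avg; only the fields used here. */
struct sketch_avg {
	uint64_t stolen_idle_time;
	unsigned long util_avg;
};

/*
 * Hypothetical userspace version of the quoted scale_time().
 * freq_cap and cpu_cap are in the range [0, SCHED_CAPACITY_SCALE].
 */
static uint64_t sketch_scale_time(uint64_t delta, struct sketch_avg *sa,
				  unsigned long freq_cap, unsigned long cpu_cap,
				  unsigned long weight, int running)
{
	if (running) {
		/* Time as it would have been at max capacity. */
		uint64_t scaled = cap_scale(delta, freq_cap);

		scaled = cap_scale(scaled, cpu_cap);

		/* stolen = (current run time) - (time at max capacity) */
		sa->stolen_idle_time += delta - scaled;
		return scaled;
	}

	if (!weight) {
		/*
		 * Sleeping: give back the stolen idle time, unless the
		 * entity already looks always-running.
		 */
		if (sa->util_avg < SCHED_CAPACITY_SCALE - 1)
			delta += sa->stolen_idle_time;
		sa->stolen_idle_time = 0;
	}

	return delta;
}
```

With freq_cap = 512 (half of max frequency) and cpu_cap = 1024, a running
delta of 1000 is accounted as 500 and the other 500 is recorded as stolen;
a following sleep of 200 is then accounted as 700. If util_avg has already
reached SCHED_CAPACITY_SCALE - 1, the stolen time is dropped instead of
being added.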