Re: [PATCH v2] sched/fair: update scale invariance of PELT
From: Peter Zijlstra
Date: Mon Apr 10 2017 - 13:38:22 EST
Thanks for the rebase.
On Mon, Apr 10, 2017 at 11:18:29AM +0200, Vincent Guittot wrote:
Ok, so let me try and paraphrase what this patch does.
So consider a task that runs 16 out of our 32ms window:
 running    idle
|---------|---------|
You're saying that when we scale the running time with the frequency
(suppose we were at 50% freq), we'll end up with:
 run  idle
|----|---------|
Which is obviously a shorter total than before; so what you do is add
back the lost idle time like:
 run  lost idle
|----|----|---------|
to arrive at the same total time. Which seems to make sense.
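(Just to pin down the numbers in that picture, assuming a straight linear
scaling with frequency: 16ms of running at 50% freq amounts to 16 * 0.5 =
8ms of effective computation, so 16 - 8 = 8ms gets booked as lost idle
time, and 8 + 8 + 16 = 32ms adds back up to the full window.)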
Now I have vague memories of Morten having issues with your previous
patches, so I'll wait for him to chime in as well.
On to the implementation:
> /*
> + * Scale the time to reflect the effective amount of computation done during
> + * this delta time.
> + */
> +static __always_inline u64
> +scale_time(u64 delta, int cpu, struct sched_avg *sa,
> +	   unsigned long weight, int running)
> +{
> +	if (running) {
> +		sa->stolen_idle_time += delta;
> +		/*
> +		 * scale the elapsed time to reflect the real amount of
> +		 * computation
> +		 */
> +		delta = cap_scale(delta, arch_scale_freq_capacity(NULL, cpu));
> +		delta = cap_scale(delta, arch_scale_cpu_capacity(NULL, cpu));
> +
> +		/*
> +		 * Track the amount of stolen idle time due to running at
> +		 * lower capacity
> +		 */
> +		sa->stolen_idle_time -= delta;
OK, so far so good: this tracks, in stolen_idle_time, the 'lost' bit from
above.
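To check my reading of that branch against the numbers in the picture,
here's a quick standalone userspace sketch (not the kernel code; I'm
assuming full cpu capacity, SCHED_CAPACITY_SCALE == 1024, and open-coding
cap_scale() as v * s >> SCHED_CAPACITY_SHIFT):

#include <stdio.h>

#define SCHED_CAPACITY_SHIFT	10
#define SCHED_CAPACITY_SCALE	(1UL << SCHED_CAPACITY_SHIFT)
#define cap_scale(v, s)		((v) * (s) >> SCHED_CAPACITY_SHIFT)

int main(void)
{
	unsigned long long delta = 16;			/* ms of wall-clock running */
	unsigned long long stolen_idle_time = 0;
	unsigned long freq = SCHED_CAPACITY_SCALE / 2;	/* running at 50% freq */
	unsigned long cap = SCHED_CAPACITY_SCALE;	/* full cpu capacity */

	stolen_idle_time += delta;		/* 16 */
	delta = cap_scale(delta, freq);		/* 16 * 512 >> 10 = 8 */
	delta = cap_scale(delta, cap);		/* 8 * 1024 >> 10 = 8 */
	stolen_idle_time -= delta;		/* 16 - 8 = 8, the 'lost' bit */

	printf("scaled delta=%llums, stolen=%llums\n", delta, stolen_idle_time);
	return 0;
}

which gives 8ms of scaled delta and 8ms of stolen idle time, matching the
picture above.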
> +	} else if (!weight) {
> +		if (sa->util_sum < (LOAD_AVG_MAX * 1000)) {
But here I'm completely lost. WTF just happened ;-)
Firstly, I think we want a comment on why we care about the !weight
case. Why isn't !running sufficient?
Secondly, what's up with the util_sum < LOAD_AVG_MAX * 1000 thing?
Is that to deal with cpu_capacity?
> +			/*
> +			 * Add the idle time stolen by running at lower compute
> +			 * capacity
> +			 */
> +			delta += sa->stolen_idle_time;
> +		}
> +		sa->stolen_idle_time = 0;
> +	}
> +
> +	return delta;
> +}
Thirdly, I'm thinking this isn't quite right. Imagine a task that's
running across a decay window; then we'll only add back the
stolen_idle_time in the next window, even though it should've been
accounted in this one, right?
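That is, if I read it right, something like this (same 50% freq example as
above, decay-window boundaries marked with '|'):

     window n            window n+1
    |-------------------|-------------------|
     <------ running ------->|<--- idle --->
                             ^
             the stolen idle time accrued during window n
             only gets added back at this point

so the sum for window n only ever sees the scaled running time, and the
stolen idle time accrued there ends up being accumulated (and decayed) as
part of window n+1 instead.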