Re: [PATCH] sched/fair: update scale invariance of pelt

From: Morten Rasmussen
Date: Tue Dec 08 2015 - 12:04:16 EST


On Tue, Nov 24, 2015 at 02:49:30PM +0100, Vincent Guittot wrote:
> The current implementation of load tracking invariance scales the load
> tracking value with current frequency and uarch performance (only for
> utilization) of the CPU.
>
> One main result of the current formula is that the figures are capped by
> the current capacity of the CPU. This limitation is the main reason of not
> including the uarch invariance (arch_scale_cpu_capacity) in the calculation
> of load_avg because capping the load can generate erroneous system load
> statistic as described with this example [1]

The reason why we don't want to scale load_avg with regard to uarch
capacity (as we do with util_avg) is explained in
e3279a2e6d697e00e74f905851ee7cf532f72b2d as well.

> Instead of scaling the complete value of PELT algo, we should only scale
> the running time by the current capacity of the CPU. It seems more correct
> to only scale the running time because the non running time of a task
> (sleeping or waiting for a runqueue) is the same whatever the current freq
> and the compute capacity of the CPU.

You seem to imply that we currently scale running, waiting, and sleeping
time. That is not the case. We scale running and waiting time, but not
sleeping time. Whether we should scale waiting time or not is a good
question. The waiting time is affected by the running time of the other
tasks on the cfs_rq, so on one hand it seems a bit inconsistent to scale
one and not the other. On the other hand, not scaling waiting time would
make tasks that spend a lot of time waiting appear bigger, which could
an advantage as it would make load-balancing more prone to spread tasks.
A third alternative is to drop the scaling of load_avg completely, but
it is still needed for util_avg as we want util_avg to be invariant to
frequency and uarch scaling.

> Then, one main advantage of this change is that the load of a task can
> reach max value whatever the current freq and the uarch of the CPU on which
> it run. It will just take more time at a lower freq than a max freq or on a
> "little" CPU compared to a "big" one. The load and the utilization stay
> invariant across system so we can still compared them between CPU but with
> a wider range of values.

Just removing scaling of waiting time and applying scaling by current
capacity (including uarch) to the running time will not make load_avg
reach the max value for tasks running alone on a cpu. Since the task
isn't waiting at all (it is alone) all contributions are running time
which is scaled, IIUC, and hence the result is still capped by the
current capacity of the cpu. But that doesn't match your example results
further down if I read them correctly.

The changes made in the code of this patch are quite subtle, but very
important as they change the behaviour of the PELT geometric series
quite a lot. It is much more than just changing whether we scale waiting
time and apply uarch scaling to running time of load_avg or not. I
think we need to understand the math behind this patch to understand how
the PELT metrics are affected because I think this patch changes some of
the fundamentals originally described by Paul and Ben.

Instead of scaling the contribution of each 1024us segment like we
currently do, this patch is essentially warping time and lumps it
together and let it contribute fully but skips decays. It is rather hard
to explain, but the result is that the patch affects both load_avg and
util_avg, and it breaks scale-invariance.

Executive summary: Scaling time (delta) instead of the individual
segment contributions breaks scale-invariance. The net result on
load_avg seems to be nothing apart from slower reaction time.

That is how I see after having tested it a bit. But I could be getting
it all wrong. :-/


Much more detail:

Original geometric series:

\sum (0..n) u_n * y^n

Current geometric series with scale invariance:

\sum (0..n) u_n * c_n * y^n

In reality we only approximate having the capacity scaling for each
segment as don't enforce PELT updates for each capacity change due to
frequency scaling.

In this patch scaling is applied to the entire delta since last update
instead of each individual segment. That gives us a very interesting
time warping effect when updates happen less frequently than every 1ms.
On cpus with reduced capacity the delta is reduced and all the math is
done as if less time had passed since last update which introduces an
error with regard to the decay of the series as we segments of time with
zero contribution.

It is probably easier described with an example:

We have one periodic task with a period of 4ms. Busy time per activation
is 1ms at 100% capacity. The task has been running forever (>350ms) and
we consider the load_avg calculations at enqueue/dequeue, which is
should the most common update points for this scenario besides the tick
updates.

task states
s = sleeping
R = running (scheduled)

pelt
d = decay segment (load_avg * y, y^32 = 0.5)
[0..1024] = segment contribution (including any scaling)
U = __update_load_avg() is called

f = 100%
| 1024us | 1024us | 1024us | 1024us | 1024us | 1024us |
task | s | R | s | s | s | R |
pelt ml | d U 1024 U d | d | d U 1024 U
patch | d U 1024 U d | d | d U 1024 U

f = 33%
| 1024us | 1024us | 1024us | 1024us | 1024us | 1024us |
task | s | R | R | R | s | R |
pelt ml | d U 341y^2 | 341y | 341 U d U 341y^2 |
patch | d U 1024 | 0 | 0 U d U 1024 |

In the first case, f = 100%, the update after the busy period is
complete we decay load_avg by one period (segment) and add a
contribution of 1024. We are at 100% so it is a full contribution for
this segment both with and without this patch. The task enqueue update
accounts for the sleeping time by decaying load_avg three periods. The
same in both cases. We could say that the contributions of a full cycle
of the the task is:

f_100% cycle = 1024 + decay(4)

If we reduce the capacity to 33%, things look a bit different. In
mainline, the dequeque update after the busy period would decay three
periods and add \sum (i = 2..0) 0.33*1024*y^i to account for the three
busy segments. The enqueue update decays the load_avg by one segment.
The full cycle contribution becomes:

Mainline:
f_33% cycle = 341*y^2 + 341*y + 341 + decay(4)

With this patch it is different. At the dequeue update we scale the time
delta instead of the contribution, such that delta = 0.33*delta, so the
calculation is based on only one period (segment) has passed. Hence we
decay by one segment and add 1024, but still set the update point to the
true timestamp so the following update doesn't take the two remaining
segments into account. The enqueue update decays the load_avg by one
segment, just like it does in mainline. The full cycle contribution
becomes:

Patch:
f_33% cycle = 1024 + decay(2)

This is clearly different from mainline. Not only is the busy
contribution higher, 1024 > 341*y^2 + 341*y + 341, since y < 1, but we
also decay less. The result is an inflation of the load_avg and util_avg
metrics for tasks that run for more than 1ms at the time if
__update_load_avg() isn't called every 1ms.

I did a quick test to confirm this using a single periodic task and
changing the compute capacity.

util_avg
capacity mainline patch
1024 ~359 ~352
512 ~340 ~534

Execution time went from 1.4ms to 2.8ms per activation without
overloading the cpu.

The fundamental idea in scale invariance is that util_avg should be
comparable between cpu at any capacity as long none of them are
over-utilized. This isn't preserved by the patch in its current form.

> With this change, we don't have to test if a CPU is overloaded or not in
> order to use one metric (util) or another (load) as all metrics are always
> valid.

I'm not sure what you mean by always valid. util_avg is still not a
meaningful metric for tasks running on over-utilized cpus, so it can not
be used unconditionally. If util_avg > capacity we still have no clue if
the task can fit on a different cpu with higher capacity.

> I have put below some examples of duration to reach some typical load value
> according to the capacity of the CPU with current implementation
> and with this patch.
>
> Util (%) max capacity half capacity(mainline) half capacity(w/ patch)
> 972 (95%) 138ms not reachable 276ms
> 486 (47.5%) 30ms 138ms 60ms
> 256 (25%) 13ms 32ms 26ms

I assume that these are numbers for util_avg and not load_avg as said in
the text above. It confuses me a little bit as you started out by
talking about the lack of uarch scaling of load_avg and propose to
change that, not util_avg.

The equivalent table for load_avg would something like this:

load_avg (%) max capacity half capacity(mainline) half capacity(w/ patch)
972 (95%) 138ms 138ms 276ms
486 (47.5%) 30ms 30ms 60ms
256 (25%) 13ms 13ms 26ms

load_avg does reach max capacity as it is. The patch just makes it
happen at a slower pace, which I'm not sure is a good or bad thing.

> We can see that at half capacity, we need twice the duration of max
> capacity with this patch whereas we have a non linear increase of the
> duration with current implementation.

Is it a problem that the time to reach a certain value is not linear?

It is still somewhat unclear to me why we would want this change. Adding
uarch scaling to load_avg but then modify the geometric series so the
end result is the same except that it now reacts slower at lower
capacities seems a bit strange.

>
> [1] https://lkml.org/lkml/2014/12/18/128
>
> Signed-off-by: Vincent Guittot <vincent.guittot@xxxxxxxxxx>
> ---
> kernel/sched/fair.c | 28 +++++++++++++---------------
> 1 file changed, 13 insertions(+), 15 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 824aa9f..f2a18e1 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -2560,10 +2560,9 @@ static __always_inline int
> __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
> unsigned long weight, int running, struct cfs_rq *cfs_rq)
> {
> - u64 delta, scaled_delta, periods;
> + u64 delta, periods;
> u32 contrib;
> - unsigned int delta_w, scaled_delta_w, decayed = 0;
> - unsigned long scale_freq, scale_cpu;
> + unsigned int delta_w, decayed = 0;
>
> delta = now - sa->last_update_time;
> /*
> @@ -2584,8 +2583,10 @@ __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
> return 0;
> sa->last_update_time = now;
>
> - scale_freq = arch_scale_freq_capacity(NULL, cpu);
> - scale_cpu = arch_scale_cpu_capacity(NULL, cpu);
> + if (running) {
> + delta = cap_scale(delta, arch_scale_freq_capacity(NULL, cpu));
> + delta = cap_scale(delta, arch_scale_cpu_capacity(NULL, cpu));

This is where the time warping happens. delta is used to determine the
number of periods (segments) since last update. Scaling this, as opposed
to the contributions for each segment individually, can lead to
disappearing segments.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/