Re: [PATCH] sched/fair: update scale invariance of pelt

From: Vincent Guittot
Date: Tue Dec 15 2015 - 05:19:27 EST


Hi Morten,

Thanks for the review, and sorry for the late reply.

On 8 December 2015 at 18:04, Morten Rasmussen <morten.rasmussen@xxxxxxx> wrote:
> On Tue, Nov 24, 2015 at 02:49:30PM +0100, Vincent Guittot wrote:
>> The current implementation of load tracking invariance scales the load
>> tracking value with current frequency and uarch performance (only for
>> utilization) of the CPU.
>>
>> One main result of the current formula is that the figures are capped by
>> the current capacity of the CPU. This limitation is the main reason of not
>> including the uarch invariance (arch_scale_cpu_capacity) in the calculation
>> of load_avg because capping the load can generate erroneous system load
>> statistic as described with this example [1]
>
> The reason why we don't want to scale load_avg with regard to uarch
> capacity (as we do with util_avg) is explained in
> e3279a2e6d697e00e74f905851ee7cf532f72b2d as well.
>
>> Instead of scaling the complete value of PELT algo, we should only scale
>> the running time by the current capacity of the CPU. It seems more correct
>> to only scale the running time because the non running time of a task
>> (sleeping or waiting for a runqueue) is the same whatever the current freq
>> and the compute capacity of the CPU.
>
> You seem to imply that we currently scale running, waiting, and sleeping
> time. That is not the case. We scale running and waiting time, but not
> sleeping time. Whether we should scale waiting time or not is a good

In fact, I was referring to the same equation as the one you use below

\sum (0..n) u_n * c_n * y^n

to say that the complete value of PELT is scaled, because we scale all
u_n, including the idle fractions.
Note that this doesn't change anything in the end, as u_n is zero during idle.


> question. The waiting time is affected by the running time of the other
> tasks on the cfs_rq, so on one hand it seems a bit inconsistent to scale
> one and not the other. On the other hand, not scaling waiting time would
> make tasks that spend a lot of time waiting appear bigger, which could be
> an advantage as it would make load-balancing more prone to spread tasks.
> A third alternative is to drop the scaling of load_avg completely, but

I don't think it's a good idea to limit the usage of load_avg to
systems that are overloaded and, conversely, to limit util_avg to
systems that are not overloaded. The boundary between the two states is
rarely clear, and often one part of the system is overloaded
while the other part is not.

> it is still needed for util_avg as we want util_avg to be invariant to
> frequency and uarch scaling.
>
>> Then, one main advantage of this change is that the load of a task can
>> reach max value whatever the current freq and the uarch of the CPU on which
>> it runs. It will just take more time at a lower freq than a max freq or on a
>> "little" CPU compared to a "big" one. The load and the utilization stay
>> invariant across the system so we can still compare them between CPUs but with
>> a wider range of values.
>
> Just removing scaling of waiting time and applying scaling by current
> capacity (including uarch) to the running time will not make load_avg
> reach the max value for tasks running alone on a cpu. Since the task
> isn't waiting at all (it is alone) all contributions are running time
> which is scaled, IIUC, and hence the result is still capped by the
> current capacity of the cpu. But that doesn't match your example results
> further down if I read them correctly.

In the current implementation, we scale the full contribution of each
fraction of time in the PELT equation so if the capacity of a CPU
can't be larger than Clocal_max because of frequency scaling and/or
uarch, we have
\sum (0..n) u_n * c_n * y^n <= Clocal_max * \sum (0..n) u_n * y^n

With the proposed way of taking the uarch and the current frequency
into account, we scale the time that elapses before accounting it into
a segment of the equation. In summary, the delta time is scaled to
reflect the amount of time that would have been used at the max
capacity of the system. So if the frequency is half the max freq, the
time that is accounted is half the time that really elapsed. In
parallel, the job takes twice as long, so the same amount of time is
accounted in the end.
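
As a rough illustration of this time scaling, here is a small user-space
sketch. It is not the kernel code: scale_delta() and accounted_time() are
hypothetical helpers that only mimic the effect of applying cap_scale() to
delta, assuming capacity is expressed on a 0..1024 scale and that the wall
time of a job stretches in proportion to 1024/capacity.

```c
#include <assert.h>
#include <stdint.h>

#define CAPACITY_SHIFT	10
#define CAPACITY_SCALE	(1u << CAPACITY_SHIFT)	/* 1024 == max capacity */

/* Hypothetical helper: scale an elapsed time by the current capacity. */
uint64_t scale_delta(uint64_t delta_us, unsigned int capacity)
{
	assert(capacity > 0 && capacity <= CAPACITY_SCALE);
	return (delta_us * capacity) >> CAPACITY_SHIFT;
}

/*
 * Hypothetical helper: time accounted for a job that needs
 * work_us_at_max microseconds at max capacity, when it actually
 * runs at `capacity`. The wall time is assumed to stretch by
 * 1024/capacity, and the accounted time is that wall time scaled back.
 */
uint64_t accounted_time(uint64_t work_us_at_max, unsigned int capacity)
{
	uint64_t wall_us;

	assert(capacity > 0 && capacity <= CAPACITY_SCALE);
	wall_us = work_us_at_max * CAPACITY_SCALE / capacity;
	return scale_delta(wall_us, capacity);
}
```

With these helpers, a 10ms job at half capacity (512) runs for 20ms of wall
time, and scale_delta(20000, 512) accounts 10000us, the same as at max
capacity.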

With this patch, the PELT equation stays \sum (0..n) u_n * y^n
whatever the uarch and the current frequency. The main benefit is
that we can reach the max value whatever the uarch and current freq. The
impact of the uarch and the current frequency is taken into account
before the equation, when we are accounting the time into a segment.

>
> The changes made in the code of this patch are quite subtle, but very
> important as they change the behaviour of the PELT geometric series
> quite a lot. It is much more than just changing whether we scale waiting
> time and apply uarch scaling to running time of load_avg or not. I
> think we need to understand the math behind this patch to understand how
> the PELT metrics are affected because I think this patch changes some of
> the fundamentals originally described by Paul and Ben.

As explained above, the PELT equation itself will no longer be
impacted by freq and uarch, as their impact is taken into account
outside of it.

>
> Instead of scaling the contribution of each 1024us segment like we
> currently do, this patch is essentially warping time and lumps it
> together and let it contribute fully but skips decays. It is rather hard
> to explain, but the result is that the patch affects both load_avg and
> util_avg, and it breaks scale-invariance.
>
> Executive summary: Scaling time (delta) instead of the individual
> segment contributions breaks scale-invariance. The net result on
> load_avg seems to be nothing apart from slower reaction time.
>
> That is how I see after having tested it a bit. But I could be getting
> it all wrong. :-/

For me, it's not slowing the reaction time but reflecting the real
behavior more accurately.
Let's take the example of a task whose computation takes 10ms at
max capacity.
At max capacity, the job will run for 10ms and the util_avg value will
be 199, as will the load_avg.
At half frequency, the job will run for 20ms instead of 10ms. With the
current scale-invariance implementation, the util_avg value will be
180, as will the load_avg. The load_avg value would have been 360 if
we removed all scale invariance, as you proposed above.
With the proposed implementation, the util_avg value will be 199, as
will the load_avg, because we will add the same number of segments.
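
The figures above can be approximated with a very small floating-point
model. This is only a sketch under simplifying assumptions, not the
kernel's fixed-point implementation: pelt_after_running() is a hypothetical
helper, periods are taken as 1ms, the signal starts from zero, and y is
hard-coded to an approximation of 0.5^(1/32).

```c
#include <assert.h>

/* Approximate per-period decay factor, chosen so that y^32 = 0.5. */
#define PELT_Y	0.97857206

/*
 * Hypothetical helper: value of a PELT-like average, normalised to
 * 0..1024, after `periods` consecutive running periods starting from
 * zero. Each period contributes `scale` (0.0..1.0) of a full segment.
 */
double pelt_after_running(int periods, double scale)
{
	double avg = 0.0;
	int i;

	assert(periods >= 0 && scale >= 0.0 && scale <= 1.0);
	for (i = 0; i < periods; i++)
		avg = avg * PELT_Y + (1.0 - PELT_Y) * 1024.0 * scale;
	return avg;
}
```

In this model, pelt_after_running(10, 1.0) is about 199 (max capacity, and
also half frequency with the proposed scaling, since the 20ms are accounted
as 10ms), pelt_after_running(20, 0.5) is about 180 (current scale
invariance at half frequency) and pelt_after_running(20, 1.0) is about 360
(no scale invariance at all).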

The PELT implementation is about calculating the load/utilization of a
task/CPU. It uses time to reflect the amount of work done by a
task. In a system where all CPUs have the same fixed compute capacity
per second, it's fine to only use the time. But when compute capacity
differs across the system, we have to reflect this difference
in the time that is added to a segment.

>
>
> Much more detail:
>
> Original geometric series:
>
> \sum (0..n) u_n * y^n
>
> Current geometric series with scale invariance:
>
> \sum (0..n) u_n * c_n * y^n
>
> In reality we only approximate having the capacity scaling for each
> segment as we don't enforce PELT updates for each capacity change due to
> frequency scaling.
>
> In this patch scaling is applied to the entire delta since last update

We probably don't mean the same thing by "last update", but that's
exactly the same as the current implementation.

> instead of each individual segment. That gives us a very interesting
> time warping effect when updates happen less frequently than every 1ms.
> On cpus with reduced capacity the delta is reduced and all the math is
> done as if less time had passed since last update which introduces an
> error with regard to the decay of the series as we segments of time with
> zero contribution.

This happens because the compute capacity is lower, so the actual work
done during this segment of time is lower too. The computation will
take longer and, in the end, we will have the same number of segments
of time for the job.

>
> It is probably easier described with an example:
>
> We have one periodic task with a period of 4ms. Busy time per activation
> is 1ms at 100% capacity. The task has been running forever (>350ms) and

It's not clear to me why you want a task that was running forever
before the use case. Is it for anything apart from starting at the max
value, or more precisely at 33% of the max value when f=33%?

> we consider the load_avg calculations at enqueue/dequeue, which
> should be the most common update points for this scenario besides the tick
> updates.
>
> task states
> s = sleeping
> R = running (scheduled)
>
> pelt
> d = decay segment (load_avg * y, y^32 = 0.5)
> [0..1024] = segment contribution (including any scaling)
> U = __update_load_avg() is called
>
> f = 100%
>         | 1024us | 1024us | 1024us | 1024us | 1024us | 1024us |
> task    |   s    |   R    |   s    |   s    |   s    |   R    |
> pelt ml |   d    U  1024  U   d    |   d    |   d    U  1024  U
> patch   |   d    U  1024  U   d    |   d    |   d    U  1024  U
>
> f = 33%
>         | 1024us | 1024us | 1024us | 1024us | 1024us | 1024us |
> task    |   s    |   R    |   R    |   R    |   s    |   R    |
> pelt ml |   d    U 341y^2 |  341y  |  341   U   d    U 341y^2 |
> patch   |   d    U  1024  |   0    |   0    U   d    U  1024  |
>
> In the first case, f = 100%, the update after the busy period is
> complete we decay load_avg by one period (segment) and add a
> contribution of 1024. We are at 100% so it is a full contribution for
> this segment both with and without this patch. The task enqueue update
> accounts for the sleeping time by decaying load_avg three periods. The
> same in both cases. We could say that the contributions of a full cycle
> of the task is:
>
> f_100% cycle = 1024 + decay(4)
>
> If we reduce the capacity to 33%, things look a bit different. In
> mainline, the dequeue update after the busy period would decay three
> periods and add \sum (i = 2..0) 0.33*1024*y^i to account for the three
> busy segments. The enqueue update decays the load_avg by one segment.
> The full cycle contribution becomes:
>
> Mainline:
> f_33% cycle = 341*y^2 + 341*y + 341 + decay(4)
>
> With this patch it is different. At the dequeue update we scale the time
> delta instead of the contribution, such that delta = 0.33*delta, so the
> calculation is based on only one period (segment) has passed. Hence we
> decay by one segment and add 1024, but still set the update point to the
> true timestamp so the following update doesn't take the two remaining
> segments into account. The enqueue update decays the load_avg by one
> segment, just like it does in mainline. The full cycle contribution
> becomes:
>
> Patch:
> f_33% cycle = 1024 + decay(2)
>
> This is clearly different from mainline. Not only is the busy
> contribution higher, 1024 > 341*y^2 + 341*y + 341, since y < 1, but we
> also decay less. The result is an inflation of the load_avg and util_avg

So the busy contribution is exactly the same as at fmax, whereas that
is not the case with the current implementation, as you mentioned above.
But the number of decays is not the same, whereas the current
implementation has the same number of decays.
I have to look at how I can improve the decay accuracy.
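
To put rough numbers on the two "f = 33%" cycles above, here is a
floating-point sketch. It is not kernel code: the helper names are
hypothetical, y is a hard-coded approximation of 0.5^(1/32), and the steady
state is taken from S = busy + S * y^decays, where "decays" is the number
of decay periods applied per cycle (4 for mainline, 2 with the patch as
described above).

```c
#include <assert.h>

/* Approximate per-period decay factor, chosen so that y^32 = 0.5. */
#define PELT_Y	0.97857206

/* y^periods, computed by repeated multiplication. */
double decay_factor(int periods)
{
	double f = 1.0;

	assert(periods >= 0);
	while (periods-- > 0)
		f *= PELT_Y;
	return f;
}

/* Mainline busy contribution per cycle at f = 33%: 341*y^2 + 341*y + 341. */
double mainline_busy_33(void)
{
	return 341.0 * (decay_factor(2) + decay_factor(1) + decay_factor(0));
}

/*
 * Steady-state sum right after the dequeue update, for a cycle that
 * adds `busy` and applies `decays` decay periods in total:
 * S = busy + S * y^decays  =>  S = busy / (1 - y^decays).
 */
double steady_state_sum(double busy, int decays)
{
	assert(decays > 0);
	return busy / (1.0 - decay_factor(decays));
}
```

In this model the mainline busy contribution is about 1001, so slightly
below 1024. steady_state_sum(mainline_busy_33(), 4) stays close to the
f = 100% value steady_state_sum(1024.0, 4), while
steady_state_sum(1024.0, 2) is roughly twice as large, which is the effect
of the missing decays.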

> metrics for tasks that run for more than 1ms at a time if
> __update_load_avg() isn't called every 1ms.
>
> I did a quick test to confirm this using a single periodic task and
> changing the compute capacity.
>
>               util_avg
> capacity   mainline   patch
> 1024       ~359       ~352
> 512        ~340       ~534
>
> Execution time went from 1.4ms to 2.8ms per activation without
> overloading the cpu.

Conversely, there are some use cases where the proposed util_avg
is more accurate. In fact, this mainly depends on whether the decay or
the load dominates the value of util_avg/load_avg.
Once the running time is around 100ms, we "saturate" the
load_avg or the util_avg. So a task that runs 50ms every 150ms at max
capacity will be around 695 for both util_avg and load_avg, whereas it
will be around 470 for util_avg and 940 for load_avg at half capacity
(due to uarch), as the duration becomes 100ms. For this example, we have
lost the scale invariance with the current implementation. With the
proposed changes, the util_avg and the load_avg would be 750.
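
A coarse way to reproduce the order of magnitude of these figures is to
iterate a periodic run/idle pattern until it settles. The sketch below is
an assumption-laden model, not the kernel implementation: periodic_peak()
is a hypothetical helper, periods are taken as 1ms, the value is sampled at
the end of the running phase, and y is hard-coded to an approximation of
0.5^(1/32).

```c
#include <assert.h>

/* Approximate per-period decay factor, chosen so that y^32 = 0.5. */
#define PELT_Y	0.97857206

/*
 * Hypothetical helper: value (0..1024) reached at the end of the running
 * phase by a task that repeatedly idles for `idle` periods and then runs
 * for `run` periods, each running period contributing `scale` (0.0..1.0)
 * of a full segment. 100 cycles are plenty for the value to settle.
 */
double periodic_peak(int run, int idle, double scale)
{
	double avg = 0.0;
	int cycle, i;

	assert(run > 0 && idle >= 0 && scale >= 0.0 && scale <= 1.0);
	for (cycle = 0; cycle < 100; cycle++) {
		for (i = 0; i < idle; i++)
			avg *= PELT_Y;
		for (i = 0; i < run; i++)
			avg = avg * PELT_Y + (1.0 - PELT_Y) * 1024.0 * scale;
	}
	return avg;
}
```

In this model, periodic_peak(50, 100, 1.0) is about 705 (max capacity),
periodic_peak(100, 50, 0.5) about 472 and periodic_peak(100, 50, 1.0) about
943 (half capacity, scaled and unscaled), and periodic_peak(50, 50, 1.0)
about 765 (100ms of running accounted as 50ms). The model is coarse, so it
lands near the figures quoted above rather than exactly on them.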

>
> The fundamental idea in scale invariance is that util_avg should be
> comparable between cpu at any capacity as long none of them are
> over-utilized. This isn't preserved by the patch in its current form.
>
>> With this change, we don't have to test if a CPU is overloaded or not in
>> order to use one metric (util) or another (load) as all metrics are always
>> valid.
>
> I'm not sure what you mean by always valid. util_avg is still not a
> meaningful metric for tasks running on over-utilized cpus, so it can not
> be used unconditionally. If util_avg > capacity we still have no clue if
> the task can fit on a different cpu with higher capacity.

That's one side goal of changing the way the scale invariance is taken
into account in util_avg. A value above the current capacity can still
be meaningful.

>
>> I have put below some examples of duration to reach some typical load value
>> according to the capacity of the CPU with current implementation
>> and with this patch.
>>
>> Util (%)      max capacity   half capacity (mainline)   half capacity (w/ patch)
>> 972 (95%)     138ms          not reachable              276ms
>> 486 (47.5%)   30ms           138ms                      60ms
>> 256 (25%)     13ms           32ms                       26ms
>
> I assume that these are numbers for util_avg and not load_avg as said in

It can be both. Half capacity can refer to frequency invariance or
uarch invariance.

> the text above. It confuses me a little bit as you started out by
> talking about the lack of uarch scaling of load_avg and propose to
> change that, not util_avg.

The goal is to impact both util_avg and load_avg:
Being able to add the uarch in the calculation of load_avg, to
improve fairness in the presence of CPUs with different capacities.
Being able to use util_avg over a wider time scale.

>
> The equivalent table for load_avg would be something like this:
>
> load_avg (%)  max capacity   half capacity (mainline)   half capacity (w/ patch)
> 972 (95%)     138ms          138ms                      276ms
> 486 (47.5%)   30ms           30ms                       60ms
> 256 (25%)     13ms           13ms                       26ms
>
> load_avg does reach max capacity as it is. The patch just makes it
> happen at a slower pace, which I'm not sure is a good or bad thing.
>
>> We can see that at half capacity, we need twice the duration of max
>> capacity with this patch whereas we have a non linear increase of the
>> duration with current implementation.
>
> Is it a problem that the time to reach a certain value is not linear?

A non-linear relation doesn't help with scale invariance.
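
The linear versus non-linear behaviour can be checked with a small model
that counts how many 1ms periods are needed to reach a given value from
zero. This is only a sketch with hypothetical helpers, not the kernel
implementation: periods_to_reach() models the current per-segment scaling,
patch_duration_ms() models the proposed time scaling by stretching the
max-capacity duration by 1/scale, and y is hard-coded to an approximation
of 0.5^(1/32).

```c
#include <assert.h>

/* Approximate per-period decay factor, chosen so that y^32 = 0.5. */
#define PELT_Y		0.97857206
#define MAX_PERIODS	100000

/*
 * Hypothetical helper: number of consecutive running periods needed to
 * reach `target` (0..1024) from zero when each period contributes `scale`
 * of a full segment. Returns -1 if the target is not reached.
 */
int periods_to_reach(double target, double scale)
{
	double avg = 0.0;
	int n;

	assert(target > 0.0 && scale >= 0.0 && scale <= 1.0);
	for (n = 1; n <= MAX_PERIODS; n++) {
		avg = avg * PELT_Y + (1.0 - PELT_Y) * 1024.0 * scale;
		if (avg >= target)
			return n;
	}
	return -1;
}

/*
 * Hypothetical helper: wall-clock duration in ms with time scaling, where
 * full segments are accounted but each one takes 1/scale ms of real time.
 */
double patch_duration_ms(double target, double scale)
{
	int n = periods_to_reach(target, 1.0);

	assert(scale > 0.0 && scale <= 1.0);
	return n < 0 ? -1.0 : n / scale;
}
```

For the 972 and 486 rows, this model gives 138 and 30 periods at max
capacity, 138 periods for 486 and "not reachable" (-1) for 972 with the
current scaling at half capacity, and 276ms and 60ms with time scaling.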

Thanks,
Vincent

>
> It is still somewhat unclear to me why we would want this change. Adding
> uarch scaling to load_avg but then modify the geometric series so the
> end result is the same except that it now reacts slower at lower
> capacities seems a bit strange.
>
>>
>> [1] https://lkml.org/lkml/2014/12/18/128
>>
>> Signed-off-by: Vincent Guittot <vincent.guittot@xxxxxxxxxx>
>> ---
>> kernel/sched/fair.c | 28 +++++++++++++---------------
>> 1 file changed, 13 insertions(+), 15 deletions(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 824aa9f..f2a18e1 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -2560,10 +2560,9 @@ static __always_inline int
>> __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
>> unsigned long weight, int running, struct cfs_rq *cfs_rq)
>> {
>> - u64 delta, scaled_delta, periods;
>> + u64 delta, periods;
>> u32 contrib;
>> - unsigned int delta_w, scaled_delta_w, decayed = 0;
>> - unsigned long scale_freq, scale_cpu;
>> + unsigned int delta_w, decayed = 0;
>>
>> delta = now - sa->last_update_time;
>> /*
>> @@ -2584,8 +2583,10 @@ __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
>> return 0;
>> sa->last_update_time = now;
>>
>> - scale_freq = arch_scale_freq_capacity(NULL, cpu);
>> - scale_cpu = arch_scale_cpu_capacity(NULL, cpu);
>> + if (running) {
>> + delta = cap_scale(delta, arch_scale_freq_capacity(NULL, cpu));
>> + delta = cap_scale(delta, arch_scale_cpu_capacity(NULL, cpu));
>
> This is where the time warping happens. delta is used to determine the
> number of periods (segments) since last update. Scaling this, as opposed
> to the contributions for each segment individually, can lead to
> disappearing segments.