Re: [PATCH] sched/fair: update scale invariance of pelt

From: Vincent Guittot
Date: Tue Dec 15 2015 - 05:21:36 EST


On 14 December 2015 at 01:26, Yuyang Du <yuyang.du@xxxxxxxxx> wrote:
> Hi Vincent,
>
> I don't quite follow what this is doing; maybe I need more time
> to ramp up on gory details like these.
>
> Do you scale or not? You seem to have removed the scaling, but added it
> after "Remainder of delta accrued against u_0"..

I'm scaling the time delta before feeding it into the PELT algorithm,
rather than scaling the accumulated values afterwards. My reply to
Morten's comment explains in more detail what I'm trying to achieve.
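
For illustration only, here is a minimal standalone sketch of what "scaling
the time" means. It is not the kernel code: pelt_scaled_delta() is a
hypothetical helper, and scale_freq/scale_cpu are plain parameters standing
in for the values returned by arch_scale_freq_capacity() and
arch_scale_cpu_capacity(). Only cap_scale() mirrors the kernel macro.

```c
#include <assert.h>
#include <stdint.h>

#define SCHED_CAPACITY_SHIFT	10
#define SCHED_CAPACITY_SCALE	(1UL << SCHED_CAPACITY_SHIFT)

/* Same arithmetic as the kernel's cap_scale() macro: v * s / 1024. */
static uint64_t cap_scale(uint64_t v, unsigned long s)
{
	return (v * s) >> SCHED_CAPACITY_SHIFT;
}

/*
 * Hypothetical helper, not part of the patch: returns the time delta
 * that would be accumulated by PELT. Running time is scaled by both the
 * frequency and the uarch capacity; non-running time (sleeping or
 * waiting on a runqueue) is left untouched.
 */
static uint64_t pelt_scaled_delta(uint64_t delta, int running,
				  unsigned long scale_freq,
				  unsigned long scale_cpu)
{
	assert(scale_freq <= SCHED_CAPACITY_SCALE);
	assert(scale_cpu <= SCHED_CAPACITY_SCALE);

	if (running) {
		delta = cap_scale(delta, scale_freq);
		delta = cap_scale(delta, scale_cpu);
	}

	return delta;
}
```

Under these assumptions, a running delta of 1024us at half frequency on a
full-capacity CPU is accounted as 512us, while the same delta spent not
running is accounted as 1024us.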

Thanks,
Vincent

>
> Thanks,
> Yuyang
>
> On Tue, Nov 24, 2015 at 02:49:30PM +0100, Vincent Guittot wrote:
>> The current implementation of load tracking invariance scales the load
>> tracking value with current frequency and uarch performance (only for
>> utilization) of the CPU.
>>
>> One main result of the current formula is that the figures are capped by
>> the current capacity of the CPU. This limitation is the main reason for not
>> including the uarch invariance (arch_scale_cpu_capacity) in the calculation
>> of load_avg, because capping the load can generate erroneous system load
>> statistics, as described in this example [1].
>>
>> Instead of scaling the complete value of the PELT algorithm, we should only
>> scale the running time by the current capacity of the CPU. It seems more
>> correct to only scale the running time because the non-running time of a
>> task (sleeping or waiting for a runqueue) is the same whatever the current
>> freq and the compute capacity of the CPU.
>>
>> Then, one main advantage of this change is that the load of a task can
>> reach the max value whatever the current freq and the uarch of the CPU on
>> which it runs. It will just take more time at a lower freq than at max freq,
>> or on a "little" CPU compared to a "big" one. The load and the utilization
>> stay invariant across the system, so we can still compare them between CPUs,
>> but with a wider range of values.
>>
>> With this change, we don't have to test if a CPU is overloaded or not in
>> order to use one metric (util) or another (load) as all metrics are always
>> valid.
>>
>> I have put below some examples of duration to reach some typical load value
>> according to the capacity of the CPU with current implementation
>> and with this patch.
>>
>> Util (%)      max capacity   half capacity (mainline)   half capacity (w/ patch)
>> 972 (95%)     138ms          not reachable              276ms
>> 486 (47.5%)   30ms           138ms                      60ms
>> 256 (25%)     13ms           32ms                       26ms
>>
>> We can see that at half capacity, we need twice the duration of max
>> capacity with this patch, whereas we have a non-linear increase of the
>> duration with the current implementation.
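
The durations above can be approximated with a small standalone model. This
is a rough sketch under stated assumptions, not the kernel implementation:
it works in floating point with whole 1ms periods, assumes a task that runs
continuously from a utilization of 0, uses y ~= 0.97857206 so that
y^32 ~= 0.5, and the names ms_to_reach(), PELT_SCALE_VALUE and
PELT_SCALE_TIME are hypothetical. PELT_SCALE_VALUE models the current
behaviour (contribution scaled by capacity), PELT_SCALE_TIME models the
patch (time scaled by capacity).

```c
#include <assert.h>

#define SCHED_CAPACITY_SCALE	1024UL
#define PELT_Y			0.97857206	/* ~2^(-1/32), so y^32 ~= 0.5 */

enum pelt_model { PELT_SCALE_VALUE, PELT_SCALE_TIME };

/*
 * Returns the number of 1ms steps of continuous running needed for the
 * modelled utilization to reach 'target' on a CPU of the given capacity,
 * or -1 if the target is not reached within max_ms.
 */
static int ms_to_reach(double target, unsigned long capacity,
		       enum pelt_model model, int max_ms)
{
	double util = 0.0;
	unsigned long elapsed = 0;	/* PELT time, in 1/1024ths of a period */
	int ms;

	assert(capacity > 0 && capacity <= SCHED_CAPACITY_SCALE);

	for (ms = 1; ms <= max_ms; ms++) {
		/* With time scaling, a slower CPU advances PELT time more slowly. */
		elapsed += (model == PELT_SCALE_TIME) ? capacity
						      : SCHED_CAPACITY_SCALE;

		while (elapsed >= SCHED_CAPACITY_SCALE) {
			/* With value scaling, a full period only contributes 'capacity'. */
			double full = (model == PELT_SCALE_TIME) ?
				(double)SCHED_CAPACITY_SCALE : (double)capacity;

			elapsed -= SCHED_CAPACITY_SCALE;
			util = util * PELT_Y + full * (1.0 - PELT_Y);
		}

		if (util >= target)
			return ms;
	}

	return -1;
}
```

With a capacity of 512, this model gives 276 and 60 steps for targets 972
and 486 with PELT_SCALE_TIME, and -1 (not reachable) and 138 with
PELT_SCALE_VALUE; with a capacity of 1024 it gives 138 and 30. For the 256
row the model lands within a couple of milliseconds of the table, because
it only counts whole periods.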
>>
>> [1] https://lkml.org/lkml/2014/12/18/128
>>
>> Signed-off-by: Vincent Guittot <vincent.guittot@xxxxxxxxxx>
>> ---
>> kernel/sched/fair.c | 28 +++++++++++++---------------
>> 1 file changed, 13 insertions(+), 15 deletions(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 824aa9f..f2a18e1 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -2560,10 +2560,9 @@ static __always_inline int
>> __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
>> unsigned long weight, int running, struct cfs_rq *cfs_rq)
>> {
>> - u64 delta, scaled_delta, periods;
>> + u64 delta, periods;
>> u32 contrib;
>> - unsigned int delta_w, scaled_delta_w, decayed = 0;
>> - unsigned long scale_freq, scale_cpu;
>> + unsigned int delta_w, decayed = 0;
>>
>> delta = now - sa->last_update_time;
>> /*
>> @@ -2584,8 +2583,10 @@ __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
>> return 0;
>> sa->last_update_time = now;
>>
>> - scale_freq = arch_scale_freq_capacity(NULL, cpu);
>> - scale_cpu = arch_scale_cpu_capacity(NULL, cpu);
>> + if (running) {
>> + delta = cap_scale(delta, arch_scale_freq_capacity(NULL, cpu));
>> + delta = cap_scale(delta, arch_scale_cpu_capacity(NULL, cpu));
>> + }
>>
>> /* delta_w is the amount already accumulated against our next period */
>> delta_w = sa->period_contrib;
>> @@ -2601,16 +2602,15 @@ __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
>> * period and accrue it.
>> */
>> delta_w = 1024 - delta_w;
>> - scaled_delta_w = cap_scale(delta_w, scale_freq);
>> if (weight) {
>> - sa->load_sum += weight * scaled_delta_w;
>> + sa->load_sum += weight * delta_w;
>> if (cfs_rq) {
>> cfs_rq->runnable_load_sum +=
>> - weight * scaled_delta_w;
>> + weight * delta_w;
>> }
>> }
>> if (running)
>> - sa->util_sum += scaled_delta_w * scale_cpu;
>> + sa->util_sum += delta_w << SCHED_CAPACITY_SHIFT;
>>
>> delta -= delta_w;
>>
>> @@ -2627,25 +2627,23 @@ __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
>>
>> /* Efficiently calculate \sum (1..n_period) 1024*y^i */
>> contrib = __compute_runnable_contrib(periods);
>> - contrib = cap_scale(contrib, scale_freq);
>> if (weight) {
>> sa->load_sum += weight * contrib;
>> if (cfs_rq)
>> cfs_rq->runnable_load_sum += weight * contrib;
>> }
>> if (running)
>> - sa->util_sum += contrib * scale_cpu;
>> + sa->util_sum += contrib << SCHED_CAPACITY_SHIFT;
>> }
>>
>> /* Remainder of delta accrued against u_0` */
>> - scaled_delta = cap_scale(delta, scale_freq);
>> if (weight) {
>> - sa->load_sum += weight * scaled_delta;
>> + sa->load_sum += weight * delta;
>> if (cfs_rq)
>> - cfs_rq->runnable_load_sum += weight * scaled_delta;
>> + cfs_rq->runnable_load_sum += weight * delta;
>> }
>> if (running)
>> - sa->util_sum += scaled_delta * scale_cpu;
>> + sa->util_sum += delta << SCHED_CAPACITY_SHIFT;
>>
>> sa->period_contrib += delta;
>>
>> --
>> 1.9.1