Re: [PATCH v5 2/2] sched/fair: update scale invariance of PELT

From: Vincent Guittot
Date: Tue Nov 06 2018 - 09:28:02 EST


On Mon, 5 Nov 2018 at 15:59, Morten Rasmussen <morten.rasmussen@xxxxxxx> wrote:
>
> On Mon, Nov 05, 2018 at 10:10:34AM +0100, Vincent Guittot wrote:
> > On Fri, 2 Nov 2018 at 16:36, Dietmar Eggemann <dietmar.eggemann@xxxxxxx> wrote:
> > >
...
> > > >
> > > > In order to achieve this time scaling, a new clock_pelt is created per rq.
> > > > The increase of this clock scales with current capacity when something
> > > > is running on rq and synchronizes with clock_task when rq is idle. With
> > > > this mechanism, we ensure the same running and idle time whatever the
> > > > current capacity.
> > >
> > > Thinking about this new approach on a big.LITTLE platform:
> > >
> > > CPU Capacities big: 1024 LITTLE: 512, performance CPUfreq governor
> > >
> > > A 50% (runtime/period) task on a big CPU will become an always running
> > > task on the little CPU. The utilization signal of the task and the
> > > cfs_rq of the little CPU converges to 1024.
> > >
> > > With contrib scaling the utilization signal of the 50% task converges to
> > > 512 on the little CPU, even though it is always running on it, and so does the
> > > one of the cfs_rq.
> > >
> > > Two 25% tasks on a big CPU will become two 50% tasks on a little CPU.
> > > The utilization signal of the tasks converges to 512 and the one of the
> > > cfs_rq of the little CPU converges to 1024.
> > >
> > > With contrib scaling the utilization signal of the 25% tasks converges
> > > to 256 on the little CPU, even though they each run 50% on it, and the one of
> > > the cfs_rq converges to 512.
> > >
> > > So what do we consider system-wide invariance? I thought that e.g. a 25%
> > > task should have a utilization value of 256 no matter on which CPU it is
> > > running?
> > >
> > > In both cases, the little CPU never goes idle whereas the big CPU does.
> >
> > IMO, the key point here is that there is no idle time. As soon as
> > there is no idle time, you don't know whether a task has enough
> > compute capacity, so you can't tell the difference between the 50%
> > running task and an always running task on the little core.
> > It's also interesting to notice that the task will reach the always
> > running state only after more than 600ms on the little core, with
> > its utilization starting from 0.
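
A rough back-of-the-envelope check of that 600ms figure, under my
reading that it refers to the time-scaled signal converging to 1024:
PELT needs about 320ms of pelt time to get within one unit of its
ceiling, and on a 512 capacity CPU the scaled pelt clock advances at
about half rate while the task runs, hence roughly 640ms of wall-clock
time. The snippet below is only an illustration of that arithmetic,
not code from the patch:

#include <math.h>
#include <stdio.h>

int main(void)
{
	/* pelt time for 1024 * (1 - y^t) to get within one unit of 1024,
	 * i.e. y^t < 1/1024 with y = 2^(-1/32ms): t > 32 * 10 = 320ms */
	double pelt_ms = 32.0 * log2(1024.0);
	/* on a 512 capacity CPU the scaled pelt clock advances at about
	 * half rate while the task runs, so wall-clock time is doubled */
	double wall_ms = pelt_ms * 1024.0 / 512.0;

	printf("pelt: %.0fms, wall on little: %.0fms\n", pelt_ms, wall_ms);
	return 0;
}
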
> >
> > Then, considering system-wide invariance, the tasks are not really
> > invariant. If we take a 50% task that runs 40ms in a period of 80ms,
> > the max utilization of the task will be 721 on the big core and 512
> > on the little core.
> > Then, if you take a 39ms running task instead, the utilization on the
> > big core will reach 709 but it will be 507 on the little core. So the
> > utilization depends on the current capacity.
> > With the new proposal, the max utilization will be 709 on both big
> > and little cores for the 39ms running task. For the 40ms running
> > task, the utilization will be 721 on the big core; then, if the task
> > moves to the little core, it will reach 721 after 80ms, then 900
> > after more than 160ms, and 1000 after 320ms.
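
As a side note, those peak values can be cross-checked with the usual
closed form for a periodic task, peak = max * (1 - y^r) / (1 - y^p)
with y = 2^(-1/32ms). The small standalone program below is only an
illustration (a continuous-time approximation that ignores the
discrete 1024us windows, so it can be off by a point or two); it is
not kernel code:

#include <math.h>
#include <stdio.h>

/* steady-state peak PELT utilization of a task running r ms every p ms,
 * with its contribution scaled to 'max' */
static double peak_util(double run_ms, double period_ms, double max)
{
	double y = pow(0.5, 1.0 / 32.0);	/* PELT halves every 32ms */

	return max * (1.0 - pow(y, run_ms)) / (1.0 - pow(y, period_ms));
}

int main(void)
{
	printf("40ms/80ms on big:    %.0f\n", peak_util(40, 80, 1024)); /* ~721 */
	printf("39ms/80ms on big:    %.0f\n", peak_util(39, 80, 1024)); /* ~709 */
	/* on the little core the 39ms of work stretches to 78ms of the
	 * 80ms period and the contribution is scaled to 512 */
	printf("39ms/80ms on little: %.0f\n", peak_util(78, 80, 512));  /* ~507 */
	return 0;
}
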
>
> It has always been debatable what to do with utilization when there are
> no spare cycles.
>
> In Dietmar's example where two 25% tasks are put on a 512 (50%) capacity
> CPU we add just enough utilization to have no spare cycles left. One
> could argue that 25% is still the correct utilization for those tasks.
> However, we only know their true utilization because they just ran
> unconstrained on a higher capacity CPU. Once they are on the 512 capacity
> CPU we wouldn't know if the tasks grew in utilization as there are no
> spare cycles to use.
>
> As I see it, the most fundamental difference between scaling
> contribution and time for PELT is the characteristics when CPUs are
> over-utilized.

I agree that there is a big difference in the way the over-utilization
state is handled.

>
> With contribution scaling the PELT utilization of a task is a _minimum_
> utilization. Regardless of where the task is currently/was running (and
> provided that it doesn't change behaviour) its PELT utilization will
> approximate its _minimum_ utilization on an idle 1024 capacity CPU.

The main drawback is that this _minimum_ utilization depends on the CPU
capacity on which the task runs. As an example, the two 25% tasks on a
256 capacity CPU will each have a utilization of 128: each task runs
50% of the time there and its contribution is scaled by 256/1024, so it
converges to 0.5 * 256 = 128.

>
> With time scaling the PELT utilization doesn't really have a meaning on
> its own. It has to be compared to the capacity of the CPU where it
> is/was running to know what its current PELT utilization means. When

I would have said the opposite. The utilization of the task will
always reflect the same amount of work that has already been done,
whatever the CPU capacity.
In fact, the new scaling mechanism uses the real amount of work that
has already been done to compute the utilization signal, which is not
the case currently. This gives more information about the real amount
of work that has been done in the over-utilization case.
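
To illustrate what I mean by "real amount of work", here is a very
simplified sketch of the idea (not the actual patch, and the helper
names are only indicative): while something runs, the elapsed time is
scaled by the current frequency and CPU capacity before being added to
the per-rq pelt clock, so the clock only accounts for the work actually
done; when the rq is idle, the clock is simply resynced with
clock_task:

/* simplified sketch, not the code of the patch */
static inline void sketch_update_rq_clock_pelt(struct rq *rq, u64 delta)
{
	if (rq->curr == rq->idle) {
		/* idle time is not scaled: catch up with clock_task */
		rq->clock_pelt = rq_clock_task(rq);
		return;
	}

	/* scale the elapsed time by the current frequency ... */
	delta = (delta * arch_scale_freq_capacity(cpu_of(rq)))
						>> SCHED_CAPACITY_SHIFT;
	/* ... and by the uarch capacity of the CPU */
	delta = (delta * arch_scale_cpu_capacity(NULL, cpu_of(rq)))
						>> SCHED_CAPACITY_SHIFT;

	rq->clock_pelt += delta;
}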

> the utilization over-shoots the capacity, its value no longer
> represents utilization; it just means that the task has a higher
> compute demand than is offered on its current CPU, and a high value
> means that it has been suffering longer. It can't be used to predict
> the actual utilization on an idle 1024 capacity CPU any better than
> contribution scaled PELT utilization.

I think that it provides earlier detection of over-utilization and a
more accurate signal for a longer time, which can help load balancing.
Coming back to the 50% task example, I will use a 50ms running time in
a 100ms period for the example below to make it easier to follow.

Starting from 0, the evolution of the utilization is:

With contribution scaling:

  time        0ms   50ms  100ms  150ms  200ms
  capacity
  1024          0    666
   512          0    333    453
   256          0    169    226    246    252

At 512 capacity, when the CPU starts to be over-utilized (@100ms), the
utilization is already too low (453 instead of 666) and the scheduler
does not yet detect that we are over-utilized. That's even worse with
the lower 256 capacity.

With time scaling:

  time        0ms   50ms  100ms  150ms  200ms
  capacity
  1024          0    666
   512          0    428    677
   256          0    234    468    564    677

At 512 capacity, we know that the current capacity is not enough and
the utilization reflects the correct utilization level compared to the
1024 capacity case (the 666 vs 677 difference comes from the 1024us
window: the last window is not full in the max capacity case).
At 256 capacity, at 100ms we know that there is not enough capacity
(in fact we already know it at 56ms). And even at 200ms, the amount of
work tracked is exactly what would have been executed on a CPU 4x
faster.
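
For anyone who wants to play with these figures, the standalone
user-space model below (an approximation using 1ms floating-point
steps instead of the kernel's integer 1024us windows, so the values do
not match the tables above exactly) shows the two behaviours for the
50ms/100ms task: with contribution scaling the signal saturates at the
CPU capacity, while with time scaling it keeps rising towards 1024
and, on the 256 capacity CPU, reaches at 200ms the same value the big
CPU reaches at 50ms:

#include <math.h>
#include <stdio.h>

#define PELT_Y	pow(0.5, 1.0 / 32.0)	/* PELT halves every 32ms */

/* one 1ms step of the average with contribution scaling: the running
 * contribution is scaled by the capacity */
static double step_contrib(double util, int running, double cap)
{
	double y = PELT_Y;

	return util * y + (running ? cap * (1.0 - y) : 0.0);
}

/* one 1ms step with time scaling: while running, pelt time only
 * advances by cap/1024 ms, but the contribution is not scaled */
static double step_time(double util, int running, double cap)
{
	double dt = running ? cap / 1024.0 : 1.0;
	double decay = pow(PELT_Y, dt);

	return util * decay + (running ? 1024.0 * (1.0 - decay) : 0.0);
}

int main(void)
{
	double caps[] = { 1024.0, 512.0, 256.0 };

	for (int c = 0; c < 3; c++) {
		double cap = caps[c], uc = 0.0, ut = 0.0;
		/* 50ms of work per 100ms period: at capacity 'cap' the work
		 * takes 50 * 1024 / cap ms of wall-clock time, so the task
		 * never goes idle on the 512 and 256 capacity CPUs */
		double busy = 50.0 * 1024.0 / cap;

		for (int t = 1; t <= 200; t++) {
			int running = (t - 1) % 100 < busy;

			uc = step_contrib(uc, running, cap);
			ut = step_time(ut, running, cap);
			if (t % 50 == 0)
				printf("cap=%4.0f t=%3dms contrib=%4.0f time=%4.0f\n",
				       cap, t, uc, ut);
		}
	}
	return 0;
}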

>
> This change might not be a showstopper, but it is something to be aware
> of and take into account wherever PELT utilization is used.

The point above is clearly a big difference between the two approaches
in the no-spare-cycle case, but I think the new one will help by giving
more information in the over-utilization case.

Vincent
>
> Morten