Re: [PATCH] cpufreq: schedutil: add up/down frequency transition rate limits
From: Peter Zijlstra
Date: Mon Nov 21 2016 - 10:26:22 EST
On Mon, Nov 21, 2016 at 02:59:19PM +0000, Patrick Bellasi wrote:
> A fundamental problem in IMO is that we are trying to use a "dynamic
> metric" to act as a "predictor".
>
> PELT is a "dynamic metric" since it continuously change while a task
> is running. Thus it does not really provides an answer to the question
> "how big this task is?" _while_ the task is running.
> Such an information is available only when the task sleep.
> Indeed, only when the task completes an activation and goes to sleep
> PELT has reached a value which represents how much CPU bandwidth has
> been required by that task.
I'm not sure I agree with that. We can only tell how big a task is
_while_ its running, esp. since its behaviour is not steady-state. Tasks
can change etc..
Also, as per the whole argument on why peak_util was bad, at the moment
a task goes to sleep, the PELT signal is actually an over-estimate,
since it hasn't yet had time to average out.
And a real predictor requires a crytal-ball instruction, but until such
time that hardware people bring us that goodness, we'll have to live
with predicting the near future based on the recent past.
> For example, if we consider the simple yet interesting case of a
> periodic task, PELT is a wobbling signal which reports a correct
> measure of how much bandwidth is required only when a task completes
> its RUNNABLE status.
Its actually an over-estimate at that point, since it just added a
sizable chunk to the signal (for having been runnable) that hasn't yet
had time to decay back to the actual value.
> To be more precise, the correct value is provided by the average PELT
> and this also depends on the period of the task compared to the
> PELT rate constant.
> But still, to me a fundamental point is that the "raw PELT value" is
> not really meaningful in _each and every single point in time_.
Agreed.
> All that considered, we should be aware that to properly drive
> schedutil and (in the future) the energy aware scheduler decisions we
> perhaps need better instead a "predictor".
> In the simple case of the periodic task, a good predictor should be
> something which reports always the same answer _in each point in
> time_.
So the problem with this is that not many tasks are that periodic, and
any filter you put on top will add, lets call it, momentum to the
signal. A reluctance to change. This might negatively affect
non-periodic tasks.
In any case, worth trying, see what happens.
> For example, a task running 30 [ms] every 100 [ms] is a ~300 util_avg
> task. With PELT, we get a signal which range between [120,550] with an
> average of ~300 which is instead completely ignored. By capping the
> decay we will get:
>
> decay_cap [ms] range average
> 0 120:550 300
> 64 140:560 310
> 32 320:660 430
>
> which means that still the raw PELT signal is wobbling and never
> provides a consistent response to drive decisions.
>
> Thus, a "predictor" should be something which sample information from
> PELT to provide a more consistent view, a sort of of low-pass filter
> on top of the "dynamic metric" which is PELT.
>
> Should not such a "predictor" help on solving some of the issues
> related to PELT slow ramp-up or fast ramp-down?
I think intel_pstate recently added a local PID filter, I asked at the
time if something like that should live in generic code, looks like
maybe it should.