Re: [PATCH 2/2] sched/fair: util_est: add running_sum tracking

From: Patrick Bellasi
Date: Tue Jun 05 2018 - 11:11:38 EST


Hi Vincent,

On 05-Jun 08:57, Vincent Guittot wrote:
> On 4 June 2018 at 18:06, Patrick Bellasi <patrick.bellasi@xxxxxxx> wrote:

[...]

> > Let's improve the estimated utilization by adding a new "sort-of" PELT
> > signal, explicitly only for SE which has the following behavior:
> > a) at each enqueue time of a task, its value is the (already decayed)
> > util_avg of the task being enqueued
> > b) it's updated at each update_load_avg
> > c) it can just increase, whenever the task is actually RUNNING on a
> > CPU, while it's kept stable while the task is RUNNANBLE but not
> > actively consuming CPU bandwidth
> >
> > Such a defined signal is exactly equivalent to the util_avg for a task
> > running alone on a CPU while, in case the task is preempted, it allows
> > to know at dequeue time how much would have been the task utilization if
> > it was running alone on that CPU.
>
> I don't agree with this statement above
> Let say that you have 2 periodic tasks which wants to run 4ms in every
> period of 10ms and wakes up at the same time.
> One task will execute 1st and the other will wait but at the end they
> have the same period and running time and as a result the same
> util_avg which is exactly what they need.

In your example above you say that both tasks will end up with a 40%
util_avg, and that's exactly also the value reported by the new
"running_avg" metric.

Both tasks, if they where running alone, they would have generated a
40% utilization, and above I'm saying:

it allows to know at dequeue time
how much would have been the task utilization
if it it was running alone on that CPU

Don't see where is not correct, maybe I should explain it better?

> > This new signal is named "running_avg", since it tracks the actual
> > RUNNING time of a task by ignoring any form of preemption.
>
> In fact, you still take into account this preemption as you remove
> some time whereas it's only a shift of the excution

When the task is enqueued we cache the (already decayed) util_avg, and
from this time on the running_avg can only increase. It increases only
for the portion of CPU time the task is running and it's never decayed
while the task is preempted.

In your example above, if we look at the second task running, the one
"delayed", you will have:

@t1 wakeup time: running_avg@t1 = util_avg@t1
since we initialize it with the
"sleep decayed" util_avg

NOTE: this initialization is the important part, more on that on your
next comment below related to the period being changed.

@t2 switch_in time: running_avg@t2 = running_avg@t1
since it's not decayed while RUNNUBLE but !RUNNING

while, meanwhile:

util_avg@t2 < util_avg@t1
since it's decayed while RUNNABLE but !RUNNING

@t3 dequeue time: running_avg@t3 > running_avg@t2


When you say "it's only a shift of the execution" I agree, and indeed
the running_avg is not affected at all.

So, we can say that I'm "accounting for preemption time" but really,
the way I do it, is just by _not decaying_ PELT when the task is:

RUNNABLE but !RUNNING

and that's why I say that running_avg gives you the same behavior of
util_avg _if_ the task was running alone, because in that case you never
have the condition above, and thus util_avg is never decayed.

[...]

> > @@ -3161,6 +3161,8 @@ accumulate_sum(u64 delta, int cpu, struct sched_avg *sa,
> > sa->runnable_load_sum =
> > decay_load(sa->runnable_load_sum, periods);
> > sa->util_sum = decay_load((u64)(sa->util_sum), periods);
> > + if (running)
> > + sa->running_sum = decay_load(sa->running_sum, periods);
>
> so you make some time disappearing from the equation as the signal
> will not be decayed and make the period looking shorter than reality.

Since at enqueue time we always initialize running_avg to whatever is
util_avg, I don't think we are messing up with the period at all.

The util_avg signal properly accounts for the period.
Thus, the combined effect of:

- initializing running_avg at enqueue time with the value of
util_avg, already decayed to properly account for the task period
- not decaying running_avg when the task is RUNNABLE but !RUNNING

should just result into "virtually" considering the two tasks of your
example "as if" they was running concurrently on two different CPUs.

Isn't it?

> With the example I mentioned above, the 2nd task will be seen as if
> its period becomes 6ms instead of 10ms which is not true and the
> utilization of the task is overestimated without any good reason

I don't see that overestimation... and I cannot even measure it.

If I run an experiment with your example above, while using the
performance governor to rule out any possible scale invariance
difference, here is what I measure:

Task1 (40ms delayed by the following Task2):
mean std max
running_avg 455.387449 22.940168 492.0
util_avg 433.233288 17.395477 458.0

Task2 (waking up at same time of Task1 and running before):
mean std max
running_avg 430.281834 22.405175 455.0
util_avg 421.745331 22.098873 456.0

and if I compare Task1 above with another experiment where Task1 is
running alone:

Task1 (running alone):
mean std min
running_avg 460.257895 22.103704 460.0
util_avg 435.119737 17.647556 461.0


Thus, there are some variations ok, but I think they are more due to
the rounding errors we have.
Task1 is still seen as a 455 expected utilization, which is not the
4/6 ~= 660 you say above if it should be accounted as a task running
4ms every 6ms.

> Furthermore, this new signal doesn't have clear meaning because it's
> not utilization but it's also not the waiting time of the task.

The meaning is: the utilization of the task _if_ it should be running
alone on that CPU. Tehe numbers above shows that since
455 (Task1 delayed) ~= 460 (Task1 running alone)

If it's running alone it would be exactly util_avg (minus rounding
errors), but if it's preempted it will be a better representation of
the task's needed CPU bandwidth.

Unless we really need to track the "waiting time of a task" I would
say that the signal above should make util_est much more accurate on
knowing, at wakeup time, how much CPU a task will likely need...
independently from preemption sources.

Here are some other experiments / workloads I'm testing:

https://gist.github.com/derkling/fe01e5ddf639e18b5d85a3b1d2e84b7d

> >
> > /*
> > * Step 2
> > @@ -3176,8 +3178,10 @@ accumulate_sum(u64 delta, int cpu, struct sched_avg *sa,
> > sa->load_sum += load * contrib;
> > if (runnable)
> > sa->runnable_load_sum += runnable * contrib;
> > - if (running)
> > + if (running) {
> > sa->util_sum += contrib * scale_cpu;
> > + sa->running_sum += contrib * scale_cpu;
> > + }
> >
> > return periods;
> > }
> > @@ -3963,6 +3967,12 @@ static inline void util_est_enqueue(struct cfs_rq *cfs_rq,
> > WRITE_ONCE(cfs_rq->avg.util_est.enqueued, enqueued);
> > }
> >
> > +static inline void util_est_enqueue_running(struct task_struct *p)
> > +{
> > + /* Initilize the (non-preempted) utilization */
> > + p->se.avg.running_sum = p->se.avg.util_sum;

This line above is what should ensure we do not mess up with the task
period.

> > +}
> > +
> > /*
> > * Check if a (signed) value is within a specified (unsigned) margin,
> > * based on the observation that:
> > @@ -4018,7 +4028,7 @@ util_est_dequeue(struct cfs_rq *cfs_rq, struct task_struct *p, bool task_sleep)
> > * Skip update of task's estimated utilization when its EWMA is
> > * already ~1% close to its last activation value.
> > */
> > - ue.enqueued = (task_util(p) | UTIL_AVG_UNCHANGED);
> > + ue.enqueued = p->se.avg.running_sum / LOAD_AVG_MAX;
> > last_ewma_diff = ue.enqueued - ue.ewma;
> > if (within_margin(last_ewma_diff, (SCHED_CAPACITY_SCALE / 100)))
> > return;
> > @@ -5437,6 +5447,8 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> > if (!se)
> > add_nr_running(rq, 1);
> >
> > + util_est_enqueue_running(p);
> > +
> > hrtick_update(rq);
> > }
> >
> > --
> > 2.15.1
> >

--
#include <best/regards.h>

Patrick Bellasi