Re: [PATCH 2/2] sched/fair: util_est: add running_sum tracking

From: Joel Fernandes
Date: Tue Jun 05 2018 - 15:33:25 EST


On Tue, Jun 05, 2018 at 04:21:56PM +0100, Patrick Bellasi wrote:
[..]
> > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > index f74441be3f44..5d54d6a4c31f 100644
> > > --- a/kernel/sched/fair.c
> > > +++ b/kernel/sched/fair.c
> > > @@ -3161,6 +3161,8 @@ accumulate_sum(u64 delta, int cpu, struct sched_avg *sa,
> > >  		sa->runnable_load_sum =
> > >  			decay_load(sa->runnable_load_sum, periods);
> > >  		sa->util_sum = decay_load((u64)(sa->util_sum), periods);
> > > +		if (running)
> > > +			sa->running_sum = decay_load(sa->running_sum, periods);
> > >
> > >  		/*
> > >  		 * Step 2
> > > @@ -3176,8 +3178,10 @@ accumulate_sum(u64 delta, int cpu, struct sched_avg *sa,
> > >  		sa->load_sum += load * contrib;
> > >  	if (runnable)
> > >  		sa->runnable_load_sum += runnable * contrib;
> > > -	if (running)
> > > +	if (running) {
> > >  		sa->util_sum += contrib * scale_cpu;
> > > +		sa->running_sum += contrib * scale_cpu;
> > > +	}
> > >
> > >  	return periods;
> > >  }
> > > @@ -3963,6 +3967,12 @@ static inline void util_est_enqueue(struct cfs_rq *cfs_rq,
> > >  	WRITE_ONCE(cfs_rq->avg.util_est.enqueued, enqueued);
> > >  }
> >
> > PELT changes look nice and make sense :)
>
> That's not strictly speaking a PELT change... the idea is still more to
> work "on top of PELT", to make it more effective at measuring a task's
> expected CPU bandwidth requirement.

I meant "PELT change" as in change to the code that calculates PELT signals..

> > > +static inline void util_est_enqueue_running(struct task_struct *p)
> > > +{
> > > +	/* Initialize the (non-preempted) utilization */
> > > +	p->se.avg.running_sum = p->se.avg.util_sum;
> > > +}
> > > +
> > > /*
> > > * Check if a (signed) value is within a specified (unsigned) margin,
> > > * based on the observation that:
> > > @@ -4018,7 +4028,7 @@ util_est_dequeue(struct cfs_rq *cfs_rq, struct task_struct *p, bool task_sleep)
> > >  	 * Skip update of task's estimated utilization when its EWMA is
> > >  	 * already ~1% close to its last activation value.
> > >  	 */
> > > -	ue.enqueued = (task_util(p) | UTIL_AVG_UNCHANGED);
> > > +	ue.enqueued = p->se.avg.running_sum / LOAD_AVG_MAX;
> >
> > I guess we are doing an extra division here, which adds some cost. Does
> > performance look OK with the change?
>
> This extra division is indeed there, but it is done only at dequeue time
> instead of at each update_load_avg().

I know. :)

> To be more precise, at each ___update_load_avg we should really update
> running_avg by:
>
> u32 divider = LOAD_AVG_MAX - 1024 + sa->period_contrib;
> sa->running_avg = sa->running_sum / divider;
>
> but, this would imply tracking an additional signal in sched_avg and
> doing an additional division at ___update_load_avg() time.
>
> Morten suggested that, if we accept the rounding errors due to
> considering
>
> divider ~= LOAD_AVG_MAX
>
> thus discarding the (sa->period_contrib - 1024) correction, then we
> can completely skip the tracking of running_avg (thus saving space in
> sched_avg) and approximate it at dequeue time as per the code line,
> just to compute the new util_est sample to accumulate.
>
> Does that make sense now?

The patch always made sense to me; I was just pointing out the extra
division it adds. I agree that since it's done only at dequeue time, it's
probably OK to do.
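
For anyone else following along, here is a rough sketch contrasting the two
forms discussed above (this is not the actual patch code; the helper names
are made up for illustration, and LOAD_AVG_MAX follows the usual PELT
definition):

  #include <stdint.h>

  #define LOAD_AVG_MAX	47742	/* maximum possible PELT *_sum value */

  /* Exact per-update form: costs a division at every ___update_load_avg() */
  static uint32_t running_avg_exact(uint64_t running_sum, uint32_t period_contrib)
  {
  	uint32_t divider = LOAD_AVG_MAX - 1024 + period_contrib;

  	return running_sum / divider;
  }

  /*
   * Approximation used at dequeue time: accept the small rounding error
   * from dropping the (period_contrib - 1024) correction, so the division
   * is paid only once, to build the new util_est sample.
   */
  static uint32_t running_avg_approx(uint64_t running_sum)
  {
  	return running_sum / LOAD_AVG_MAX;
  }

The exact form would also need an extra running_avg field in sched_avg,
while the approximation keeps the struct unchanged and only touches the
dequeue path.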

thanks,

- Joel