Re: [PATCH v5 05/10] cpufreq/schedutil: get max utilization

From: Joel Fernandes
Date: Fri Jun 01 2018 - 13:45:36 EST


On Fri, Jun 01, 2018 at 03:53:07PM +0200, Vincent Guittot wrote:
> > >> >> The example with an RT task described in the cover letter can be
> > >> >> run with a DL task and will give similar results.
> > >
> > > In the cover letter you says:
> > >
> > > An rt-app use case which creates an always-running cfs thread and an
> > > rt thread that wakes up periodically, with both threads pinned on the
> > > same CPU, shows a lot of frequency switches of the CPU whereas the CPU
> > > never goes idle during the test.
> > >
> > > I would say that's a quite specific corner case where your always
> > > running CFS task has never accumulated a util_est sample.
> > >
> > > Do we really have these cases in real systems?
> >
> > My example is deliberately an extreme one because it makes the
> > problem easier to highlight.
> >
> > >
> > > Otherwise, it seems to me that we are trying to solve quite specific
> > > corner cases by adding a non-negligible level of "complexity".
> >
> > By complexity, do you mean:
> >
> > Taking into account the number of running cfs tasks to choose between
> > rq->dl.running_bw and avg_dl.util_avg?
> >
> > I'm preparing a patchset that will provide the cfs waiting time in
> > addition to dl/rt util_avg for almost no additional cost. I will try
> > to send the proposal later today.
>
>
> The code below adds tracking of how long cfs tasks wait because of
> rt/dl preemption. This waiting time can then be used when selecting an OPP,
> instead of the dl util_avg, which can become higher than the dl bandwidth
> with "long" runtimes.
>
> We need only one new call, for the 1st cfs task that is enqueued, to get
> these additional metrics. The call to arch_scale_cpu_capacity() can be
> removed once the latter is taken into account when computing the load
> (which currently scales only with frequency).
>
> For rt tasks, we must keep taking util_avg into account to have an idea of
> the rt level on the cpu, which for dl is instead given by the bandwidth.
>
> ---
> kernel/sched/fair.c | 27 +++++++++++++++++++++++++++
> kernel/sched/pelt.c | 8 ++++++--
> kernel/sched/sched.h | 4 +++-
> 3 files changed, 36 insertions(+), 3 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index eac1f9a..1682ea7 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5148,6 +5148,30 @@ static inline void hrtick_update(struct rq *rq)
> }
> #endif
>
> +static inline void update_cfs_wait_util_avg(struct rq *rq)
> +{
> + /*
> + * If a cfs task is already enqueued, we don't have anything to do
> + * because we have already updated the non-waiting time.
> + */
> + if (rq->cfs.h_nr_running)
> + return;
> +
> + /*
> + * If rt is running, we update the non-waiting time before increasing
> + * cfs.h_nr_running.
> + */
> + if (rq->curr->sched_class == &rt_sched_class)
> + update_rt_rq_load_avg(rq_clock_task(rq), rq, 1);
> +
> + /*
> + * If dl is running, we update the non-waiting time before increasing
> + * cfs.h_nr_running.
> + */
> + if (rq->curr->sched_class == &dl_sched_class)
> + update_dl_rq_load_avg(rq_clock_task(rq), rq, 1);
> +}
> +

Please correct me if I'm wrong, but the CFS preemption decay happens in
set_next_entity -> update_load_avg when the CFS task is scheduled again after
the preemption. Could we not fix this issue, then, by doing our UTIL_EST magic
from set_next_entity? But yeah, we probably need to be careful with the
overhead...
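
To make that concrete, here is a rough, untested sketch against the current
set_next_entity() (the util_est_refresh() helper is hypothetical and only
marks where such a hook would go):

static void
set_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
	/* 'current' is not kept within the tree. */
	if (se->on_rq) {
		update_stats_wait_end(cfs_rq, se);
		__dequeue_entity(cfs_rq, se);
		/* The preempted task's util_avg gets decayed here. */
		update_load_avg(cfs_rq, se, UPDATE_TG);
		/*
		 * Hypothetical hook: restore the pre-preemption
		 * utilization estimate so that schedutil does not see
		 * the decayed util_avg and pick a too-low OPP.
		 */
		util_est_refresh(cfs_rq, se);
	}

	update_stats_curr_start(cfs_rq, se);
	cfs_rq->curr = se;
	/* ... rest unchanged ... */
}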

IMO it feels like overkill to account a dl util_avg when we already have DL's
running bandwidth we can use. I understand it may be too instantaneous, but
perhaps we can fix CFS's problems within CFS itself and not have to do this
kind of extra external accounting?
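
For reference, the running bandwidth is already exposed to schedutil as a
utilization via cpu_util_dl() in kernel/sched/sched.h (quoting from memory,
so the exact form may differ):

static inline unsigned long cpu_util_dl(struct rq *rq)
{
	/*
	 * running_bw is a fixed-point fraction with 2^BW_SHIFT scale;
	 * shift it into the SCHED_CAPACITY_SCALE range that util_avg uses.
	 */
	return (rq->dl.running_bw * SCHED_CAPACITY_SCALE) >> BW_SHIFT;
}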

I also feel it's better if we don't have to call update_{rt,dl}_rq_load_avg
from within the CFS class as is being done above.

thanks,

- Joel