Re: [PATCH] fair/util_est: Separate util_est_dequeue() for cfs_rq_util_change
From: Xuewen Yan
Date: Tue Dec 15 2020 - 07:58:01 EST
On Tue, Dec 15, 2020 at 5:39 PM Vincent Guittot
<vincent.guittot@xxxxxxxxxx> wrote:
>
> On Mon, 14 Dec 2020 at 19:46, Dietmar Eggemann <dietmar.eggemann@xxxxxxx> wrote:
> >
> > On 11/12/2020 13:03, Ryan Y wrote:
> > > Hi Dietmar,
> > >
> > > Yes! That's exactly what I meant.
> > >
> > >> The issue is that sugov_update_[shared\|single] -> sugov_get_util() ->
> > >> cpu_util_cfs() operates on an old cfs_rq->avg.util_est.enqueued value?
> > >
> > > well, because of this, when the p dequeued, _task_util_est(p) should be
> > > subtracted before cfs_rq_util_change().
> > > however, the original util_est_dequeue() dequeue the util_est and update
> > > the
> > > p->se.avg.util_est together.
> > > so I separate the original util_est_dequeue() to deal with the issue.
> >
> > OK, I see.
> >
> > I ran a testcase '50% periodic task 'task0-0' (8ms/16ms)' with
> > PELT + proprietary trace events within dequeue_task_fair() call:
> >
> > task0-0-1710 [002] 218.215535: sched_pelt_se: cpu=2 path=(null) comm=task0-0 pid=1710 load=596 runnable=597 util=597 update_time=218123022336
> > task0-0-1710 [002] 218.215536: sched_pelt_cfs: cpu=2 path=/ load=597 runnable=597 util=597 update_time=218123022336
> > task0-0-1710 [002] 218.215538: bprint: sugov_get_util: CPU2 rq->cfs.avg.util_avg=597 rq->cfs.avg.util_est.enqueued=601
> > task0-0-1710 [002] 218.215540: sched_util_est_cfs: cpu=2 path=/ enqueued=0 ewma=0 util=597
> > task0-0-1710 [002] 218.215542: bprint: dequeue_task_fair: CPU2 [task0-0 1710] rq->cfs.avg.util_avg=[576->597] rq->cfs.avg.util_est.enqueued=[601->0]
> >
> > It's true that 'sugov_get_util() -> cpu_util_cfs()' can use
> > rq->cfs.avg.util_est.enqueued before _task_util_est(p) is subtracted
> > from it.
> >
> > But isn't rq->cfs.avg.util_est.enqueued (in this case 601) always close
> > to rq->cfs.avg.util_avg (597) since the task was just running?
> > The cfs_rq utilization contains a blocked (sleeping) task.
>
> There will be a difference if the task alternates long and short runs
> in which case util_avg is lower than util_est. But even in this case,
> the freq will be update at next enqueue/dequeue/tick.
> The only real case could be when cpu goes idle in shallow state (WFI)
> which is impacted by the freq/voltage. But even in this case, the
> situation should not be long
>
> That being said, I agree that the value used by schedutil is not
> correct at dequeue
>
Hi Dietmar,
as Vincent said, like the following scenario:
running sleep
running sleep
|---------------------------------------|
|--------|
^^
at the ^^ time, the util_avg is lower than util_est.
we hope that the CPU frequency would be reduced as soon as possible after
the task is dequeued. But the issue affects the sensitivity of cpu
frequency reduce.
worse, since the time, if there is no task enqueue or other scenarios where the
load is updated, the cpu may always maintain a high frequency.
if keep the util_est_dequeue as it is, are there other concerns,
or this patch would affect other places of system?
> >
> > If I would run with your patch cpu_util_cfs() would chose between 597 and 0
> > whereas without it does between 597 and 601.
> >
> > Do you have a specific use case in mind? Or even test results showing a benefit
> > of your patch?
> >
> > > Dietmar Eggemann <dietmar.eggemann@xxxxxxx> 于2020年12月11日周五 下午7:30写道:
> > >
> > >> Hi Yan,
> > >>
> > >> On 09/12/2020 11:44, Xuewen Yan wrote:
> > >>> when a task dequeued, it will update it's util, and cfs_rq_util_change
> > >>> would check rq's util, if the cfs_rq->avg.util_est.enqueued is bigger
> > >>> than cfs_rq->avg.util_avg, but because the cfs_rq->avg.util_est.enqueued
> > >>> didn't be decreased, this would cause bigger cfs_rq_util by mistake,
> > >>> as a result, cfs_rq_util_change may change freq unreasonablely.
> > >>>
> > >>> separate the util_est_dequeue() into util_est_dequeue() and
> > >>> util_est_update(), and dequeue the _task_util_est(p) before update util.
> > >>
> > >> The issue is that sugov_update_[shared\|single] -> sugov_get_util() ->
> > >> cpu_util_cfs() operates on an old cfs_rq->avg.util_est.enqueued value?
> > >>
> > >> cpu_util_cfs()
> > >>
> > >> if (sched_feat(UTIL_EST))
> > >> util = max_t(util, READ_ONCE(rq->cfs.avg.util_est.enqueued))
> > >> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> > >>
> > >> dequeue_task_fair() (w/ your patch, moving (1) before (2))
> > >>
> > >> /* (1) update cfs_rq->avg.util_est.enqueued */
> > >> util_est_dequeue()
> > >>
> > >> /* (2) potential p->se.avg.util_avg update */
> > >> /* 2 for loops */
> > >> for_each_sched_entity()
> > >>
> > >> /* this can only lead to a freq change for a root cfs_rq */
> > >> (dequeue_entity() ->) update_load_avg() -> cfs_rq_util_change()
> > >> -> cpufreq_update_util() ->...-> sugov_update_[shared\|single]
> > >>
> > >> /* (3) potential update p->se.avg.util_est */
> > >> util_est_update()
> > >>
> > >>
> > >> We do need (3) after (2) because of:
> > >>
> > >> util_est_update()
> > >> ...
> > >> ue.enqueued = (task_util(p) | UTIL_AVG_UNCHANGED); task_util
> > >> ... ^^^^^^^^^^^^^
> > >> p->se.avg.util_avg
> > >>
> > >>
> > >> Did I get this right?
> > >>
> > >> [...]