Re: [RFC PATCH 0/1] sched/pelt: Change PELT halflife at runtime

From: Vincent Guittot
Date: Tue Feb 21 2023 - 04:30:04 EST


On Mon, 20 Feb 2023 at 14:54, Vincent Guittot
<vincent.guittot@xxxxxxxxxx> wrote:
>
> On Fri, 17 Feb 2023 at 14:54, Dietmar Eggemann <dietmar.eggemann@xxxxxxx> wrote:
> >
> > On 09/02/2023 17:16, Vincent Guittot wrote:
> > > On Tue, 7 Feb 2023 at 11:29, Dietmar Eggemann <dietmar.eggemann@xxxxxxx> wrote:
> > >>
> > >> On 09/11/2022 16:49, Peter Zijlstra wrote:
> > >>> On Tue, Nov 08, 2022 at 07:48:43PM +0000, Qais Yousef wrote:
> > >>>> On 11/07/22 14:41, Peter Zijlstra wrote:
> > >>>>> On Thu, Sep 29, 2022 at 03:41:47PM +0100, Kajetan Puchalski wrote:
> >
> > [...]
> >
> > >> (B) *** Where does util_est_faster help exactly? ***
> > >>
> > >> It turns out that the score improvement comes from the more aggressive
> > >> DVFS request ('_freq') (1) due to the CPU util boost in sugov_get_util()
> > >> -> effective_cpu_util(..., cpu_util_cfs(), ...).
> > >>
> > >> At the beginning of an episode (e.g. beginning of an image list view
> > >> fling) when the periodic tasks (~1/16ms (60Hz) at 'max uArch'/'max CPU
> > >> frequency') of the Android Graphics Pipeline (AGP) start to run, the
> > >> CPU Operating Performance Point (OPP) is often so low that those tasks
> > >> run more like 10/16ms, which lets the test application count a lot of
> > >> Jankframes at those moments.
> > >
> > > I don't see how util_est_faster can help this 1ms task here. It will
> > > most probably never be preempted during this 1ms. For such an Android
> >
> > It's 1/16ms at max CPU frequency and on a big CPU. Could be a longer
> > runtime with min CPU frequency at little CPU. I see runtime up to 10ms
> > at the beginning of a test episode.
> >
> > Like I mentioned below, it could also be that the tasks have more work
> > to do at the beginning. It's easy to spot using Google's perfetto and
> > those moments also correlate with the occurrence of jankframes. I'm not
> > yet sure how much this has to do with the perfetto instrumentation though.
> >
> > But you're right, on top of that, there is preemption (e.g. of the UI
> > thread) by other threads (render thread, involved binder threads,
> > surfaceflinger, etc.) going on. So the UI thread could be
> > running+runnable for > 20ms, again marked as a jankframe.
> >
> > > Graphics Pipeline short task, hasn't uclamp_min been designed for this,
> > > and isn't it a better solution?
> >
> > Yes, it has. I'm not sure how feasible this is to do for all tasks
> > involved. I'm thinking about the Binder threads here for instance.
>
> Yes, that probably cannot help for all threads, but some system
> threads like surfaceflinger and the graphics composer should probably
> benefit from uclamp min.
>
> >
> > [...]
> >
> > >> It looks like 'util_est_faster' can prevent Jankframes by boosting CPU
> > >> util when periodic tasks have a longer runtime compared to when they reach
> > >> steady state.
> > >>
> > >> The result is very similar to PELT halflife reduction. The advantage is
> > >> that 'util_est_faster' is only activated selectively when the runtime of
> > >> the current task in its current activation is long enough to create this
> > >> CPU util boost.
> > >
> > > IIUC how util_est_faster works, it removes the waiting time when
> > > sharing cpu time with other tasks. So as long as there is no time spent
> > > runnable but not running, the result is the same as current util_est.
> > > util_est_faster makes a difference only when the task alternates
> > > between runnable and running slices.
> > > Have you considered using runnable_avg metrics in the increase of cpu
> > > freq? This takes into account the runnable slice and not only the
> > > running time, and increases faster than util_avg when tasks compete for
> > > the same CPU.
> >
> > Good idea! No, I haven't.
> >
> > I just glanced over the code, there shouldn't be an advantage in terms
> > of more recent update between `curr->sum_exec_runtime` and
> > update_load_avg(cfs_rq) even in the taskgroup case.
> >
> > Per-task view:
> >
> > https://nbviewer.org/github/deggeman/lisa/blob/ipynbs/ipynb/scratchpad/cpu_runnable_avg_boost.ipynb
> >
> >
> > All tests ran 10 iterations of all Jankbench sub-tests. (Reran the
> > `max_util_scaled_util_est_faster_rbl_freq` once with very similar
> > results. Just to make sure the results are somehow correct).
> >
> > Max_frame_duration:
> > +------------------------------------------+------------+
> > | kernel                                   | value      |
> > +------------------------------------------+------------+
> > | base-a30b17f016b0                        | 147.571352 |
> > | pelt-hl-m2                               | 119.416351 |
> > | pelt-hl-m4                               | 96.473412  |
> > | scaled_util_est_faster_freq              | 126.646506 |
> > | max_util_scaled_util_est_faster_rbl_freq | 157.974501 | <-- !!!
> > +------------------------------------------+------------+
> >
> > Mean_frame_duration:
> > +------------------------------------------+-------+-----------+
> > | kernel                                   | value | perc_diff |
> > +------------------------------------------+-------+-----------+
> > | base-a30b17f016b0                        | 14.7  | 0.0%      |
> > | pelt-hl-m2                               | 13.6  | -7.5%     |
> > | pelt-hl-m4                               | 13.0  | -11.68%   |
> > | scaled_util_est_faster_freq              | 13.7  | -6.81%    |
> > | max_util_scaled_util_est_faster_rbl_freq | 12.1  | -17.85%   |
> > +------------------------------------------+-------+-----------+
> >
> > Jank percentage (Jank deadline 16ms):
> > +------------------------------------------+-------+-----------+
> > | kernel                                   | value | perc_diff |
> > +------------------------------------------+-------+-----------+
> > | base-a30b17f016b0                        | 1.8   | 0.0%      |
> > | pelt-hl-m2                               | 1.8   | -4.91%    |
> > | pelt-hl-m4                               | 1.2   | -36.61%   |
> > | scaled_util_est_faster_freq              | 1.3   | -27.63%   |
> > | max_util_scaled_util_est_faster_rbl_freq | 0.8   | -54.86%   |
> > +------------------------------------------+-------+-----------+
> >
> > Power usage [mW] (total - all CPUs):
> > +------------------------------------------+-------+-----------+
> > | kernel                                   | value | perc_diff |
> > +------------------------------------------+-------+-----------+
> > | base-a30b17f016b0                        | 144.4 | 0.0%      |
> > | pelt-hl-m2                               | 141.6 | -1.97%    |
> > | pelt-hl-m4                               | 163.2 | 12.99%    |
> > | scaled_util_est_faster_freq              | 132.3 | -8.41%    |
> > | max_util_scaled_util_est_faster_rbl_freq | 133.4 | -7.67%    |
> > +------------------------------------------+-------+-----------+
> >
> > There is a regression in `Max_frame_duration` but `Mean_frame_duration`,
> > `Jank percentage` and `Power usage` are better.
>
> The max frame duration is interesting. Could it be the very 1st frame
> of the test?
> It's interesting that it's even worse than baseline, whereas it should
> take the max of baseline and runnable_avg.
>
> >
> > So maybe DVFS boosting in preempt-scenarios is really the thing here to
> > further improve the Android Graphics Pipeline.
> >
> > I ran the same test (boosting only for DVFS requests) with:
> >
> > -->8--
> >
> > diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> > index dbc56e8b85f9..7a4bf38f2920 100644
> > --- a/kernel/sched/sched.h
> > +++ b/kernel/sched/sched.h
> > @@ -2946,6 +2946,8 @@ static inline unsigned long cpu_util_cfs(int cpu)
> >  			     READ_ONCE(cfs_rq->avg.util_est.enqueued));
> >  	}
> >
> > +	util = max(util, READ_ONCE(cfs_rq->avg.runnable_avg));
> > +

Another reason why it gives better results could be that
cpu_util_cfs() is not only used for DVFS selection but also to track
the cpu utilization in load balance and EAS, so the cpu will be seen
as overloaded sooner and tasks will be spread around when there is
contention.

Could you try to take cfs_rq->avg.runnable_avg into account only when
selecting the frequency?
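
As a rough illustration of that split (a user-space model, not a patch:
the struct, its flattened fields and the cpu_util_cfs_freq() name are
made up for this sketch), cpu_util_cfs() would stay as it is for load
balance and EAS, and only a separate helper on the frequency selection
path would take runnable_avg into account:

```c
/* User-space model only; none of these names are kernel API. */
struct cfs_avg_model {
	unsigned long util_avg;
	unsigned long util_est;      /* stands in for avg.util_est.enqueued */
	unsigned long runnable_avg;
	unsigned long capacity_orig; /* stands in for capacity_orig_of(cpu) */
};

static unsigned long max_ul(unsigned long a, unsigned long b)
{
	return a > b ? a : b;
}

static unsigned long min_ul(unsigned long a, unsigned long b)
{
	return a < b ? a : b;
}

/* Unchanged view: what load balance and EAS would keep using. */
static unsigned long cpu_util_cfs(const struct cfs_avg_model *avg)
{
	unsigned long util = max_ul(avg->util_avg, avg->util_est);

	return min_ul(util, avg->capacity_orig);
}

/* Hypothetical helper, meant for the frequency selection path only. */
static unsigned long cpu_util_cfs_freq(const struct cfs_avg_model *avg)
{
	unsigned long util = max_ul(avg->util_avg, avg->util_est);

	/* Contention raises runnable_avg above util_avg, so boost on it. */
	util = max_ul(util, avg->runnable_avg);

	return min_ul(util, avg->capacity_orig);
}
```

With this split, a cpu where tasks compete gets a higher frequency
request, but it does not look more utilized to the rest of the
scheduler.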

That being said, I can see some places in load balance where
cfs_rq->avg.runnable_avg could give some benefits, like in
find_busiest_queue(), where it could be better to take the contention
into account when selecting the busiest queue.
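
To sketch the idea (again a user-space model; this is not how
find_busiest_queue() is written, and struct rq_model, rq_metric(),
pick_busiest() and the use_runnable flag are hypothetical names), the
selection could compare queues on max(util_avg, runnable_avg) instead
of util_avg alone:

```c
#include <stddef.h>

/* User-space model of a runqueue; not the kernel's struct rq. */
struct rq_model {
	unsigned long util_avg;
	unsigned long runnable_avg;
};

/* Metric used to compare queues, optionally aware of contention. */
static unsigned long rq_metric(const struct rq_model *rq, int use_runnable)
{
	if (use_runnable && rq->runnable_avg > rq->util_avg)
		return rq->runnable_avg;

	return rq->util_avg;
}

/* Return the index of the busiest queue, or -1 if there is none. */
static int pick_busiest(const struct rq_model *rqs, size_t nr,
			int use_runnable)
{
	unsigned long busiest_metric = 0;
	int busiest = -1;
	size_t i;

	for (i = 0; i < nr; i++) {
		unsigned long metric = rq_metric(&rqs[i], use_runnable);

		if (busiest < 0 || metric > busiest_metric) {
			busiest_metric = metric;
			busiest = (int)i;
		}
	}

	return busiest;
}
```

With three queues at (util_avg, runnable_avg) = (500, 500), (480, 900)
and (510, 510), the util-only comparison picks the third queue, whereas
the contention-aware one picks the second, where tasks are waiting.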

> >  	return min(util, capacity_orig_of(cpu));
> >
> > Thanks!
> >
> > -- Dietmar