Re: [RFC PATCH 0/1] sched/pelt: Change PELT halflife at runtime

From: Jian-Min Liu
Date: Tue Sep 20 2022 - 10:08:16 EST



Here is some updated test data from an Android phone which supports the
claim that switching the PELT HL at runtime is helpful functionality.

We switch the PELT HL at runtime depending on the scenario, e.g. pelt8
while playing a game, pelt32 while recording camera video. Supporting
runtime switching of the PELT HL gives us flexibility for different
workloads (see the small sketch below).
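
For reference, this is roughly how we drive the switch from our
performance middleware. The sysctl path/name used below
(sched_pelt_multiplier) is only an assumption about how the patch
exposes the multiplier; a multiplier of 1 corresponds to the 32ms HL,
2 to 16ms and 4 to 8ms:

/* Sketch only: the sysctl path/name is an assumption, not taken
 * from the patch. multiplier 1 -> pelt32, 2 -> pelt16, 4 -> pelt8.
 */
#include <stdio.h>

static int set_pelt_multiplier(int mult)
{
	FILE *f = fopen("/proc/sys/kernel/sched_pelt_multiplier", "w");

	if (!f)
		return -1;
	fprintf(f, "%d\n", mult);
	return fclose(f);
}

/* e.g. set_pelt_multiplier(4) when a game starts (pelt8),
 *      set_pelt_multiplier(1) when camera video recording starts (pelt32).
 */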

The table below shows the performance & power data points:

 -----------------------------------------------------------------------
|                      |                PELT halflife                  |
|                      |-----------------------------------------------|
|                      |      32       |      16       |       8       |
|                      |-----------------------------------------------|
|                      | avg  min  avg | avg  min  avg | avg  min  avg |
| Scenarios            | fps  fps  pwr | fps  fps  pwr | fps  fps  pwr |
|----------------------|-----------------------------------------------|
| HOK game 60fps       | 100  100  100 | 105 *134* 102 | 104 *152* 106 |
| HOK game 90fps       | 100  100  100 | 101 *114* 101 | 103 *129* 105 |
| HOK game 120fps      | 100  100  100 | 102 *124* 102 | 105 *134* 105 |
| FHD video rec. 60fps | 100  100  100 | n/a  n/a  n/a | 100  100  103 |
| Camera snapshot      | 100  100  100 | n/a  n/a  n/a | 100  100  102 |
 -----------------------------------------------------------------------

HOK ... Honour Of Kings, video game
FHD ... Full High Definition
fps ... frames per second
pwr ... power consumption

All table values are in %, relative to the 32ms PELT halflife baseline (= 100).


On Mon, 2022-08-29 at 07:54 +0200, Dietmar Eggemann wrote:
> Many of the Android devices still prefer to run PELT with a shorter
> halflife than the hardcoded value of 32ms in mainline.
>
> The Android folks claim better response time of display pipeline tasks
> (higher min and avg fps for 60, 90 or 120Hz refresh rate). Some of the
> benchmarks like PCmark web-browsing show higher scores when running
> with 16ms or 8ms PELT halflife. The gain in response time and
> performance is considered to outweigh the increase of energy
> consumption in these cases.
>
> The original idea of introducing a PELT halflife compile time option
> for 32, 16, 8ms from Patrick Bellasi in 2018
>
> https://lkml.kernel.org/r/20180409165134.707-1-patrick.bellasi@arm.com
>
> wasn't integrated into mainline mainly because of breaking the PELT
> stability requirement (see (1) below).
>
> We have been experimenting with a new idea from Morten Rasmussen to
> instead introduce an additional clock between task and pelt clock. This
> way the effect of a shorter PELT halflife of 8ms or 16ms can be
> achieved by left-shifting the elapsed time. This is similar to the use
> of time shifting of the pelt clock to achieve scale invariance in PELT.
> The implementation is from Vincent Donnefort with some minor
> modifications to align with current tip sched/core.
>
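
For illustration, a small self-contained userspace demo (plain C, not
the patch itself) of why left-shifting the elapsed time is equivalent
to running the unchanged 32ms decay math with a shorter halflife:

/* Demo: decaying over a left-shifted delta with a 32ms halflife gives
 * the same result as a 16ms (shift 1) or 8ms (shift 2) halflife.
 * Build with: gcc -O2 demo.c -lm
 */
#include <math.h>
#include <stdio.h>

static double decay(double val, double delta_ms, double halflife_ms)
{
	return val * pow(2.0, -delta_ms / halflife_ms);
}

int main(void)
{
	double v = 1024.0, d = 5.0;	/* arbitrary start value and delta */

	printf("16ms HL: %.2f  vs  32ms HL, delta << 1: %.2f\n",
	       decay(v, d, 16.0), decay(v, d * 2.0, 32.0));
	printf(" 8ms HL: %.2f  vs  32ms HL, delta << 2: %.2f\n",
	       decay(v, d, 8.0), decay(v, d * 4.0, 32.0));
	return 0;
}
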
> ---
>
> Known potential issues:
>
> (1) PELT stability requirement
>
> PELT halflife has to be larger than or equal to the scheduling period.
>
> The sched_period (sysctl_sched_latency) of a typical mobile device
> with 8 CPUs with the default logarithmical tuning is 24ms, so only the
> 32ms PELT halflife meets this. A shorter halflife of 16ms or even 8ms
> would break this.
>
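
The 24ms figure follows from the mainline defaults: a base
sysctl_sched_latency of 6ms scaled by 1 + ilog2(nr_cpus) under
SCHED_TUNABLESCALING_LOG. A tiny userspace sketch of the same
arithmetic:

#include <stdio.h>

static unsigned int ilog2_u(unsigned int x)
{
	unsigned int r = 0;

	while (x >>= 1)
		r++;
	return r;
}

int main(void)
{
	unsigned int cpus = 8;
	unsigned int base_latency_ms = 6;		/* normalized_sysctl_sched_latency */
	unsigned int factor = 1 + ilog2_u(cpus);	/* SCHED_TUNABLESCALING_LOG */

	printf("sched_period = %u ms\n", base_latency_ms * factor);	/* 24 ms */
	return 0;
}
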
> It looks like this problem might not exist anymore because of the PELT
> rewrite in 2015, i.e. with commit 9d89c257dfb9 ("sched/fair: Rewrite
> runnable load and utilization average tracking").
> Since then sched entities (tasks & task groups) and cfs_rq's are
> maintained independently rather than each entity update maintaining
> the cfs_rq at the same time.
>
> This seems to mitigate the issue that the cfs_rq signal is not correct
> when not all runnable entities are able to do a self update during a
> PELT halflife window.
>
> That said, I'm not entirely sure whether the entity-cfs_rq
> synchronization is the only issue behind this PELT stability
> requirement.
>
>
> (2) PELT utilization versus util_est (estimated utilization)
>
> The PELT signal of a periodic task oscillates with a higher peak
> amplitude when using a smaller halflife. For a typical periodic task
> of the display pipeline with a runtime/period of 8ms/16ms the peak
> amplitude is at ~40 for 32ms, at ~80 for 16ms and at ~160 for 8ms.
> Util_est stores the util_avg peak as util_est.enqueued per task.
>
> With an additional exponential weighted moving average (ewma) to
> smooth task utilization decreases, util_est values of the runnable
> tasks are aggregated on the root cfs_rq.
> CPU and task utilization for CPU frequency selection and task
> placement is the max value out of util_est and util_avg.
> I.e. because of how util_est is implemented, higher CPU Operating
> Performance Points and more capable CPUs are already chosen when
> using a smaller PELT halflife.
>
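
A small self-contained simulation (a simplified util_avg model,
ignoring the fixed-point and LOAD_AVG_MAX details of the real PELT
code) shows the oscillation roughly doubling each time the halflife is
halved, consistent with the ~40/~80/~160 amplitudes quoted above (read
as roughly half the peak-to-trough swing):

/* Per 1ms step: util = util * y + 1024 * (1 - y) while running,
 * util *= y while sleeping, with y = 2^(-1/halflife_ms).
 * Task pattern: 8ms running / 16ms period. Build with: gcc sim.c -lm
 */
#include <math.h>
#include <stdio.h>

int main(void)
{
	const int halflives[] = { 32, 16, 8 };

	for (int i = 0; i < 3; i++) {
		double y = pow(2.0, -1.0 / halflives[i]);
		double util = 0.0, lo = 1024.0, hi = 0.0;

		for (int ms = 0; ms < 2000; ms++) {
			int running = (ms % 16) < 8;

			util = util * y + (running ? 1024.0 * (1.0 - y) : 0.0);
			if (ms > 1000) {	/* steady state only */
				if (util < lo)
					lo = util;
				if (util > hi)
					hi = util;
			}
		}
		printf("halflife %2dms: util oscillates between ~%.0f and ~%.0f\n",
		       halflives[i], lo, hi);
	}
	return 0;
}
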
>
> (3) Wrong PELT history when switching PELT multiplier
>
> The PELT history becomes stale the moment the PELT multiplier is
> changed during runtime. So all decisions based on PELT are skewed for
> the time interval needed to produce LOAD_AVG_MAX (the sum of the
> infinite geometric series), whose value is ~345ms for halflife=32ms
> (smaller for 8ms or 16ms).
>
> Rate limiting the PELT multiplier change to this value does not solve
> the issue here. So the user would have to live with possibly incorrect
> decisions during these PELT multiplier transition times.
>
> ---
>
> It looks like individual task boosting, e.g. via uclamp_min, possibly
> abstracted by middleware frameworks like the Android Dynamic
> Performance Framework (ADPF), would be the way to go here, but until
> this is fully available and adopted some Android folks will still
> prefer the overall system boosting they achieve by running with a
> shorter PELT halflife.
>
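
For completeness, a minimal sketch of that per-task boosting
alternative via uclamp_min and sched_setattr(); struct layout and flag
values follow include/uapi/linux/sched.h, raw syscall since glibc has
no wrapper:

#include <stdint.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

struct sched_attr {
	uint32_t size;
	uint32_t sched_policy;
	uint64_t sched_flags;
	int32_t  sched_nice;
	uint32_t sched_priority;
	uint64_t sched_runtime;
	uint64_t sched_deadline;
	uint64_t sched_period;
	uint32_t sched_util_min;
	uint32_t sched_util_max;
};

#define SCHED_FLAG_KEEP_ALL		0x18	/* keep policy and params */
#define SCHED_FLAG_UTIL_CLAMP_MIN	0x20

int main(void)
{
	struct sched_attr attr = {
		.size		= sizeof(attr),
		.sched_flags	= SCHED_FLAG_KEEP_ALL | SCHED_FLAG_UTIL_CLAMP_MIN,
		.sched_util_min	= 512,	/* boost e.g. a display pipeline task */
	};

	/* pid 0 == calling task; last argument is flags (unused, 0) */
	if (syscall(SYS_sched_setattr, 0, &attr, 0))
		perror("sched_setattr");
	return 0;
}
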
> Vincent Donnefort (1):
> sched/pelt: Introduce PELT multiplier
>
> kernel/sched/core.c  |  2 +-
> kernel/sched/pelt.c  | 60 ++++++++++++++++++++++++++++++++++++++++++++
> kernel/sched/pelt.h  | 42 ++++++++++++++++++++++++++++---
> kernel/sched/sched.h |  1 +
> 4 files changed, 100 insertions(+), 5 deletions(-)
>