Re: [PATCH 0/7 v5] sched/fair: Rework EAS to handle more cases

From: Vincent Guittot
Date: Tue Apr 15 2025 - 09:51:06 EST


Hi Christian,

On Thu, 3 Apr 2025 at 14:37, Christian Loehle <christian.loehle@xxxxxxx> wrote:
>
> On 3/2/25 21:05, Vincent Guittot wrote:
> > The current Energy Aware Scheduler has some known limitations which have
> > become more and more visible with features like uclamp as an example. This
> > series tries to fix some of those issues:
> > - tasks stacked on the same CPU of a PD
> > - tasks stuck on the wrong CPU.
> >

...

> >
> > include/linux/energy_model.h | 111 ++----
> > kernel/sched/fair.c | 721 ++++++++++++++++++++++++-----------
> > kernel/sched/sched.h | 2 +
> > 3 files changed, 518 insertions(+), 316 deletions(-)
> >
>
> Hi Vincent,
> so I've invested some time into running tests with the series.
> To further narrow down which patch we can attribute a change in
> behavior I've compared the following:
> - Patches 1 to 3 applied, comparing your proposed feec() (B)
> only to the baseline feec() (A).
> - All patches applied, using a static branch to enable (C) and
> disable (D) push mechanism for misfit tasks (if disabled only
> the 'tasks stuck on CPU' mechanism triggers here).
>
> I've looked at
> 1) YouTube 4K video playback
> 2) Dr.Arm (in-house ARM game)
> 3) VideoScroller which loads a new video every 3s
> 4) Idle screen on
> 5) Speedometer2.0 in Chromium
>
> The device tested is the Pixel6 with 6.12 kernel + backported
> scheduler patches.

What do you mean by "6.12 kernel + backported scheduler patches"? Do
you mean android mainline v6.12?

I run my tests with android mainline v6.13 + scheduler patches for
v6.14 and v6.15-rc1. Do you mean the same? v6.12 misses a number of
important patches with regard to thread accounting.

> For power measurements the onboard energy-meter is used [1].

Same for me.

>
> Mainline feec() A is the baseline for all. All workloads are run for
> 10mins with the exception of Speedometer 2.0
> (one iteration each for 5 iterations with cooldowns).

What do you mean exactly by "one iteration each for 5 iterations with
cooldowns"?

>
> 1) YouTube 4K video

I'd like to reproduce this use case because my test with 4k video
playback shows similar or slightly better power consumption (2%) with
this patch.

Do you have details about this use case that you can share?


> +4.5% power with all other tested (the regression already shows with B,
> no further change with C & D).
> (cf. +18.5% power with CAS).
> The power regression comes from increased average frequency on all
> 3 clusters.

I'm interested in understanding why the average frequency increases,
as the OPP remains the 1st level of selection, and for lightly loaded
use cases we should not see much difference. That's what I see on my
4k video playback use case.

And I will also look at why CAS is better in your case.

> No dropped frames in all tested A to D.
>
> 2) Dr.Arm (in-house ARM game)
> +9.9% power with all other tested (the regression already shows with B,
> no further change with C & D).
> (cf. +3.7% power with CAS, new feec() performs worse than CAS here.)
> The power regression comes from increased average frequency on all
> 3 clusters.

I suppose that I won't be able to reproduce this one.

>
> 3) VideoScroller
> No difference in terms of power for A to D.
> Specifically even the push mechanism with misfit enabled/disabled
> doesn't make a noticeable difference in per-cluster energy numbers.
>
> 4) Idle screen on
> No difference in power for all for A to D.

I see a difference here, mainly for DDR power consumption with a 7%
saving compared to mainline, and 2% on the CPU clusters.

>
> 5) Speedometer2.0 in Chromium
> Both power and score comparable for A to D.
>
> As mentioned in the thread already the push mechanism
> (without misfit tasks) (D) triggers only once every 2-20 minutes,
> depending on the workload (all tested here were without any
> UCLAMP_MAX tasks).
> I also used the device manually just to check if I'm not missing
> anything here, I wasn't.
> This push task mechanism shouldn't make any difference without
> UCLAMP_MAX.

On the push mechanism side, I'm surprised that you don't get more pushes
than once every 2-20 minutes. On Speedometer, I get around 170 push
fair and 600 check pushable calls which end with a task migration
during the 75 seconds of the test, and many more calls which end on
the same CPU. This also needs to be compared with the 70% of those 75
seconds spent in the overutilized state, during which we don't push.
In the lightly loaded case, the condition is currently too conservative
to trigger the push task mechanism, but that's also expected, as the
intent is to be conservative.
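
To put rough numbers on the comparison above, here is a
back-of-the-envelope sketch in Python. The constants are the figures
quoted in the paragraph above; treating every counted event as falling
outside the overutilized periods is an assumption made for illustration,
and the helper name is hypothetical.

```python
# Figures quoted above for one Speedometer run.
TEST_SECONDS = 75
OU_PERCENT = 70        # share of the run spent in the overutilized state
PUSH_FAIR = 170        # "push fair" calls ending with a task migration
CHECK_PUSHABLE = 600   # "check pushable" calls ending with a task migration

# Time during which the push mechanism is allowed to act at all.
non_ou_seconds = TEST_SECONDS * (100 - OU_PERCENT) / 100


def per_second(count, seconds):
    """Average event rate over a window (hypothetical helper)."""
    return count / seconds


print(non_ou_seconds)                                        # 22.5
print(round(per_second(PUSH_FAIR, non_ou_seconds), 1))       # 7.6
print(round(per_second(CHECK_PUSHABLE, non_ou_seconds), 1))  # 26.7
```

Under that assumption the migrations are concentrated in about 22.5
seconds rather than spread over the full 75 seconds.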

The fact that OU triggers too quickly limits the impact of push and the
feec rework.
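
As a reminder of why OU can trigger early, here is a simplified Python
model of the overutilized check. The 1280/1024 margin mirrors the
kernel's fits_capacity() macro; everything else (the function names
other than fits_capacity, the capacities and the utilization values) is
hypothetical, and uclamp handling is left out on purpose.

```python
def fits_capacity(util, capacity):
    # Same ~20% margin as the kernel macro: util * 1280 < capacity * 1024.
    return util * 1280 < capacity * 1024


def root_domain_overutilized(cpus):
    """cpus: iterable of (util, capacity) pairs.

    A single CPU above the margin is enough to flag the whole system.
    """
    return any(not fits_capacity(util, cap) for util, cap in cpus)


# Hypothetical values: one small CPU at 210/250 trips the flag even
# though the big CPU (300/1024) is mostly idle.
busy = [(100, 250), (210, 250), (300, 1024)]
calm = [(100, 250), (190, 250), (300, 1024)]

print(root_domain_overutilized(busy))  # True
print(root_domain_overutilized(calm))  # False
```

In this model one busy small CPU is enough to disable EAS placement,
and with it the push path, for every CPU.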

uclamp_max sees a difference with the push mechanism, which is another
argument for using it.

And this 1st step is quite conservative, before extending the cases
which can benefit from push and the feec rework, as explained at OSPM.

>
> The increased average frequency in 1) and 2) is caused by the
> deviation from max-spare-cap in feec(), which previously ensured
> as much headroom as possible until we have to raise the OPP of the
> cluster.
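
For readers following the thread, the max-spare-cap behaviour described
just above can be illustrated with a toy Python model. This is not
kernel code: the Cpu class, both function names and all numbers are
hypothetical, and the model only assumes that the OPP of a performance
domain (PD) has to cover its busiest CPU.

```python
from dataclasses import dataclass


@dataclass
class Cpu:
    cpu_id: int
    capacity: int  # capacity at the highest OPP
    util: int      # current utilization, same scale as capacity


def max_spare_cap_cpu(pd, task_util):
    """Return the CPU left with the most spare capacity after placement."""
    best, best_spare = None, -1
    for cpu in pd:
        spare = cpu.capacity - (cpu.util + task_util)
        if spare > best_spare:
            best, best_spare = cpu, spare
    return best


def pd_max_util(pd, target, task_util):
    """Highest per-CPU utilization in the PD once the task runs on target."""
    return max(c.util + (task_util if c is target else 0) for c in pd)


# Hypothetical numbers, for illustration only.
pd = [Cpu(0, 1024, 300), Cpu(1, 1024, 100), Cpu(2, 1024, 250)]
task_util = 150

chosen = max_spare_cap_cpu(pd, task_util)
print(chosen.cpu_id, pd_max_util(pd, chosen, task_util))  # 1 300
print(pd_max_util(pd, pd[0], task_util))                  # 450
```

Placing the task on the max-spare-cap CPU leaves the PD's highest
utilization at 300, while stacking it on CPU 0 raises it to 450, which
in this model is what would push the cluster to a higher OPP.
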
>
> So all in all this regresses power on some crucial EAS workloads.
> I couldn't find a real-world workload where the
> 'less co-scheduling/contention' strategy of feec() showed a benefit.
> Did you have a specific workload for this in mind?
>
> [1]
> https://tooling.sites.arm.com/lisa/latest/sections/api/generated/lisa.analysis.pixel6.Pixel6Analysis.html#lisa.analysis.pixel6.Pixel6Analysis.df_power_meter