Re: [RFC][PATCH v021 0/9] cpufreq: intel_pstate: Enable EAS on hybrid platforms without SMT
From: Christian Loehle
Date: Sat Feb 01 2025 - 07:43:21 EST
On 1/27/25 13:57, Rafael J. Wysocki wrote:
> On Sat, Jan 25, 2025 at 12:18 PM Dietmar Eggemann
> <dietmar.eggemann@xxxxxxx> wrote:
>>
>> On 29/11/2024 16:55, Rafael J. Wysocki wrote:
>>
>> [...]
>>
>>> For easier access, the series is available on the experimental/intel_pstate
>>> branch in linux-pm.git:
>>>
>>> https://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm.git/log/?h=experimental/intel_pstate
>>
>> I was wondering how we can test the EAS behaviour (power/perf) on Intel
>> hybrid machines.
>
> Thanks a lot for looking into this, much appreciated!
>
>> I have system-wide RAPL 'power/energy-{cores,pkg}' events for power
>> (energy) on my i7-13700K (nosmt) so I can run an rt-app workload
>> (e.g. 30 5% tasks (0.8ms/16ms)) with:
>>
>> perf stat -e power/energy-cores/,power/energy-pkg/ --repeat 10 ./rt-app.sh
>>
>> Plus I can check for negative slack for those rt-app-test tasks (perf)
>> and do ftrace-based task placement evaluation.
>>
>> base:
>>
>> Performance counter stats for 'system wide' (10 runs):
>>
>> 52.67 Joules power/energy-cores/ ( +- 1.24% )
>> 85.09 Joules power/energy-pkg/ ( +- 0.83% )
>>
>> 34.922801 +- 0.000736 seconds time elapsed ( +- 0.00% )
>>
>>
>> EAS:
>>
>> Performance counter stats for 'system wide' (10 runs):
>>
>> 45.55 Joules power/energy-cores/ ( +- 1.07% )
>> 75.73 Joules power/energy-pkg/ ( +- 0.67% )
>>
>> 34.93183 +- 0.00514 seconds time elapsed ( +- 0.01% )
>>
>> Do you have another (maybe more sophisticated) test methodology?
>
> Not really more sophisticated, but we cast a wider net, so to speak.
>
> For task placement testing we use an internal utility that can create
> arbitrary synthetic workloads and plot CPU utilization (and other
> things) while they are running. It is kind of similar to rt-app
> AFAICS.
>
> We also run various benchmarks and measure energy usage during these
> runs, first in order to check if EAS helps in the cases when it is
> expected to help, but also to see how it affects the benchmark scores
> in general (because we don't want it to make too much of a "negative"
> difference for "performance" workloads).
Any insights are always appreciated.
I have an OSPM talk accepted about the recent EAS overutilized
proposals, which also touches on being able to switch out of EAS
quickly enough. I will include some x86 results from our test
machine as well.
>
> The above results are basically in-line with what we are observing,
> but we often see less of a difference in terms of energy usage between
> the baseline and EAS enabled.
>
> We also see a lot of task migrations between CPUs in the "low-cost"
> PD, mostly in the utilization range where we would expect EAS to make
> a difference. Those migrations are a bit of a concern although they
> don't seem to affect benchmark scores.
>
> We think that those migrations are related to the preference of CPUs
> with the largest spare capacity, so I'm working on an alternative
> approach to enabling EAS that will use per-CPU PDs to see if the
> migrations can be reduced this way.
We've actually had something like this, you might be interested
in [1].
You'd want something more flexible in terms of the margins (or a
non-energy-based approach using e.g. spare capacity [2]), but is the
idea just to sidestep the CPU selection within the cluster?
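To make the margin idea a bit more concrete, here is a toy user-space
sketch (explicitly not kernel code; the struct, MARGIN_PCT and all the
numbers are made up for illustration): stay on prev_cpu within a PD
unless another CPU offers clearly more spare capacity, instead of
always chasing the largest spare capacity.

/*
 * Toy model of a "sticky" intra-PD CPU pick: only migrate away from
 * prev if the spare-capacity gain exceeds a margin.  Illustration
 * only, nothing here is taken from the actual kernel code.
 */
#include <stdio.h>

#define MARGIN_PCT	5	/* hypothetical stickiness margin */

struct cpu {
	int id;
	unsigned long cap;	/* capacity of this CPU */
	unsigned long util;	/* current utilization */
};

static unsigned long spare_cap(const struct cpu *c)
{
	return c->util >= c->cap ? 0 : c->cap - c->util;
}

/* Pick a CPU in one PD, biased towards prev to avoid ping-ponging. */
static int pick_cpu(const struct cpu *pd, int nr, int prev)
{
	unsigned long best_spare = spare_cap(&pd[prev]);
	int best = prev;

	for (int i = 0; i < nr; i++) {
		unsigned long s = spare_cap(&pd[i]);

		/* Only move if the gain is clearly above the margin. */
		if (s * 100 > best_spare * (100 + MARGIN_PCT)) {
			best_spare = s;
			best = i;
		}
	}
	return pd[best].id;
}

int main(void)
{
	struct cpu ecores[] = {
		{  8, 300, 120 }, {  9, 300, 115 },
		{ 10, 300, 118 }, { 11, 300,  40 },
	};

	/* prev is index 0 (CPU 8); only CPU 11 beats it by more than 5%. */
	printf("picked CPU %d\n", pick_cpu(ecores, 4, 0));
	return 0;
}

Where the margin would come from (energy model, fixed percentage, idle
state cost) is of course the interesting question.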
Is there anything specifically worrying you about frequent e-core
wakeup migrations? A few things come to mind immediately: idle state
latency, cache, DVFS, per-core internals like branch predictor
training, and maybe turbo states would also favor keeping the same
core(s) active?
(I've played with the series too and still have lots of questions
about how this interacts with turbo states, but given that we can't
really trigger them deterministically, trying to experiment/measure
anything seems rather futile?)
Interestingly, if anything we were more concerned with reducing CPU
wakeups on the big cores, because of their higher static leakage,
while the little cores have low static leakage and low cpuidle wakeup
cost and latency.
[1]
https://lore.kernel.org/lkml/20220412134220.1588482-1-vincent.donnefort@xxxxxxx/
It should be noted that we were always more concerned with the uArch
differences than with breaking ties between intra-cluster CPUs, simply
because that's where the big efficiency gains are.
[2]
Vincent Guittot is currently proposing this:
https://lore.kernel.org/lkml/20241217160720.2397239-4-vincent.guittot@xxxxxxxxxx/
I don't think it would work well out of the box because of the
single-OPP approach you took, but maybe going from "same OPP" to e.g.
a "5% capacity difference" remedies that?
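To illustrate that last point, the relaxation I have in mind is
roughly the following check (illustrative only; the 5% figure is just
the example from above and the helper is made up):

/* Sketch: treat two capacities as "equal enough" if within ~5%. */
#include <stdbool.h>
#include <stdio.h>

static bool caps_roughly_equal(unsigned long a, unsigned long b)
{
	unsigned long hi = a > b ? a : b;
	unsigned long lo = a > b ? b : a;

	return (hi - lo) * 100 <= hi * 5;
}

int main(void)
{
	/* ~3% apart -> 1, ~12% apart -> 0 */
	printf("%d %d\n", caps_roughly_equal(1024, 990),
			  caps_roughly_equal(1024, 900));
	return 0;
}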
Anyway, to provide something useful on this thread as well: testing
on our Raptor Lake with nosmt (=8+8) (note that this doesn't
necessarily carry over to the Lunar Lake platforms the series here
focuses on), I can reproduce the same efficiency gains of around
20-25% on common workloads, e.g. 20 iterations of 5 minutes of
Firefox 4K YouTube video playback (acquired via RAPL
power/energy-cores/, in Joules):
EAS:
628.6145 +-30.4479693342421
CAS:
829.172 +-29.422507961369337
(-24.2% energy used with EAS)
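(For anyone wanting to reproduce this without perf, below is a minimal
sketch of the kind of before/after readout of the RAPL counter that
produces numbers like the above. The powercap path is just what the
"core" domain happens to be on my machine, check the 'name' attributes
on yours; counter wraparound handling via max_energy_range_uj is
omitted for brevity.)

/*
 * Read the RAPL cores energy counter before and after running one
 * workload iteration and print the difference in Joules.
 */
#include <stdio.h>
#include <stdlib.h>

#define RAPL_CORES "/sys/class/powercap/intel-rapl:0:0/energy_uj"

static long long read_uj(const char *path)
{
	FILE *f = fopen(path, "r");
	long long uj = -1;

	if (!f) {
		perror(path);
		return -1;
	}
	if (fscanf(f, "%lld", &uj) != 1)
		fprintf(stderr, "failed to parse %s\n", path);
	fclose(f);
	return uj;
}

int main(int argc, char **argv)
{
	long long before, after;

	before = read_uj(RAPL_CORES);

	/* run one workload iteration, e.g. a playback script */
	if (argc > 1 && system(argv[1]) == -1)
		perror("system");

	after = read_uj(RAPL_CORES);

	printf("%.3f Joules\n", (after - before) / 1e6);
	return 0;
}

Something like ./rapl_wrap ./playback-iteration.sh then prints the
Joules for one run (both names are just placeholders).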
FWIW, Dietmar's patch adding a cpu_capacity sysfs attribute for the
intel_pstate setup path is pretty handy for testing at least; maybe
it could still be considered for upstream:
https://lore.kernel.org/lkml/91b37d34-6d9a-4623-87d8-0ff648ea2415@xxxxxxx/