Re: [PATCH] sched/fair: Fix frequency selection for non invariant case

From: Jon Hunter
Date: Wed Feb 14 2024 - 12:57:48 EST



On 14/02/2024 17:22, Vincent Guittot wrote:
> On Wed, 14 Feb 2024 at 18:20, Linus Torvalds
> <torvalds@xxxxxxxxxxxxxxxxxxxx> wrote:
>>
>> On Wed, 14 Feb 2024 at 09:12, Jon Hunter <jonathanh@xxxxxxxxxx> wrote:
>>>
>>> We have also observed a performance degradation on our Tegra platforms
>>> with v6.8-rc1. Unfortunately, the above change does not fix the problem
>>> for us and we are still seeing a performance issue with v6.8-rc4. For
>>> example, running Dhrystone on Tegra234 I am seeing the following ...
>>>
>>> Linux v6.7:
>>> [ 2216.301949] CPU0: Dhrystones per Second: 31976326 (18199 DMIPS)
>>> [ 2220.993877] CPU1: Dhrystones per Second: 49568123 (28211 DMIPS)
>>> [ 2225.685280] CPU2: Dhrystones per Second: 49568123 (28211 DMIPS)
>>> [ 2230.364423] CPU3: Dhrystones per Second: 49632220 (28248 DMIPS)
>>>
>>> Linux v6.8-rc4:
>>> [ 44.661686] CPU0: Dhrystones per Second: 16068483 (9145 DMIPS)
>>> [ 51.895107] CPU1: Dhrystones per Second: 16077457 (9150 DMIPS)
>>> [ 59.105410] CPU2: Dhrystones per Second: 16095436 (9160 DMIPS)
>>> [ 66.333297] CPU3: Dhrystones per Second: 16064000 (9142 DMIPS)
>>>
>>> If I revert this change and the following ...
>>>
>>> b3edde44e5d4 ("cpufreq/schedutil: Use a fixed reference frequency")
>>> f12560779f9d ("sched/cpufreq: Rework iowait boost")
>>> 9c0b4bb7f630 ("sched/cpufreq: Rework schedutil governor
>>>
>>> ... then the perf is similar to where it was ...
>>
>> Ok, guys, this whole scheduler / cpufreq rewrite seems to have been
>> completely buggered.
>>
>> Please tell me why we shouldn't just revert things as per above?
>>
>> Sure, the problem _I_ experienced is fixed, but apparently there are
>> others just lurking, and they are even bigger degradations than the
>> one I saw.
>>
>> We're now at rc4, we're not releasing a 6.8 with the above kinds of
>> numbers. So either there's another obvious one-liner fix, or we need
>> to revert this whole thing.
>
> This should fix it:
> https://lore.kernel.org/lkml/20240117190545.596057-1-vincent.guittot@xxxxxxxxxx/


Yes I can confirm that this does fix it ...

[ 29.440836] CPU0: Dhrystones per Second: 48340366 (27513 DMIPS)
[ 34.221323] CPU1: Dhrystones per Second: 48585127 (27652 DMIPS)
[ 38.988036] CPU2: Dhrystones per Second: 48667266 (27699 DMIPS)
[ 43.769430] CPU3: Dhrystones per Second: 48544161 (27629 DMIPS)

>> Yes, dhrystones is a truly crappy benchmark, but partly _because_ it's
>> such a horribly bad benchmark it's also a very simple case. It's pure
>> CPU load with absolutely nothing interesting going on. Regressing on
>> that by a factor of three is a sign of complete failure.
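
For what it's worth, the factor can be checked directly from the three
sets of numbers quoted above. The following is only a rough sketch in
Python, not part of any test suite: the helper names (parse_dhrystone,
ratios) and variable names are made up for illustration, and the input
is simply the log output pasted from this thread.

```python
import re

# Matches lines such as:
# [ 44.661686] CPU0: Dhrystones per Second: 16068483 (9145 DMIPS)
LINE_RE = re.compile(r"CPU(\d+): Dhrystones per Second: (\d+) \((\d+) DMIPS\)")


def parse_dhrystone(log: str) -> dict[int, int]:
    """Return {cpu index: Dhrystones per second} for every matching line."""
    return {int(m.group(1)): int(m.group(2)) for m in LINE_RE.finditer(log)}


def ratios(numer: dict[int, int], denom: dict[int, int]) -> dict[int, float]:
    """Per-CPU numer/denom for the CPUs present in both result sets."""
    return {cpu: numer[cpu] / denom[cpu] for cpu in sorted(numer) if cpu in denom}


# Log output copied from this thread (hypothetical variable names).
V6_7 = """
[ 2216.301949] CPU0: Dhrystones per Second: 31976326 (18199 DMIPS)
[ 2220.993877] CPU1: Dhrystones per Second: 49568123 (28211 DMIPS)
[ 2225.685280] CPU2: Dhrystones per Second: 49568123 (28211 DMIPS)
[ 2230.364423] CPU3: Dhrystones per Second: 49632220 (28248 DMIPS)
"""
V6_8_RC4 = """
[ 44.661686] CPU0: Dhrystones per Second: 16068483 (9145 DMIPS)
[ 51.895107] CPU1: Dhrystones per Second: 16077457 (9150 DMIPS)
[ 59.105410] CPU2: Dhrystones per Second: 16095436 (9160 DMIPS)
[ 66.333297] CPU3: Dhrystones per Second: 16064000 (9142 DMIPS)
"""
V6_8_RC4_FIXED = """
[ 29.440836] CPU0: Dhrystones per Second: 48340366 (27513 DMIPS)
[ 34.221323] CPU1: Dhrystones per Second: 48585127 (27652 DMIPS)
[ 38.988036] CPU2: Dhrystones per Second: 48667266 (27699 DMIPS)
[ 43.769430] CPU3: Dhrystones per Second: 48544161 (27629 DMIPS)
"""

baseline = parse_dhrystone(V6_7)
regressed = parse_dhrystone(V6_8_RC4)
fixed = parse_dhrystone(V6_8_RC4_FIXED)

# How many times slower v6.8-rc4 is than v6.7, per CPU.
slowdown = ratios(baseline, regressed)
# Throughput with the fix applied, as a fraction of v6.7.
recovery = ratios(fixed, baseline)

for cpu in sorted(slowdown):
    print(f"CPU{cpu}: v6.7 / v6.8-rc4 = {slowdown[cpu]:.2f}x, "
          f"fixed / v6.7 = {recovery[cpu]:.2f}")
```

On these numbers CPU1-3 come out at a little over 3x slower with
v6.8-rc4 and CPU0 at about 2x (its v6.7 result was already lower than
the others), and with the fix applied CPU1-3 are back to within a few
percent of v6.7.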


We have a few other, more extensive tests that have been failing due to
the perf issue. We will run those with the above fix applied, and if we
see any more issues I will let everyone know.

Thanks
Jon
