Re: [RFC PATCH 1/2] thermal/cpufreq_cooling: remove unused cpu_idx in get_load()
From: Lukasz Luba
Date: Thu Mar 26 2026 - 05:29:50 EST
On 3/26/26 09:05, Qais Yousef wrote:
On 03/24/26 10:46, Lukasz Luba wrote:
On 3/24/26 02:20, Xuewen Yan wrote:
On Mon, Mar 23, 2026 at 9:25 PM Lukasz Luba <lukasz.luba@xxxxxxx> wrote:
On 3/23/26 11:06, Viresh Kumar wrote:
On 23-03-26, 10:52, Lukasz Luba wrote:
How is that okay ? What am I missing ?
I was missing !SMP :)
Right, there is a mix of two things.
The 'i' was left in, but it should be removed as well, since
this is !SMP code with only one CPU and i is always 0.
That's also why we sent out patch 1/2; after all, it is always 0 on
!SMP systems.
The whole split that was made for getting the load or
utilization from the CPU(s) needs to be cleaned up.
The compiled code looks different since the compiler
knows a non-SMP config is used.
Right, we are allocating that for num_cpus (which should be 1 CPU
anyway). The entire thing needs to be cleaned up.
Do you want to clean that up, or should I do it?
It would be helpful if you can do it :)
OK, I will. Thanks for your involvement Viresh!
Xuewen, please wait with your v2; I will send
a redesign of this leftover code today.
Okay, and Qais's point is also worth considering: do we actually need
sched_cpu_util()?
The way I see it, generally speaking, the request_power derived from
idle_time might be higher than what we get from sched_cpu_util().
Take this scenario as an example:
Consider a CPU running at the lowest frequency with 50% idle time,
versus one running at the highest frequency with the same 50% idle
time.
In this case, using idle_time yields the same load value for both.
However, sched_cpu_util() would report a lower load when the CPU
frequency is low, resulting in a smaller request_power...
Invariance will cause the settling time to stretch longer, but it should
settle to the correct value eventually. Another argument against util is
that it has grown into a description of compute demand more than of true
idleness of the system.
Right, there are 2 things to consider:
1. what is the utilization when the CPU still has idle time, e.g.
this 50% that you mentioned
2. what is the utilization when there is no idle time and CPU
is fully busy (and starts throttling due to heat)
Hmm, I think what you're trying to say here is that we need to distinguish
between the two cases, 50% busy vs. fully busy? I think 'how idle is the
system' is a better question to ask than 'what is the utilization' (given
the ubiquity of the signal nowadays).
Yes, these two cases are different, and the util signal is not the
best fit for the idleness one.
In this thermal fwk we are mostly in the 2nd case. In that case the
But from the power allocator's perspective (which I think is the context,
right?), you want to know if you can shift power?
I would like to know the avg power in the last X ms window, then
allocate, shift, set.
utilization on the CPU's runqueue goes to 1024 no matter the CPU's frequency.
We know the highest frequency that was allowed and we pick the power
value for it from the EM. That's why the estimate is not that bad (apart
from power variation across different flavors of workloads: heavy SIMD vs.
normal integer/load).
In the 1st case we might underestimate the power, but that
is not the thermal stress situation anyway, so the max OPP is
still allowed.
So far it has been hard to find the best power model to use and robust CPU
load mechanisms. Adding more complexity and creating over-engineered
code in the kernel to maintain might not make sense.
Thermal control is mostly handled in firmware nowadays, since the
kernel can't react fast enough to some rapid changes.
We have to balance the complexity here.
I am not versed in all the details, so I'm not sure what complexity you are
referring to. IMHO the idle time is a more stable view of how much breathing
room the CPU has. It also deals better with the long decay of blocked load
over-estimating the utilization. AFAICS, just sample the idle time over a
window when you need to take a decision and you'd solve several problems in
one go.
We have issues estimating power in that X ms window due to fast
frequency changes. You know how often we can change the frequency:
almost at every task enqueue (and e.g. uclamp pushes that even harder).
The simple approach of assuming that the frequency we see now on the CPU
has been there for the whole X ms period is 'not the best'.
The util information without the uclamp information is not helping
much (even if we tried to derive the frequency from it).
Now it's even more complex: the FW can change the frequency far more
often than the kernel. So the question is how far we have to push
the whole kernel and these frameworks to deal with those new
platforms.
Then add the power variation due to different computation types,
e.g. a SIMD-heavy task vs. a simple logging task (high vs. low power
usage at the same OPP).
IMHO we have to find a balance, since even more complex models
in the kernel won't be able to handle that.
I have been experimenting with the Active Stats patch set for
quite a while, but then the FW came into the equation and complicated
the situation. It will still be better for platforms where the
FW doesn't change the frequency, so this approach based on
idle stats is worth adding IMO.