Re: [RFC PATCH 1/2] thermal/cpufreq_cooling: remove unused cpu_idx in get_load()
From: Qais Yousef
Date: Sat Mar 28 2026 - 04:13:47 EST
On 03/26/26 09:21, Lukasz Luba wrote:
>
>
> On 3/26/26 09:05, Qais Yousef wrote:
> > On 03/24/26 10:46, Lukasz Luba wrote:
> > >
> > > On 3/24/26 02:20, Xuewen Yan wrote:
> > > > On Mon, Mar 23, 2026 at 9:25 PM Lukasz Luba <lukasz.luba@xxxxxxx> wrote:
> > > > >
> > > > >
> > > > >
> > > > > On 3/23/26 11:06, Viresh Kumar wrote:
> > > > > > On 23-03-26, 10:52, Lukasz Luba wrote:
> > > > > > > > How is that okay? What am I missing?
> > > > > >
> > > > > > I was missing !SMP :)
> > > > > >
> > > > > > > Right, there is a mix of two things.
> > > > > > > The 'i' was left but should be removed as well,
> > > > > > > since this is !SMP code with only one CPU and i=0.
> > > >
> > > > That's also why we sent out patch 1/2; after all, it is always 0 on
> > > > !SMP systems.
> > > >
> > > > > > >
> > > > > > > The whole split that was made for getting
> > > > > > > the load or utilization from the CPU(s) needs
> > > > > > > to be cleaned up. The compiled code looks
> > > > > > > different since the compiler knows a non-SMP
> > > > > > > config is in use.
> > > > > >
> > > > > > Right, we are allocating that for num_cpus (which should be 1 CPU
> > > > > > anyway). The entire thing needs to be cleaned up.
> > > > > >
> > > > > > > Do you want to clean that or I should do this?
> > > > > >
> > > > > > It would be helpful if you can do it :)
> > > > > >
> > > > >
> > > > > OK, I will. Thanks for your involvement Viresh!
> > > > >
> > > > > Xuewen, please hold off on your v2; I will send
> > > > > a redesign of this leftover code today.
> > > >
> > > > Okay, and Qais's point is also worth considering: do we actually need
> > > > sched_cpu_util()?
> > > > The way I see it, generally speaking, the request_power derived from
> > > > idle_time might be higher than what we get from sched_cpu_util().
> > > > Take this scenario as an example:
> > > > Consider a CPU running at the lowest frequency with 50% idle time,
> > > > versus one running at the highest frequency with the same 50% idle
> > > > time.
> > > > In this case, using idle_time yields the same load value for both.
> > > > However, sched_cpu_util() would report a lower load when the CPU
> > > > frequency is low. This results in a smaller request_power...
> >
> > Frequency invariance will stretch the settling time, but the signal should
> > settle to the correct value eventually. More generally, another argument
> > against util is that it has grown into a description of compute demand
> > rather than of the true idleness of the system.
> >
> > >
> > > Right, there are 2 things to consider:
> > > 1. what the utilization is when the CPU still has idle time, e.g.
> > > the 50% that you mentioned
> > > 2. what the utilization is when there is no idle time and the CPU
> > > is fully busy (and starts throttling due to heat)
> >
> > Hmm, I think what you're trying to say here is that we need to distinguish
> > between two cases, 50% busy or fully busy? I think how idle the system is
> > is a better question to ask than what the utilization is (given how
> > ubiquitous the util signal has become nowadays)
>
> Yes, these two cases are different, and the util signal is not the
> best fit for the idleness one.
>
>
> >
> > >
> > > In this thermal fwk we are mostly in the 2nd case. In that case the
> >
> > But from power allocator perspective (which I think is the context, right?),
> > you want to know if you can shift power?
>
> I would like to know the avg power in the last X ms window, then
> allocate, shift, set.
>
> >
> > > utilization on the CPU's runqueue goes to 1024 no matter the CPU's
> > > frequency. We know the highest frequency that was allowed to run and we
> > > pick the power value for it from the EM. That's why the estimation is not
> > > that bad (apart from the power variation across different flavors of
> > > workloads: heavy SIMD vs. normal integer/load).
> > >
> > > In the 1st case we might underestimate the power, but that
> > > is not a thermal stress situation anyway, so the max OPP is
> > > still allowed.
> > >
> > > So far it has been hard to find the best power model to use and a robust
> > > CPU load mechanism. Adding more complexity and creating
> > > over-engineered code to maintain in the kernel might not make sense.
> > > Thermal control is mostly handled in firmware nowadays, since the
> > > kernel cannot react fast enough to some rapid changes.
> > >
> > > We have to balance the complexity here.
> >
> > I am not versed in all the details, so I'm not sure what complexity you are
> > referring to. IMHO the idle time is a more stable view of how much breathing
> > room the CPU has. It also deals better with the long decay of blocked load
> > over-estimating the utilization. AFAICS, just sample the idle time over a
> > window when you need to take a decision and you'd solve several problems in
> > one go.
>
> We have issues estimating power in that X ms window due to fast
> frequency changes. You know how often we can change the frequency:
> almost on every task enqueue (and e.g. uclamp pushes that even harder).
>
> The simple approach of assuming that the frequency we see on the CPU
> now has been there for the whole X ms period is 'not the best'.
> The util information without the uclamp information does not help
> much (even if we tried to derive the frequency from it).
>
> Now it gets even more complex: the FW can change the frequency far
> more often than the kernel can. So the question is how far we have
> to push the whole kernel and these frameworks to deal with those
> new platforms.
>
> Then add the power variation due to different computation types,
> e.g. a SIMD-heavy task vs. a simple logging task (high vs. low
> power usage at the same OPP).
>
> IMHO we have to find a balance, since even more complex models
> in the kernel won't be able to handle that.
>
> I have been experimenting with the Active Stats patch set for
> quite a while, but then the FW came into the equation and complicated
> the situation. It will still be better for platforms where the
> FW doesn't change the frequency, so this approach based on
> idle stats is worth adding IMO.
Hmm, isn't this orthogonal? It seems CPU util is used today to estimate how busy
(or idle) the CPU is. You can decouple the dependency on util (and on the
scheduler in general) and monitor system idleness directly.
Anyway. Please don't add this ifdeffery and the strange dependencies on scx.
This is a recipe for shooting ourselves in the foot.