Re: [RFC PATCH 0/7] sched: cpufreq: Remove magic margins

From: Peter Zijlstra
Date: Fri Sep 08 2023 - 06:26:42 EST


On Fri, Sep 08, 2023 at 01:17:25AM +0100, Qais Yousef wrote:

> Just to be clear, my main issue here is with the current hardcoded values of
> the 'margins'. And the fact they go too fast is my main problem.

So I stripped the whole margin thing from my reply because I didn't want
to comment on that yet, but yes, I can see how those might be a problem,
and you're changing them into something dynamic, not just removing them.

The tunables are what I worry most about. The moment we expose knobs it
becomes really hard to change things later.

> UTIL_EST_FASTER moves in one direction. And it's a constant response too, no?

The idea of UTIL_EST_FASTER was that we run a PELT sum on the current
activation runtime (all runtime since wakeup) and take the max of this
extra sum and the regular thing.

On top of that, this extra PELT sum can have a time multiplier and thus
ramp up faster (this multiplier could be per task). Nb.:

	util_est_fast = faster_est_approx(delta * 2);

is a state-less expression -- by making

	util_est_fast = faster_est_approx(delta * curr->se.faster_mult);

only the current task is affected.

> I didn't get the per-task configurability part. AFAIU we can't turn off these
> sched-features if they end up causing power issues. That's what makes me
> hesitant about them.

See above, the extra sum is (fundamentally) per task, the multiplier
could be per task, if you set the multiplier to <=1, you'll never gain on
the existing sum and the max filter means the feature is effectively
disabled for that one task.

It of course gets us the problem of how to set the new multiplier... ;-)
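
A minimal userspace sketch of the extra sum and the max filter described
above. It is not the actual patch: pelt_ramp() and util_est_faster() are
hypothetical names, faster_est_approx() here is only a stand-in for the real
helper, and PELT is simplified to one decay step per millisecond of
continuous runtime.

```c
#define UTIL_MAX 1024.0     /* full-scale utilization */
#define PELT_Y   0.97857206 /* per-ms decay factor, so that y^32 ~= 0.5 */

/*
 * Hypothetical helper: value a PELT-style sum reaches after running
 * continuously for delta_ms, starting from util.
 */
static double pelt_ramp(double util, unsigned int delta_ms)
{
	for (unsigned int i = 0; i < delta_ms; i++)
		util = util * PELT_Y + UTIL_MAX * (1.0 - PELT_Y);

	return util;
}

/* Stand-in for faster_est_approx(): the extra sum starts from 0 at wakeup. */
static double faster_est_approx(unsigned int delta_ms)
{
	return pelt_ramp(0.0, delta_ms);
}

/*
 * util_at_wakeup: the regular sum when the task woke up (assumed input).
 * delta_ms:       runtime since wakeup.
 * faster_mult:    per-task time multiplier (assumed to be an integer).
 */
static double util_est_faster(double util_at_wakeup, unsigned int delta_ms,
			      unsigned int faster_mult)
{
	double regular = pelt_ramp(util_at_wakeup, delta_ms);
	double fast = faster_est_approx(delta_ms * faster_mult);

	return fast > regular ? fast : regular; /* the max filter */
}
```

With faster_mult = 2 the extra sum reaches about half of full scale after
16ms of runtime instead of 32ms. With faster_mult <= 1 it never exceeds the
regular sum, so the max simply returns the regular value.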

> There's a bias towards perf. But some systems prefer to save power
> at the expense of perf. There's a lot of grey areas in between to what
> perceived as a suitable trade-off for perf vs power. There are cases like above
> where actually you can lower freqs without hit on perf. But most of the time
> it's a trade-off; and some do decide to drop perf in favour of power. Keep in
> mind battery capacity differs between systems with the same SoC even. Some ship
> to enable more perf, others are more constrained and opt to be more efficient.

It always depends on the workload too -- you want different trade-offs
for different tasks.

> > I'm *really* hesitant on adding all these mostly random knobs -- esp.
> > without strong justification -- which you don't present. You mostly seem
> > to justify things with: people do random hack, we should legitimize them
> > hacks.
>
> I share your sentiment and I am trying to find out what's the right thing to do
> really. I am open to explore other territories. But from what I see there's
> a real need to give users the power to tune how responsive their system needs
> to be. I can't see how we can have one size that fits all here given the
> different capabilities of the systems and the desired outcome (I want more perf
> vs more efficiency).

This is true; but we also cannot keep adding random knobs. Knobs that
are very specific are hard constraints we've got to live with. Take for
instance uclamp, that's not something we can ever get rid of I think
(randomly picking on uclamp, not saying I'm hating on it).

From an actual interface POV, the unit-less generic energy-vs-perf knob
is of course ideal, one global and one per task and then we can fill out
the details as we see fit. System integrators (you say users, but
really, not a single actual user will use any of this) can muck about
and see what works for them.

(even hardware has these things today, we get 0-255 values that do
'something' uarch specific)

> The problem is that those 0.8 and 1.25 margins force an unsuitable default.
> The case I see the most is that they waste power, and tuning them down
> regains this power at no perf cost or a small one. Others actually do tune it for faster
> response, but I can't cover this case in detail. All I know is lower end
> systems do struggle as they don't have enough oomph. I also saw comparison on
> phoronix where schedutil is not doing as good still - which tells me it seems
> server systems do prefer to ramp up faster too. I think that PELT thread is
> a variation of the same problem.
>
> So one of the things I saw is a workload where it spends majority of the time
> in 600-750 util_avg range. Rarely ramps up to max. But the workload under uses
> the medium cores and runs at a lot higher freqs than it really needs on bigs.
> We don't end up utilizing our resources properly.

So that is actually a fairly solid argument for changing things up, if
the margin causes us to neglect mid cores then that needs fixing. But I
don't think that means we need a tunable. After all, the system knows it
has 3 capacities, it just needs to be better at mapping workloads to
them.

It knows how much 'room' there is between a mid and a large. If
1.25*mid > large, we're in trouble, etc.
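
A rough userspace sketch of the two margins under discussion. The arithmetic
follows the fits_capacity() and map_util_perf() helpers as they stood in the
kernel around the time of this thread; mid_headroom_exceeds_big() and all
capacity values used with it are made up for illustration.

```c
#include <stdbool.h>

/* ~0.8 margin: util only "fits" a CPU below 80% of that CPU's capacity. */
static bool fits_capacity(unsigned long util, unsigned long capacity)
{
	return util * 1280 < capacity * 1024;
}

/* 1.25 margin: headroom added to util before a frequency is chosen. */
static unsigned long map_util_perf(unsigned long util)
{
	return util + (util >> 2);
}

/*
 * Hypothetical check: does a fully used mid CPU, with headroom applied,
 * ask for more than a big CPU can deliver?
 */
static bool mid_headroom_exceeds_big(unsigned long mid_cap,
				     unsigned long big_cap)
{
	return map_util_perf(mid_cap) > big_cap;
}
```

Assuming a mid CPU of capacity 768, a task at util 600 still fits, but one at
650 does not and is pushed to a big CPU even though the mid has room left.
That is the 600-750 range from the workload described above.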

> There's a question that I'm struggling with if I may ask. Why is it perceived
> our constant response time (practically ~200ms to go from 0 to max) as a good
> fit for all use cases? Capability of systems differs widely in terms of what
> performance you get at say a util of 512. Or in other words how much work is
> done in a unit of time differs between system, but we still represent that work
> in a constant way. A task ran for 10ms on powerful System A would have done
> a lot more work than running on poor System B for the same 10ms. But util will
> still rise the same for both cases. If someone wants to allow this task to be
> able to do more on those 10ms, it seems natural to be able to control this
> response time. It seems this thinking is flawed for some reason and I'd
> appreciate a help to understand why. I think a lot of us perceive this problem
> this way.

I think part of the problem is that today's servers are tomorrow's
smartphones. Back when we started all this PELT nonsense computers in
general were less powerful than they are now, yet today's servers are no
less busy than they were back then.

Give us compute, we'll fill it.

Now, smartphones in particular are media devices, but a large part of
the server farms are indirectly interactive too, you don't want your
search query to take too long, or your bookface page stuck loading, or
your twatter message about your latest poop not being insta read by your
mates.

That is, much of what we do with the computers, ever more powerful or
not, is eventually measured in human time perception.

So yeah, 200ms.

Remember, all this PELT nonsense was created for cgroups, to distribute
shares between CPUs as load demanded. I think for that purpose it still
sorta makes sense.

Ofc we've added a few more users over time, because if you have this
data, might as well use it etc. I'm not sure we really sat down and
analyzed if the timing all made sense.

And as I argued elsewhere, PELT is a running average, but DVFS might be
better suited with a max filter.
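
To illustrate the difference, a small sketch comparing a running average
with a windowed max over the same utilization samples. Both functions are
hypothetical, the 1/4 averaging weight is arbitrary, and "max over the last
N samples" is only one possible reading of a max filter.

```c
#include <stddef.h>

/* Running average: each new sample gets 1/4 weight, older history decays. */
static unsigned long running_avg(const unsigned long *samples, size_t n)
{
	unsigned long avg = 0;

	for (size_t i = 0; i < n; i++)
		avg = (3 * avg + samples[i]) / 4;

	return avg;
}

/* Max filter: largest of the last 'window' samples. */
static unsigned long window_max(const unsigned long *samples, size_t n,
				size_t window)
{
	size_t start = n > window ? n - window : 0;
	unsigned long max = 0;

	for (size_t i = start; i < n; i++)
		if (samples[i] > max)
			max = samples[i];

	return max;
}
```

For samples { 100, 900, 100, 100 } the running average ends well below the
900 spike, while a 4-sample max filter still reports 900 until the spike
ages out of the window.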

> Happy to go and try to dig more info if this is still not clear enough.

So I'm not generally opposed to changing things -- but I much prefer to
have an actual problem driving that change :-)