Re: [RFC PATCH 0/7] sched: cpufreq: Remove magic margins

From: Qais Yousef
Date: Sun Sep 10 2023 - 15:06:29 EST


On 09/07/23 15:42, Lukasz Luba wrote:
>
>
> On 9/7/23 15:29, Peter Zijlstra wrote:
> > On Thu, Sep 07, 2023 at 02:57:26PM +0100, Lukasz Luba wrote:
> > >
> > >
> > > On 9/7/23 14:26, Peter Zijlstra wrote:
> > > > On Wed, Sep 06, 2023 at 10:18:50PM +0100, Qais Yousef wrote:
> > > >
> > > > > This is probably controversial statement. But I am not in favour of util_est.
> > > > > I need to collect the data, but I think we're better with 16ms PELT HALFLIFE as
> > > > > default instead. But I will need to do a separate investigation on that.
> > > >
> > > > I think util_est makes perfect sense, where PELT has to fundamentally
> > > > decay non-running / non-runnable tasks in order to provide a temporal
> > > > average, DVFS might be best served with a termporal max filter.
> > > >
> > > >
> > >
> > > Since we are here...
> > > Would you allow to have a configuration for
> > > the util_est shifter: UTIL_EST_WEIGHT_SHIFT ?
> > >
> > > I've found other values than '2' better in some scenarios. That helps
> > > to prevent a big task to 'down' migrate from a Big CPU (1024) to some
> > > Mid CPU (~500-700 capacity) or even Little (~120-300).
> >
> > Larger values, I'm thinking you're after? Those would cause the new
> > contribution to weight less, making the function more smooth, right?
>
> Yes, more smooth, because we only use the 'ewma' goodness for decaying
> part (not the raising [1]).
>
> >
> > What task characteristic is tied to this? That is, this seems trivial to
> > modify per-task.
>
> In particular Speedometer test and the main browser task, which reaches
> ~900util, but sometimes vanish and waits for other background tasks
> to do something. In the meantime it can decay and wake-up on
> Mid/Little (which can cause a penalty to score up to 5-10% vs. if
> we pin the task to big CPUs). So, a longer util_est helps to avoid
> at least very bad down migration to Littles...

Warning, this is not a global win! We do want tasks in general to downmigrate
when they sleep. Would be great to avoid biasing towards perf first by default
to fix these special cases.

As I mentioned in other reply, there's a perf/power/thermal impact of these
decisions and it's not a global win. Some might want this to improve their
scores, others might not want that and rather get the worse score but keep
their power budget in check. And it will highly depend on the workload and the
system. Which we can't test everyone of them :(

We did give the power to userspace via uclamp which should make this problem
fixable. And this is readily available. We can't basically know in the kernel
when one way is better than the other without being told explicitly IMHO.

If you try to boot with faster PELT HALFLIFE, would this give you the same
perf/power trade-off?


Thanks

--
Qais Yousef

>
> [1] https://elixir.bootlin.com/linux/v6.5.1/source/kernel/sched/fair.c#L4442