Re: [PATCH 3/5] sched/fair: Rework feec() to use cost instead of spare capacity

From: Vincent Guittot
Date: Thu Sep 12 2024 - 08:23:01 EST


On Wed, 11 Sept 2024 at 16:03, Pierre Gondois <pierre.gondois@xxxxxxx> wrote:
>
> Hello Vincent,
>
> On 8/30/24 15:03, Vincent Guittot wrote:
> > feec() looks for the CPU with highest spare capacity in a PD assuming that
> > it will be the best CPU from a energy efficiency PoV because it will
> > require the smallest increase of OPP. Although this is true generally
> > speaking, this policy also filters some others CPUs which will be as
> > efficients because of using the same OPP.
> > In fact, we really care about the cost of the new OPP that will be
> > selected to handle the waking task. In many cases, several CPUs will end
> > up selecting the same OPP and as a result using the same energy cost. In
> > these cases, we can use other metrics to select the best CPU for the same
> > energy cost.
> >
> > Rework feec() to look 1st for the lowest cost in a PD and then the most
> > performant CPU between CPUs.
> >
> > Signed-off-by: Vincent Guittot <vincent.guittot@xxxxxxxxxx>
> > ---
> > kernel/sched/fair.c | 466 +++++++++++++++++++++++---------------------
> > 1 file changed, 244 insertions(+), 222 deletions(-)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index e67d6029b269..2273eecf6086 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
>
> [snip]
>
> >
> > -/*
> > - * compute_energy(): Use the Energy Model to estimate the energy that @pd would
> > - * consume for a given utilization landscape @eenv. When @dst_cpu < 0, the task
> > - * contribution is ignored.
> > - */
> > -static inline unsigned long
> > -compute_energy(struct energy_env *eenv, struct perf_domain *pd,
> > - struct cpumask *pd_cpus, struct task_struct *p, int dst_cpu)
> > +/*Check if the CPU can handle the waking task */
> > +static int check_cpu_with_task(struct task_struct *p, int cpu)
> > {
> > - unsigned long max_util = eenv_pd_max_util(eenv, pd_cpus, p, dst_cpu);
> > - unsigned long busy_time = eenv->pd_busy_time;
> > - unsigned long energy;
> > + unsigned long p_util_min = uclamp_is_used() ? uclamp_eff_value(p, UCLAMP_MIN) : 0;
> > + unsigned long p_util_max = uclamp_is_used() ? uclamp_eff_value(p, UCLAMP_MAX) : 1024;
> > + unsigned long util_min = p_util_min;
> > + unsigned long util_max = p_util_max;
> > + unsigned long util = cpu_util(cpu, p, cpu, 0);
> > + struct rq *rq = cpu_rq(cpu);
> >
> > - if (dst_cpu >= 0)
> > - busy_time = min(eenv->pd_cap, busy_time + eenv->task_busy_time);
>
> I think you should mention that the energy computation is not capped anymore.
> It used to be:
> pd_util: sum of the CPU's util for the pd considered, without task_util
> pd_cap: sum of the CPU's capacity for the pd
>
> (a)
> busy_time = min(pd_cap, pd_util);
> prev_energy = busy_time * OPP[prev_max_util].cost;
>
> busy_time = min(pd_cap, pd_util + task_util);
> new_energy = busy_time * OPP[new_max_util].cost;
>
> delta_energy = new_energy - prev_energy;
>
> Note that when computing busy_time, task_util is not capped to one CPU's
> max_cap. This means that in empty pd, a task can 'steal' capacity from
> CPUs during the energy computation.
> Cf. [1]
>
> and it is now:
> (b)
> delta_energy = task_util * OPP[new_max_util].cost;
> delta_energy += pd_util * (OPP[new_max_util].cost - OPP[prev_max_util].cost);
>
> Note that the busy_time (task_util + pd_util) is now not capped by anything.

As discussed in [1], capping utilization with max capacity is a
mistake because we lost the information that this will run longer

>
> ---
>
> Not capping the task_util is discussed in [1][3] and [2] (at 21:15).
> IIUC, UCLAMP_MAX tasks are the only case leveraging this. Indeed,
> non-clamped tasks will not fit and be placed on a bigger CPU. Same for
> UCLAMP_MIN tasks.
> FWIU, not capping the utilization of tasks during the energy computation
> allows to reflect that a task will (likely) run longer. However the
> task's performance would likely decrease as the other tasks on the target
> CPU are not taken into account (it is assumed the task to be placed will
> receive the compute power it requires).
>
> ---
> Case A:
>
> Assuming we have an empty system with:
> - 4 little CPUs (max_capa=512, first OPP as [capa=256, cost=10])
> - 2 big CPUs (max_capa=1024, first OPP as [capa=512, cost=10])

Quite an inefficient hardware design here where the big core provides
twice more capacity for the same cost of the little for their 1st OPP
!!!

> i.e. the first OPP of all the CPU consumes the same amount of energy.
> And a task with: [UCLAMP_MIN=0, UCLAMP_MAX=10, util = 1000]
>
> Then feec() would have no reason to prefer a big CPU over a little CPU,
> even though the big CPU would provide more performance.

As mentioned in the comment of feec(), I don't expect to face a
situation where the delta energy is equal for 2 PDs especially with
the uWatt that has been introduced to prevent such situation. But if
this should happen, it is in the TODO to add margin and to take other
stats into account like compute capacity. Also, the PDs are sorted by
max capacity so we currently keep the one with highest capacity ie big
CPU in your case

>
> ---
> Case B:
>
> (This is not especially related to this patch.)
> Another case that might be problematic:
> - 4 little CPUs (max_capa=512, first OPP as [capa=256])
> - 2 big CPUs (max_capa=1024, first OPP as [capa=512])
> - little CPUs consume less than big CPUs, but the highest OPP
> of the little CPUs consume more than the lowest of the big CPUs.
> And tasks:
> - 3 tasks with [UCLAMP_MIN=0, UCLAMP_MAX=10, util = 1000]
> - 1 task with [UCLAMP_MIN=0, UCLAMP_MAX=1024, util = 50]
>
> Then
> - the 3 UCLAMP_MAX tasks will be placed on the little CPUs. Indeed,
> due to the last patch of the serie, these tasks have now an opportunity
> to run feec() and be placed on a more energy efficient CPU.
> - the 'normal' task will be placed on a big CPU. Indeed, placing
> it on a little CPU would raise the OPP of the little cluster.
>
> This means that the 'normal' task is prevented to run the remaining little
> CPU even though:
> - the little CPU can provide the compute capacity
> - the little CPU would consume less energy
>
> In other terms, using UCLAMP_MAX on some tasks makes the system consume
> more energy.

You have probably noticed that this patchset doesn't make any
assumption about uclamp_max/min behavior and the below doesn't seem to
be related to this patchset but to the uclamp_max behavior so I don't
think it's the right place to discuss this. A talk or a BoF at LPC
would have been a better place

>
> ---
>
> In my opinion, this last case comes from the difficulty of defining UCLAMP_MAX.
> From sched-util-clamp.rst (about UCLAMP_MAX):
> - Uclamp is a hinting mechanism that allows the scheduler to understand the
> performance requirements and restrictions of the tasks
> - The right way to view util clamp is as a mechanism to make request or hint on
> performance constraints.
> - some tasks should be restricted from consuming too
> much resources and should not go above a specific performance point.
> -
> Another example is in Android where tasks are classified as background,
> foreground, top-app, etc. Util clamp can be used to constrain how much
> resources background tasks are consuming by capping the performance point they
> can run at. This constraint helps reserve resources for important tasks, like
> the ones belonging to the currently active app (top-app group). Beside this
> helps in limiting how much power they consume. This can be more obvious in
> heterogeneous systems (e.g. Arm big.LITTLE); the constraint will help bias the
> background tasks to stay on the little cores which will ensure that:
>
> 1. The big cores are free to run top-app tasks immediately. top-app
> tasks are the tasks the user is currently interacting with, hence
> the most important tasks in the system.
> 2. They don't run on a power hungry core and drain battery even if they
> are CPU intensive tasks.
> -
> For example, it can be handy to limit performance when running low on battery
> or when the system wants to limit access to more energy hungry performance
> levels when it's in idle state or screen is off.
>
> """
> This constraint helps reserve resources for important tasks, like
> the ones belonging to the currently active app (top-app group).
> """
> It doesn't seem that UCLAMP_MAX does this. This looks more like bandwidth
> control.
>
> """
> They don't run on a power hungry core and drain battery even if they
> are CPU intensive tasks.
> """
> Avoiding mid/big CPUs could be done with tasksets,
>
> I can understand that one might want to run a task at a higher OPP due to
> timing constraints for instance. However I don't see why someone would want
> to run a task at a lower OPP, regarding only the performance and not the
> energy consumption. It thus means that UCLAMP_MAX is an energy hint rather
> of a performance hint.
>
> UCLAMP_MAX could be set for a task to make it spend less energy, but the
> loss in performance would be unknown.
> A case Hongyan mentioned in his uclamp sum aggregation serie [4] is that
> an infinite task with [UCLAMP_MIN=0, UCLAMP_MAX=1, util = 1000] could fit
> in a little CPU. The energy consumption would indeed be very low, but the
> performance would also be very low.
>
> With Hongyan's sum aggregation serie [5]:
> - case B would not happen as the 'normal' task would not raise the OPP of
> the whole cluster.
> - the 'infinite UCLAMP_MAX tasks' case would not happen as each task would
> account for 1 util
> - case A would still happen, but could be solved in any case.
>
> I know Hongyan's patchset has already been discussed, but I still don't
> understand why max aggregation is preferred over sum aggregation.
> The definition of UCLAMP_MAX values seems clearer and in effect results in
> a simpler implementation and less corner cases. In simple words:
> "When estimating the CPU frequency to set, for this task,
> account for at most X util."
> rather than:
> "When estimating the CPU frequency to set, the task with the highest
> UCLAMP_MAX of the CPU will cap the requested frequency."
>
> Note that actually I think it's a good idea to run feec() regularly
> and to take into account other parameters like nr_running. I just think
> that UCLAMP_MAX's max aggregation creates corner cases that are difficult
> to solve altogether.
>
> Thanks in advance for the time you will take answering,
> Regards,
> Pierre
>
> [1] https://lore.kernel.org/all/20240606070645.3295-1-xuewen.yan@xxxxxxxxxx/
> [2] https://www.youtube.com/watch?v=PHEBAyxeM_M
> [3] https://lore.kernel.org/all/CAKfTPtDPCPYvCi1c_Nh+Cn01ZVS7E=tAHQeNX-mArBt3BXdjYw@xxxxxxxxxxxxxx/
> [4] https://lore.kernel.org/all/b81a5b1c-14de-4232-bee9-ee647355dd8c@xxxxxxx/
> [5] https://lore.kernel.org/all/cover.1706792708.git.hongyan.xia2@xxxxxxx/#t