Re: [RFC PATCH 4/5] sched/fair: Use EAS also when overutilized
From: Vincent Guittot
Date: Tue Oct 01 2024 - 12:24:01 EST
On Thu, 26 Sept 2024 at 11:10, Quentin Perret <qperret@xxxxxxxxxx> wrote:
>
> Hi Vincent,
>
> On Wednesday 25 Sep 2024 at 15:27:45 (+0200), Vincent Guittot wrote:
> > On Fri, 20 Sept 2024 at 18:17, Quentin Perret <qperret@xxxxxxxxxx> wrote:
> > >
> > > Hi Vincent,
> > >
> > > On Friday 30 Aug 2024 at 15:03:08 (+0200), Vincent Guittot wrote:
> > > > Keep looking for an energy-efficient CPU even when the system is
> > > > overutilized and use the CPU returned by feec() if it has been able to find
> > > > one. Otherwise fall back to the default performance and spread mode of the
> > > > scheduler.
> > > > A system can become overutilized for a short time when workers of a
> > > > workqueue wake up for a short background job like a vmstat update.
> > > > Continuing to look for an energy-efficient CPU prevents breaking the
> > > > power-aware packing of tasks.
> > > >
> > > > Signed-off-by: Vincent Guittot <vincent.guittot@xxxxxxxxxx>
> > > > ---
> > > > kernel/sched/fair.c | 2 +-
> > > > 1 file changed, 1 insertion(+), 1 deletion(-)
> > > >
> > > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > > index 2273eecf6086..e46af2416159 100644
> > > > --- a/kernel/sched/fair.c
> > > > +++ b/kernel/sched/fair.c
> > > > @@ -8505,7 +8505,7 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
> > > > cpumask_test_cpu(cpu, p->cpus_ptr))
> > > > return cpu;
> > > >
> > > > - if (!is_rd_overutilized(this_rq()->rd)) {
> > > > + if (sched_energy_enabled()) {
> > >
> > > As mentioned during LPC, when there is no idle time on a CPU, the
> > > utilization value of the tasks running on it is no longer a good
> > > approximation for how much the tasks want, it becomes an image of how
> > > much CPU time they were given. That is particularly problematic in the
> > > co-scheduling case, but not just.
> >
> > Yes, but this is not always true when overutilized; it only becomes
> > true after a certain amount of time. When a CPU is fully utilized,
> > with no idle time left, feec() will not find a CPU for the task
>
> Well the problem is that it might actually find a CPU for the task -- a
> co-scheduled task can obviously look arbitrarily small from a util PoV.
With commit 50181c0cff31 ("sched/pelt: Avoid underestimation of task
utilization"), util_est remains set to the value it had before the task
had to share the CPU with other tasks, which means that util_est remains
correct even if util_avg decreases because of sharing the CPU with other
tasks. This was done to cover the cases that you mention above, where
both util_avg and util_est were decreasing when tasks start to share the
CPU bandwidth with others.
>
> > >
> > > IOW, when we're OU, the util values are bogus, so using feec() is frankly
> > > wrong IMO. If we don't have a good idea of how long tasks want to run,
> >
> > Except that CPUs are not already fully busy, with no idle time, when
> > the system is overutilized. We have a ~20% margin on each CPU, which
> > means the system is flagged overutilized as soon as one CPU is more
> > than 80% utilized, which is far from having no idle time anymore. So
> > even when OU, it doesn't mean that all CPUs are out of idle time;
> > most of the time the opposite is true and feec() can still make a
> > useful decision.
>
> My problem with the proposed change here is that it doesn't at all
> distinguish between the truly overloaded case (when we have more compute
> demand that resources) from a system with a stable-ish utilization at
> 90%. If you're worried about the latter, then perhaps we should think
> about redefining the OU threshold some other way (either by simply
> making higher or configurable, or changing its nature to look at the
We could definitely increase the OU threshold, but we would still have
cases with a truly overutilized CPU where the utilization values are
still correct.
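For reference, a hedged sketch of the ~20% margin discussed above,
modeled on the kernel's fits_capacity() test (the 1280/1024 ratio is
the real kernel constant; the surrounding harness is simplified):

```c
#include <assert.h>
#include <stdbool.h>

/* A CPU "fits" its load while util stays below ~80% of capacity
 * (util * 1.25 < capacity, expressed as 1280/1024 to stay in integer
 * arithmetic). The system is flagged overutilized as soon as one CPU
 * fails this test, well before it actually runs out of idle time. */
static bool fits_capacity(unsigned long util, unsigned long capacity)
{
	return util * 1280 < capacity * 1024;
}
```

This is why OU triggers around 80% utilization: the flag fires long
before idle time is actually exhausted.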
> last time we actually got idle time in the system). But I'm still rather
> opinionated that util-based placement is wrong for the former.
And feec() will return -1 for that case because util_est remains high
>
> And for what it's worth, in my experience if any of the big CPUs get
> anywhere near the top of their OPP range, given that the power/perf
> curve is exponential it's being penny-wise and pound-foolish to
> micro-optimise the placement of the other smaller tasks from an energy
> PoV at the same time. But if we can show that it helps real use-cases,
> then why not.
Thermal mitigation and/or the power budget policy quickly reduce the
max compute capacity of such big CPUs, which then become overutilized
at a lower OPP; this reduces the difference between big/medium/little.
>
> > Also, when there is no idle time on a CPU, the task doesn't fit and
> > feec() doesn't return a CPU.
>
> It doesn't fit on that CPU but might still (incorrectly) fit on another
> CPU right?
The commit that I mentioned above covers those cases: the task will not
incorrectly fit on another, smaller CPU because its util_est is
preserved during the overutilized phase.
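A minimal sketch (hypothetical, not the actual feec() code; real feec()
evaluates energy models rather than taking the first fitting CPU) of
the behaviour being claimed here: with util_est preserved, a task that
no longer fits anywhere makes the search return -1, and the caller
falls back to the default performance path.

```c
#include <assert.h>

/* Toy stand-in for feec(): scan candidate CPUs and return the index of
 * one the task fits on (same ~20% headroom test as fits_capacity()),
 * or -1 when no CPU fits and the caller must fall back to the default
 * performance/spread path. The real feec() picks the lowest-energy
 * fitting CPU, not the first one. */
static int feec_sketch(unsigned long task_util_est,
		       const unsigned long *capacities, int n)
{
	for (int i = 0; i < n; i++)
		if (task_util_est * 1280 < capacities[i] * 1024)
			return i;
	return -1;
}
```

Because the preserved util_est stays high, a co-scheduled task is
rejected by the little CPUs instead of "incorrectly fitting" there.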
>
> > Then, the old way to compute invariant utilization was particularly
> > sensitive to the overutilized state because the utilization was capped
> > and asymptotically converged to the max CPU compute capacity, but this
> > is not true with the new PELT: we can go above the compute capacity of
> > the CPU and remain correct as long as we are able to increase the
> > compute capacity before there is no idle time left. In theory, the
> > utilization "could" be correct until we reach 1024 (for utilization or
> > runnable), at which point there is no way to catch up with the
> > temporary lack of compute capacity.
> >
> > > the EM just can't help us with anything so we should stay away from it.
> > >
> > > I understand how just plain bailing out as we do today is sub-optimal,
> > > but whatever we do to improve on that can't be doing utilization-based
> > > task placement.
> > >
> > > Have you considered making the default (non-EAS) wake-up path a little
> > > more reluctant to migrations when EAS is enabled? That should allow us
> > > to maintain a somewhat stable task placement when OU is only transient
> > > (e.g. due to misfit), but without using util values when we really
> > > shouldn't.
> > >
> > > Thoughts?
> >
> > As mentioned above, OU doesn't mean there is no idle time anymore,
> > and in that case utilization is still relevant.
>
> OK, but please distinguish this from the truly overloaded case somehow,
> I really don't think we can 'break' it just to help with the corner case
> when we've got 90% ish util.
>
> > I would be in favor of adding more performance-related decisions into
> > feec(), similarly to what is done in patch 3; for example, if a task
> > doesn't fit on any CPU, we could still return a CPU chosen with a more
> > performance-focused criterion.
>
> Fine with me in principle as long as we stop using utilization as a
> proxy for how much a task wants when it really isn't that any more.
>
> Thanks!
> Quentin
>
> > >
> > > Thanks,
> > > Quentin
> > >
> > > > new_cpu = find_energy_efficient_cpu(p, prev_cpu);
> > > > if (new_cpu >= 0)
> > > > return new_cpu;
> > > > --
> > > > 2.34.1
> > > >
> > > >