Re: [RFC PATCH 4/5] sched/fair: Use EAS also when overutilized
From: Vincent Guittot
Date: Thu Oct 03 2024 - 04:58:39 EST
On Thu, 3 Oct 2024 at 10:21, Quentin Perret <qperret@xxxxxxxxxx> wrote:
>
> On Thursday 03 Oct 2024 at 08:27:00 (+0200), Vincent Guittot wrote:
> > On Tue, 1 Oct 2024 at 19:51, Quentin Perret <qperret@xxxxxxxxxx> wrote:
> > > And again, checking that a task fits is broken to start with if we don't
> > > know how big the task is. When we have reasons to believe that the util
> > > values are no longer correct (and the absence of idle time is a very
> > > good reason for that) we just need to give up on them. The fact that we
> > > have to resort to using out-of-date data to sort of make that work is
> > > just another proof that this is not a good idea in the general case.
> >
> > That's where I disagree, this is not an out-of-date value, this is the
> > last correct one before sharing the cpu
>
> This value is arbitrarily old, so of course it is out of date. This only
> sort of works for tasks that don't change their behaviour. That's true
> for some use-cases, yes, but absolutely not in the general case. How
> can you know that the last correct value before sharing the CPU is still
> valid minutes later? The fact that the system started to be
> overcommitted is a good indication that something has changed, so we
> really can't tell. Also, how is any of this going to work for newly
> created tasks while we're overcommitted for example?
>
> > > > the commit that I mentioned above covers those cases and the task will
> > > > not incorrectly fit to another smaller CPU because its util_est is
> > > > preserved during the overutilized phase
> > >
> > > There are other reasons why a task may look like it fits, e.g. two tasks
> > > coscheduled on a big CPU get 50% util each, then we migrate one away, the
> >
> > 50% of what ?
>
> 50% of SCHED_CAPACITY_SCALE (the above sentence mentions a 'big' CPU, and
> for simplicity I assumed no 'pressure' of any kind).
ok, i missed the big cpu
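For reference, the 50% figure in that example can be reproduced with a simplified PELT-style geometric average. This is only a sketch under stated assumptions: it uses floating point and fixed ~1ms periods, and pelt_step() and util_when_sharing() are hypothetical helpers, not the kernel's fixed-point code.

```c
#include <assert.h>

#define SCHED_CAPACITY_SCALE 1024.0
/* Approximation of 0.5^(1/32): contribution halves every 32 periods. */
#define PELT_Y 0.97857206

/* One period of a simplified PELT-style update (hypothetical helper). */
static double pelt_step(double util, int running)
{
	double contrib = running ? SCHED_CAPACITY_SCALE * (1.0 - PELT_Y) : 0.0;

	return util * PELT_Y + contrib;
}

/*
 * Utilization of a task that only gets every other period, i.e. one of
 * two always-runnable tasks time-sharing the same CPU (assumed model).
 */
static double util_when_sharing(int periods)
{
	double util = 0.0;

	assert(periods >= 0);
	for (int i = 0; i < periods; i++)
		util = pelt_step(util, i % 2 == 0);

	return util;
}
```

With this model the signal settles close to half of SCHED_CAPACITY_SCALE, even though the task would use the whole CPU if it were alone on it.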
>
> > not the cpu capacity. I think you miss one piece of the
> > recent pelt behavior here
>
> That could very well be the case, which piece are you thinking of?
The current pelt algorithm tracks actual cpu utilization and can go
above cpu capacity (but not above 1024), so a task's utilization can
become bigger than a little cpu's capacity
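A minimal sketch of what that means for a fits check. The margin test is modelled on the 1280/1024 form of fits_capacity() in fair.c; the helper name task_fits_cpu() and the little cpu capacity of 400 are assumptions for illustration only.

```c
#include <assert.h>
#include <stdbool.h>

#define SCHED_CAPACITY_SCALE 1024UL
/* Hypothetical capacity of a little cpu, for illustration only. */
#define LITTLE_CPU_CAPACITY 400UL

/*
 * Returns true when util fits in capacity with ~20% headroom.
 * Hypothetical helper; util is expected to stay within 1024.
 */
static bool task_fits_cpu(unsigned long util, unsigned long capacity)
{
	assert(util <= SCHED_CAPACITY_SCALE);

	return util * 1280 < capacity * 1024;
}
```

A task whose utilization has grown to 600 is above the little cpu's capacity here, so it no longer fits there, but it still fits a cpu with capacity 1024.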
>
> > I fully agree that when the system is
> > overcommitted the util-based task placement is not correct but I also
> > think that feec() can't find a cpu in such case
>
> But why are we even entering feec() then? Isn't this just looking for
> trouble really? As per the example above, task migrations can cause util
> 'gaps' on the source CPU which may make it appear like a good candidate
> from an energy standpoint, but it's all bogus really. And let's not even
> talk about how wrong the EM is going to be when simulating a potential task
> migration in the overcommitted case.
>
> > > CPU looks half empty. Is it half empty? We've got no way to tell until
> >
> > The same here, it's not thanks to util_est
>
> And again, an out-of-date util est value is not helpful in the general
> case. It helps certain use-cases, sure, but please let's not promote it
> to a load-bearing construct on top of which we build our entire
> scheduling strategy :-)
>
> > > we see idle time. The current util_avg and old util_est value are just
> > > not helpful, they're both bad signals and we should just discard them.
> > >
> > > So again I do feel like the best way forward would be to change the
> > > nature of the OU threshold to actually ask cpuidle 'when was the last
> > > time there was idle time?' (or possibly cache that in the idle task
> > > directly). And then based on that we can decide whether we want to enter
> > > feec() and do util-based decision, or to kick the push-pull mechanism in
> > > your other patches, things like that. That would solve/avoid the problem
> > > I mentioned in the previous paragraph and make the OU detection more
> > > robust. We could also consider using different thresholds in different
> > > places to re-enable load-balancing earlier, and give up on feec() a bit
> > > later to avoid messing the entire task placement when we're only
> > > transiently OU because of misfit. But eventually, we really need to just
> > > give up on util values altogether when we're really overcommitted, it's
> > > really an invariant we need to keep.
> >
> > For now, I will increase the OU threshold to cpu capacity to reduce
> > the false overutilized state because of misfit tasks which is what I
> > really care about.
>
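To make the threshold change concrete, here is a sketch comparing a margin-based overutilized test with one that only triggers at full cpu capacity. Both helpers are hypothetical, and reading "increase the OU threshold to cpu capacity" as a plain util >= capacity comparison is an assumption, not something taken from the patch.

```c
#include <assert.h>
#include <stdbool.h>

/* Overutilized once util eats into a ~20% headroom (1280/1024 margin). */
static bool cpu_overutilized_margin(unsigned long util, unsigned long capacity)
{
	return util * 1280 >= capacity * 1024;
}

/* Overutilized only once util reaches the full cpu capacity (assumed reading). */
static bool cpu_overutilized_full(unsigned long util, unsigned long capacity)
{
	return util >= capacity;
}
```

With util at 900 on a cpu of capacity 1024, the margin-based test already reports overutilized while the capacity-based one does not, which is the kind of transient case a misfit task can trigger.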
> Cool, and FWIW I am supportive of making this whole part of the code
> better -- a transient OU state due to misfit does make a mess of things
> and we should indeed be able to do better.
>
> > The redesign of OU will come in a different series
> > as this implies more rework.
>
> Ack, this can be made orthogonal to this work I think.
>
> > IIUC your point, we are more interested
> > in the prev cpu than the current one
>
> Hmm, not sure I understand that part. What do you mean?
As replied to Lukasz, if you want to discard utilization of a task
you need to check the previous cpu
>
> Thanks,
> Quentin