Re: [RFC PATCH 4/5] sched/fair: Use EAS also when overutilized
From: Quentin Perret
Date: Thu Oct 03 2024 - 04:21:45 EST
On Thursday 03 Oct 2024 at 08:27:00 (+0200), Vincent Guittot wrote:
> On Tue, 1 Oct 2024 at 19:51, Quentin Perret <qperret@xxxxxxxxxx> wrote:
> > And again, checking that a task fits is broken to start with if we don't
> > know how big the task is. When we have reasons to believe that the util
> > values are no longer correct (and the absence of idle time is a very
> > good reason for that) we just need to give up on them. The fact that we
> > have to resort to using out-of-date data to sort of make that work is
> > just another proof that this is not a good idea in the general case.
>
> That's where I disagree, this is not an out-of-date value, this is the
> last correct one before sharing the cpu
This value is arbitrarily old, so of course it is out of date. This only
sort of works for tasks that don't change their behaviour. That's true
for some use-cases, yes, but absolutely not in the general case. How
can you know that the last correct value before sharing the CPU is still
valid minutes later? The fact that the system started to be
overcommitted is a good indication that something has changed, so we
really can't tell. Also, how is any of this going to work for newly
created tasks while we're overcommitted for example?
> > > the commit that I mentioned above covers those cases and the task will
> > > not incorrectly fit to another smaller CPU because its util_est is
> > > preserved during the overutilized phase
> >
> > There are other reasons why a task may look like it fits, e.g. two tasks
> > coscheduled on a big CPU get 50% util each, then we migrate one away, the
>
> 50% of what ?
50% of SCHED_CAPACITY_SCALE (the above sentence mentions a 'big' CPU, and
for simplicity I assumed no 'pressure' of any kind).
> not the cpu capacity. I think you miss one piece of the
> recent pelt behavior here
That could very well be the case, which piece are you thinking of?
> I fully agree that when the system is
> overcommitted the util-based task placement is not correct but I also
> think that feec() can't find a cpu in such case
But why are we even entering feec() then? Isn't this just looking for
trouble really? As per the example above, task migrations can cause util
'gaps' on the source CPU which may make it appear like a good candidate
from an energy standpoint, but it's all bogus really. And let's not even
talk about how wrong the EM is going to be when simulating a potential task
migration in the overcommitted case.
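
To make the 'gap' example concrete, here is a rough sketch of the
arithmetic. It is a toy steady-state model, not PELT and not kernel code:
observed_util() and apparent_spare() are hypothetical helpers, and the
proportional split is an assumption that only approximates fair sharing.
Only SCHED_CAPACITY_SCALE (1024) is taken from the kernel.

```c
#define SCHED_CAPACITY_SCALE 1024UL /* same value as the kernel constant */

/*
 * Hypothetical helper: utilization a task *appears* to have when the
 * CPU is shared. If the total demand fits, the task gets what it asks
 * for; otherwise its share is scaled down (assumed proportional split).
 */
static unsigned long observed_util(unsigned long demand,
				   unsigned long total_demand,
				   unsigned long capacity)
{
	if (total_demand <= capacity)
		return demand;
	return demand * capacity / total_demand;
}

/* Hypothetical helper: spare capacity as seen through the util sum. */
static unsigned long apparent_spare(unsigned long capacity,
				    unsigned long util_sum)
{
	return capacity > util_sum ? capacity - util_sum : 0;
}
```

With two always-running tasks (demand 1024 each) on a big CPU, each one
is observed at 512. Once one of them is migrated away, the source CPU
reports apparent_spare(1024, 512) == 512, i.e. 'half empty', although
the remaining task would consume all 1024 as soon as it is given the
chance.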
> > CPU looks half empty. Is it half empty? We've got no way to tell until
>
> The same here, it's not thanks to util_est
And again, an out-of-date util_est value is not helpful in the general
case. It helps certain use-cases, sure, but please let's not promote it
to a load-bearing construct on top of which we build our entire
scheduling strategy :-)
> > we see idle time. The current util_avg and old util_est value are just
> > not helpful, they're both bad signals and we should just discard them.
> >
> > So again I do feel like the best way forward would be to change the
> > nature of the OU threshold to actually ask cpuidle 'when was the last
> > time there was idle time?' (or possibly cache that in the idle task
> > directly). And then based on that we can decide whether we want to enter
> > feec() and do util-based decision, or to kick the push-pull mechanism in
> > your other patches, things like that. That would solve/avoid the problem
> > I mentioned in the previous paragraph and make the OU detection more
> > robust. We could also consider using different thresholds in different
> > places to re-enable load-balancing earlier, and give up on feec() a bit
> > later to avoid messing the entire task placement when we're only
> > transiently OU because of misfit. But eventually, we really need to just
> > give up on util values altogether when we're really overcommitted, it's
> > really an invariant we need to keep.
>
> For now, I will increase the OU threshold to cpu capacity to reduce
> the false overutilized state because of misfit tasks which is what I
> really care about.
Cool, and FWIW I am supportive of making this whole part of the code
better -- a transient OU state due to misfit does make a mess of things
and we should indeed be able to do better.
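
For illustration only, the idle-time-based check suggested earlier in
the thread could look roughly like the sketch below. Everything in it
is hypothetical: struct cpu_idle_stamp, note_idle(), util_trustworthy()
and the max_busy_ns threshold are made-up names, not existing kernel or
cpuidle interfaces.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-CPU record of the last time idle time was seen. */
struct cpu_idle_stamp {
	uint64_t last_idle_ns;
};

/* Would be called whenever the CPU enters idle (assumed hook). */
static void note_idle(struct cpu_idle_stamp *s, uint64_t now_ns)
{
	s->last_idle_ns = now_ns;
}

/*
 * Util values are only treated as meaningful if the CPU has been idle
 * recently enough. Past max_busy_ns without idle time, the caller
 * would give up on util-based placement.
 */
static bool util_trustworthy(const struct cpu_idle_stamp *s,
			     uint64_t now_ns, uint64_t max_busy_ns)
{
	if (now_ns < s->last_idle_ns)	/* clock skew: assume recent idle */
		return true;
	return now_ns - s->last_idle_ns <= max_busy_ns;
}
```

A caller could use util_trustworthy() to choose between entering feec()
and falling back to load-based balancing, possibly with a different
max_busy_ns in each place.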
> The redesign of OU will come in a different series
> as this implies more rework.
Ack, this can be made orthogonal to this work I think.
> IIUC your point, we are more interested
> in the prev cpu than the current one
Hmm, not sure I understand that part. What do you mean?
Thanks,
Quentin