Re: [RFC PATCH 4/5] sched/fair: Use EAS also when overutilized
From: Vincent Guittot
Date: Thu Oct 03 2024 - 02:27:19 EST
On Tue, 1 Oct 2024 at 19:51, Quentin Perret <qperret@xxxxxxxxxx> wrote:
>
> On Tuesday 01 Oct 2024 at 18:20:03 (+0200), Vincent Guittot wrote:
> > With commit 50181c0cff31 ("sched/pelt: Avoid underestimation of task
> > utilization"), the util_est remains set the value before having to
> > share the cpu with other tasks which means that the util_est remains
> > correct even if its util_avg decrease because of sharing the cpu with
> > other task. This has been done to cover the cases that you mention
> > above whereboth util_avg and util_est where decreasing when tasks
> > starts to share the CPU bandwidth with others
>
> I don't think I agree about the correctness of that util_est value at
> all. The above patch only makes it arbitrarily out of date in the truly
> overcommitted case. All the util-based heuristic we have in the
> scheduler are based around the assumption that the close future will
> look like the recent past, so using an arbitrarily old util-est is still
> incorrect. I can understand how this may work OK in RT-app or other
This fixes a real use case on android device
> use-cases with perfectly periodic tasks for their entire lifetime and
> such, but this doesn't work at all in the general case.
>
> > And feec() will return -1 for that case because util_est remains high
>
> And again, checking that a task fits is broken to start with if we don't
> know how big the task is. When we have reasons to believe that the util
> values are no longer correct (and the absence of idle time is a very
> good reason for that) we just need to give up on them. The fact that we
> have to resort to using out-of-date data to sort of make that work is
> just another proof that this is not a good idea in the general case.
That's where I disagree, this is not an out-of-date value, this is the
last correct one before sharing the cpu
>
> > the commit that I mentioned above covers those cases and the task will
> > not incorrectly fit to another smaller CPU because its util_est is
> > preserved during the overutilized phase
>
> There are other reasons why a task may look like it fits, e.g. two tasks
> coscheduled on a big CPU get 50% util each, then we migrate one away, the
50% of what ? not the cpu capacity. I think you miss one piece of the
recent pelt behavior here. I fullygree that when the system os
overcommitted the util base task placement is not correct but I also
think that feec() can't find a cpu in such case
> CPU looks half empty. Is it half empty? We've got no way to tell until
The same here, it's not thanks to util_est
> we see idle time. The current util_avg and old util_est value are just
> not helpful, they're both bad signals and we should just discard them.
>
> So again I do feel like the best way forward would be to change the
> nature of the OU threshold to actually ask cpuidle 'when was the last
> time there was idle time?' (or possibly cache that in the idle task
> directly). And then based on that we can decide whether we want to enter
> feec() and do util-based decision, or to kick the push-pull mechanism in
> your other patches, things like that. That would solve/avoid the problem
> I mentioned in the previous paragraph and make the OU detection more
> robust. We could also consider using different thresholds in different
> places to re-enable load-balancing earlier, and give up on feec() a bit
> later to avoid messing the entire task placement when we're only
> transiently OU because of misfit. But eventually, we really need to just
> give up on util values altogether when we're really overcommitted, it's
> really an invariant we need to keep.
For now, I will increase the OU threshold to cpu capacity to reduce
the false overutilized state because of misfit tasks which is what I
really care about. The redesign of OU will come in a different series
as this implies more rework. IIUC your point, we are more interested
by the prev cpu than the current one
>
> Thanks,
> Quentin