Re: [RFC PATCH 4/5] sched/fair: Use EAS also when overutilized
From: Quentin Perret
Date: Tue Oct 01 2024 - 13:51:42 EST
On Tuesday 01 Oct 2024 at 18:20:03 (+0200), Vincent Guittot wrote:
> With commit 50181c0cff31 ("sched/pelt: Avoid underestimation of task
> utilization"), the util_est remains set the value before having to
> share the cpu with other tasks which means that the util_est remains
> correct even if its util_avg decrease because of sharing the cpu with
> other task. This has been done to cover the cases that you mention
> above whereboth util_avg and util_est where decreasing when tasks
> starts to share the CPU bandwidth with others
I don't think I agree about the correctness of that util_est value at
all. The above patch only makes it arbitrarily out of date in the truly
overcommitted case. All the util-based heuristic we have in the
scheduler are based around the assumption that the close future will
look like the recent past, so using an arbitrarily old util-est is still
incorrect. I can understand how this may work OK in RT-app or other
use-cases with perfectly periodic tasks for their entire lifetime and
such, but this doesn't work at all in the general case.
> And feec() will return -1 for that case because util_est remains high
And again, checking that a task fits is broken to start with if we don't
know how big the task is. When we have reasons to believe that the util
values are no longer correct (and the absence of idle time is a very
good reason for that) we just need to give up on them. The fact that we
have to resort to using out-of-date data to sort of make that work is
just another proof that this is not a good idea in the general case.
> the commit that I mentioned above covers those cases and the task will
> not incorrectly fit to another smaller CPU because its util_est is
> preserved during the overutilized phase
There are other reasons why a task may look like it fits, e.g. two tasks
coscheduled on a big CPU get 50% util each, then we migrate one away, the
CPU looks half empty. Is it half empty? We've got no way to tell until
we see idle time. The current util_avg and old util_est value are just
not helpful, they're both bad signals and we should just discard them.
So again I do feel like the best way forward would be to change the
nature of the OU threshold to actually ask cpuidle 'when was the last
time there was idle time?' (or possibly cache that in the idle task
directly). And then based on that we can decide whether we want to enter
feec() and do util-based decision, or to kick the push-pull mechanism in
your other patches, things like that. That would solve/avoid the problem
I mentioned in the previous paragraph and make the OU detection more
robust. We could also consider using different thresholds in different
places to re-enable load-balancing earlier, and give up on feec() a bit
later to avoid messing the entire task placement when we're only
transiently OU because of misfit. But eventually, we really need to just
give up on util values altogether when we're really overcommitted, it's
really an invariant we need to keep.
Thanks,
Quentin