Re: [PATCH 10/24] sched/uclamg: Handle delayed dequeue

From: Peter Zijlstra
Date: Thu Sep 05 2024 - 10:55:23 EST


On Thu, Sep 05, 2024 at 04:07:01PM +0200, Dietmar Eggemann wrote:

> > Unfortunately, this is not only about util_est
> >
> > cfs_rq's runnable_avg is also wrong because we normally have :
> > cfs_rq's runnable_avg == /Sum se's runnable_avg
> > but cfs_rq's runnable_avg uses cfs_rq's h_nr_running but delayed
> > entities are still accounted in h_nr_running
>
> Yes, I agree.
>
> se's runnable_avg should be fine already since:
>
> se_runnable()
>
> if (se->sched_delayed)
> return false
>
> But then, like you said, __update_load_avg_cfs_rq() needs correct
> cfs_rq->h_nr_running.

Uff. So yes __update_load_avg_cfs_rq() needs a different number, but
I'll contest that h_nr_running is in fact correct, albeit no longer
suitable for this purpose.

We can track h_nr_delayed I suppose, and subtract that.

> And I guess we need something like:
>
> se_on_rq()
>
> if (se->sched_delayed)
> return false
>
> for
>
> __update_load_avg_se()
>
> - if (___update_load_sum(now, &se->avg, !!se->on_rq, se_runnable(se),
> + if (___update_load_sum(now, &se->avg, se_on_rq(se), se_runnable(se),
>
>
> My hope was we can fix util_est independently since it drives CPU
> frequency. Whereas PELT load_avg and runnable_avg are "only" used for
> load balancing. But I agree, it has to be fixed as well.
>
> > That also means that cfs_rq's h_nr_running is not accurate anymore
> > because it includes delayed dequeue
>
> +1
>
> > and cfs_rq load_avg is kept artificially high which biases
> > load_balance and cgroup's shares
>
> +1

Again, fundamentally the delayed tasks are delayed because they need to
remain part of the competition in order to 'earn' time. It really is
fully on_rq, and should be for the purpose of load and load-balancing.

It is only special in that it will never run again (until it gets
woken).

Consider (2 CPUs, 4 tasks):

CPU1 CPU2
A D
B (delayed)
C

Then migrating any one of the tasks on CPU1 to CPU2 will make them all
earn time at 1/2 instead of 1/3 vs 1/1. More fair etc.

Yes, I realize this might seem weird, but we're going to be getting a
ton more of this weirdness once proxy execution lands, then we'll be
having the entire block chain still on the runqueue (and actually
consuming time).