Re: [PATCH -v2 12/18] sched/fair: Rewrite PELT migration propagation
From: Vincent Guittot
Date: Mon Oct 09 2017 - 11:03:41 EST
Hi Peter,
On 1 September 2017 at 15:21, Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
> When an entity migrates in (or out) of a runqueue, we need to add (or
> remove) its contribution from the entire PELT hierarchy, because even
> non-runnable entities are included in the load average sums.
>
> In order to do this we have some propagation logic that updates the
> PELT tree, however the way it 'propagates' the runnable (or load)
> change is (more or less):
>
>                      tg->weight * grq->avg.load_avg
>   ge->avg.load_avg = ------------------------------
>                               tg->load_avg
>
> But that is the expression for ge->weight, and per the definition of
> load_avg:
>
> ge->avg.load_avg := ge->weight * ge->avg.runnable_avg
>
> That destroys the runnable_avg (by setting it to 1) we wanted to
> propagate.
>
> Instead directly propagate runnable_sum.
>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@xxxxxxxxxxxxx>
> ---
> kernel/sched/debug.c | 2
> kernel/sched/fair.c | 186 ++++++++++++++++++++++++++++-----------------------
> kernel/sched/sched.h | 9 +-
> 3 files changed, 112 insertions(+), 85 deletions(-)
>
> --- a/kernel/sched/debug.c
> +++ b/kernel/sched/debug.c
> @@ -565,6 +565,8 @@ void print_cfs_rq(struct seq_file *m, in
> cfs_rq->removed.load_avg);
> SEQ_printf(m, " .%-30s: %ld\n", "removed.util_avg",
> cfs_rq->removed.util_avg);
> + SEQ_printf(m, " .%-30s: %ld\n", "removed.runnable_sum",
> + cfs_rq->removed.runnable_sum);
> #ifdef CONFIG_FAIR_GROUP_SCHED
> SEQ_printf(m, " .%-30s: %lu\n", "tg_load_avg_contrib",
> cfs_rq->tg_load_avg_contrib);
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -3330,11 +3330,77 @@ void set_task_rq_fair(struct sched_entit
> se->avg.last_update_time = n_last_update_time;
> }
>
> -/* Take into account change of utilization of a child task group */
> +
> +/*
> + * When on migration a sched_entity joins/leaves the PELT hierarchy, we need to
> + * propagate its contribution. The key to this propagation is the invariant
> + * that for each group:
> + *
> + * ge->avg == grq->avg (1)
> + *
> + * _IFF_ we look at the pure running and runnable sums. Because they
> + * represent the very same entity, just at different points in the hierarchy.

I agree for the running part, because only one entity can be running
at a time. I'm not sure about the pure runnable sum, though: we can
have several runnable tasks in a cfs_rq but only one runnable group
entity to reflect them. Or maybe I misunderstand (1).

As an example, take 2 always-running tasks, TA and TB, so load_sum is
LOAD_AVG_MAX for each task. Then

  grq->avg.load_sum = \Sum se->avg.load_sum = 2*LOAD_AVG_MAX

but ge->avg.load_sum will be only LOAD_AVG_MAX.

So if we apply d(TB->avg.load_sum) directly to the group hierarchy,
and to ge->avg.load_sum in particular, the latter decreases to 0
whereas it should decrease only by half.

I have been able to see this wrong behavior with an rt-app json file,
so I think that we should instead remove only

  delta = se->avg.load_sum / grq->avg.load_sum * ge->avg.load_sum

We don't have grq->avg.load_sum, but we can get a rough estimate with
grq->avg.load_avg/grq->weight

> + *
> + *
> + * Per the above update_tg_cfs_util() is trivial (and still 'wrong') and
> + * simply copies the running sum over.
> + *
> + * However, update_tg_cfs_runnable() is more complex. So we have:
> + *
> + * ge->avg.load_avg = ge->load.weight * ge->avg.runnable_avg (2)
> + *
> + * And since, like util, the runnable part should be directly transferable,
> + * the following would _appear_ to be the straight forward approach:
> + *
> + * grq->avg.load_avg = grq->load.weight * grq->avg.running_avg (3)
> + *
> + * And per (1) we have:
> + *
> + * ge->avg.running_avg == grq->avg.running_avg
> + *
> + * Which gives:
> + *
> + *                      ge->load.weight * grq->avg.load_avg
> + *   ge->avg.load_avg = -----------------------------------     (4)
> + *                               grq->load.weight
> + *
> + * Except that is wrong!
> + *
> + * Because while for entities historical weight is not important and we
> + * really only care about our future and therefore can consider a pure
> + * runnable sum, runqueues can NOT do this.
> + *
> + * We specifically want runqueues to have a load_avg that includes
> + * historical weights. Those represent the blocked load, the load we expect
> + * to (shortly) return to us. This only works by keeping the weights as
> + * integral part of the sum. We therefore cannot decompose as per (3).
> + *
> + * OK, so what then?
> + *
> + *
> + * Another way to look at things is:
> + *
> + * grq->avg.load_avg = \Sum se->avg.load_avg
> + *
> + * Therefore, per (2):
> + *
> + * grq->avg.load_avg = \Sum se->load.weight * se->avg.runnable_avg
> + *
> + * And the very thing we're propagating is a change in that sum (someone
> + * joined/left). So we can easily know the runnable change, which would be, per
> + * (2) the already tracked se->load_avg divided by the corresponding
> + * se->weight.
> + *
> + * Basically (4) but in differential form:
> + *
> + * d(runnable_avg) += se->avg.load_avg / se->load.weight
> + * (5)
> + * ge->avg.load_avg += ge->load.weight * d(runnable_avg)
> + */
> +
[snip]