Re: [PATCH -v2 12/18] sched/fair: Rewrite PELT migration propagation

From: Vincent Guittot
Date: Mon Oct 09 2017 - 11:29:32 EST


On 9 October 2017 at 17:03, Vincent Guittot <vincent.guittot@xxxxxxxxxx> wrote:
> Hi Peter,
>
> On 1 September 2017 at 15:21, Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
>> When an entity migrates in (or out) of a runqueue, we need to add (or
>> remove) its contribution from the entire PELT hierarchy, because even
>> non-runnable entities are included in the load average sums.
>>
>> In order to do this we have some propagation logic that updates the
>> PELT tree, however the way it 'propagates' the runnable (or load)
>> change is (more or less):
>>
>>                       tg->weight * grq->avg.load_avg
>>   ge->avg.load_avg = ------------------------------
>>                               tg->load_avg
>>
>> But that is the expression for ge->weight, and per the definition of
>> load_avg:
>>
>> ge->avg.load_avg := ge->weight * ge->avg.runnable_avg
>>
>> That destroys the runnable_avg (by setting it to 1) we wanted to
>> propagate.
>>
>> Instead directly propagate runnable_sum.
>>
>> Signed-off-by: Peter Zijlstra (Intel) <peterz@xxxxxxxxxxxxx>
>> ---
>> kernel/sched/debug.c | 2
>> kernel/sched/fair.c | 186 ++++++++++++++++++++++++++++-----------------------
>> kernel/sched/sched.h | 9 +-
>> 3 files changed, 112 insertions(+), 85 deletions(-)
>>
>> --- a/kernel/sched/debug.c
>> +++ b/kernel/sched/debug.c
>> @@ -565,6 +565,8 @@ void print_cfs_rq(struct seq_file *m, in
>> cfs_rq->removed.load_avg);
>> SEQ_printf(m, " .%-30s: %ld\n", "removed.util_avg",
>> cfs_rq->removed.util_avg);
>> + SEQ_printf(m, " .%-30s: %ld\n", "removed.runnable_sum",
>> + cfs_rq->removed.runnable_sum);
>> #ifdef CONFIG_FAIR_GROUP_SCHED
>> SEQ_printf(m, " .%-30s: %lu\n", "tg_load_avg_contrib",
>> cfs_rq->tg_load_avg_contrib);
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -3330,11 +3330,77 @@ void set_task_rq_fair(struct sched_entit
>> se->avg.last_update_time = n_last_update_time;
>> }
>>
>> -/* Take into account change of utilization of a child task group */
>> +
>> +/*
>> + * When on migration a sched_entity joins/leaves the PELT hierarchy, we need to
>> + * propagate its contribution. The key to this propagation is the invariant
>> + * that for each group:
>> + *
>> + * ge->avg == grq->avg (1)
>> + *
>> + * _IFF_ we look at the pure running and runnable sums. Because they
>> + * represent the very same entity, just at different points in the hierarchy.
>
> I agree for the running part, because only one entity can be running,
> but I'm not sure about the pure runnable sum: we can have several
> runnable tasks in a cfs_rq but only one runnable group entity to
> reflect them.
> Or am I misunderstanding (1)?
>
> As an example, take 2 always-running tasks TA and TB, so the
> load_sum is LOAD_AVG_MAX for each task.
> Then grq->avg.load_sum = \Sum se->avg.load_sum = 2*LOAD_AVG_MAX,
> but
> ge->avg.load_sum will be only LOAD_AVG_MAX.
>
> So if we apply d(TB->avg.load_sum) directly to the group hierarchy,
> and to ge->avg.load_sum in particular, the latter decreases to 0
> whereas it should decrease only by half.
>
> I have been able to see this wrong behavior with an rt-app JSON file,
>
> so I think that we should instead remove only
>
> delta = se->avg.load_sum / grq->avg.load_sum * ge->avg.load_sum

delta = se->avg.load_sum / (grq->avg.load_sum + se->avg.load_sum) *
        ge->avg.load_sum

as the se has already been detached
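
To make the two-task example concrete, here is a minimal userspace sketch of
the corrected delta. detach_delta() is a hypothetical helper for illustration,
not a function from the patch or the kernel; only the LOAD_AVG_MAX value is
taken from the kernel, and the 64-bit integer arithmetic is an assumption.

```c
#include <assert.h>
#include <stdint.h>

/* Kernel constant: the value a PELT sum converges to when always running. */
#define LOAD_AVG_MAX 47742

/*
 * Hypothetical helper (not part of the patch): portion of ge_load_sum to
 * remove for an entity that has ALREADY been detached from the group
 * runqueue, so grq_load_sum no longer contains se_load_sum.
 */
static uint64_t detach_delta(uint64_t se_load_sum, uint64_t grq_load_sum,
			     uint64_t ge_load_sum)
{
	/* Sum of the group runqueue before the entity was detached. */
	uint64_t total = grq_load_sum + se_load_sum;

	assert(total >= se_load_sum);	/* the addition must not wrap */

	if (total == 0)
		return 0;		/* nothing was ever attached */

	/* Multiply first so the integer division keeps precision. */
	return se_load_sum * ge_load_sum / total;
}
```

With TA and TB both at LOAD_AVG_MAX and TB already detached,
detach_delta(LOAD_AVG_MAX, LOAD_AVG_MAX, LOAD_AVG_MAX) gives LOAD_AVG_MAX / 2,
so ge->avg.load_sum drops by half instead of to 0.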

> We don't have grq->avg.load_sum, but we can get a rough estimate from
> grq->avg.load_avg/grq->weight
>
>
>
>> + *
>> + *
>> + * Per the above update_tg_cfs_util() is trivial (and still 'wrong') and
>> + * simply copies the running sum over.
>> + *
>> + * However, update_tg_cfs_runnable() is more complex. So we have:
>> + *
>> + * ge->avg.load_avg = ge->load.weight * ge->avg.runnable_avg (2)
>> + *
>> + * And since, like util, the runnable part should be directly transferable,
>> + * the following would _appear_ to be the straight forward approach:
>> + *
>> + *   grq->avg.load_avg = grq->load.weight * grq->avg.runnable_avg      (3)
>> + *
>> + * And per (1) we have:
>> + *
>> + *   ge->avg.runnable_avg == grq->avg.runnable_avg
>> + *
>> + * Which gives:
>> + *
>> + *                      ge->load.weight * grq->avg.load_avg
>> + *   ge->avg.load_avg = -----------------------------------      (4)
>> + *                               grq->load.weight
>> + *
>> + * Except that is wrong!
>> + *
>> + * Because while for entities historical weight is not important and we
>> + * really only care about our future and therefore can consider a pure
>> + * runnable sum, runqueues can NOT do this.
>> + *
>> + * We specifically want runqueues to have a load_avg that includes
>> + * historical weights. Those represent the blocked load, the load we expect
>> + * to (shortly) return to us. This only works by keeping the weights as
>> + * integral part of the sum. We therefore cannot decompose as per (3).
>> + *
>> + * OK, so what then?
>> + *
>> + *
>> + * Another way to look at things is:
>> + *
>> + * grq->avg.load_avg = \Sum se->avg.load_avg
>> + *
>> + * Therefore, per (2):
>> + *
>> + * grq->avg.load_avg = \Sum se->load.weight * se->avg.runnable_avg
>> + *
>> + * And the very thing we're propagating is a change in that sum (someone
>> + * joined/left). So we can easily know the runnable change, which would be, per
>> + * (2) the already tracked se->load_avg divided by the corresponding
>> + * se->weight.
>> + *
>> + * Basically (4) but in differential form:
>> + *
>> + *   d(runnable_avg) += se->avg.load_avg / se->load.weight
>> + *                                                               (5)
>> + *   ge->avg.load_avg += ge->load.weight * d(runnable_avg)
>> + */
>> +
>
> [snip]
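
For reference, the differential form (5) from the quoted comment can be
sketched in userspace C as below. struct ent, SCALE and propagate_load() are
hypothetical names used only for illustration, and the fixed-point scaling and
the clamp at 0 are assumptions of this sketch, not what the patch implements.

```c
#include <assert.h>

/* Assumed fixed-point unit so runnable_avg in [0, 1] survives integer math. */
#define SCALE 1024L

/* Hypothetical stand-in for the sched_entity fields that (5) refers to. */
struct ent {
	long weight;	/* plays the role of se->load.weight */
	long load_avg;	/* plays the role of se->avg.load_avg */
};

/*
 * Apply (5) to the group entity ge for an entity se that joins (sign = +1)
 * or leaves (sign = -1) the group runqueue.
 */
static void propagate_load(struct ent *ge, const struct ent *se, int sign)
{
	long d_runnable;

	assert(se->weight > 0);

	/* d(runnable_avg) = se->avg.load_avg / se->load.weight, scaled */
	d_runnable = sign * (se->load_avg * SCALE / se->weight);

	/* ge->avg.load_avg += ge->load.weight * d(runnable_avg) */
	ge->load_avg += ge->weight * d_runnable / SCALE;

	if (ge->load_avg < 0)
		ge->load_avg = 0;	/* clamp, an assumption of this sketch */
}
```

For example, an se with weight 1024 and load_avg 512 (runnable half of the
time) joining a ge of weight 2048 adds 1024 to ge->load_avg, and the same se
leaving removes it again.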