Re: [PATCH] sched: fix group_entity's share update

From: Vincent Guittot
Date: Mon Dec 19 2016 - 12:37:58 EST


On 16 December 2016 at 09:55, Vincent Guittot
<vincent.guittot@xxxxxxxxxx> wrote:
> On 15 December 2016 at 22:42, Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
>>
>> On Thu, Dec 01, 2016 at 05:38:53PM +0100, Vincent Guittot wrote:
>> > The update of the share of a cfs_rq is done when its load_avg is updated
>> > but before the group_entity's load_avg has been updated for the past time
>> > slot. This generates wrong load_avg accounting which can be significant
>> > when small tasks are involved in the scheduling.
>> >
>> > Let's take the example of a task TA that is dequeued from its task group TG1.
>> > TA was the only task in TG1 which becomes idle.
>> >
>> > We have the sequence:
>> >
>> > - dequeue_entity TA->se
>> > - update_load_avg(TA->se)
>> > - dequeue_entity_load_avg(TG1->cfs_rq, TA->se)
>> > - account_entity_dequeue(TG1->cfs_rq, TA->se)
>> > TG1->cfs_rq->load.weight = 0
>> > - update_cfs_shares(TG1->cfs_rq)
>> > TG1->se->load.weight is updated with the new share of
>> > cfs_rq. TG1->se->load.weight = 0.
>> > - dequeue_entity TG1->se
>> > - update_load_avg(TG1->se) but its weight is now null so the last time
>> > slot (up to a tick) will be accounted with its new weight (0 in our case)
>> > instead of its real weight. The last time slot is accounted as an idle one
>> > whereas it was a running one.
>> >
>> > If the running time of TA is short enough that no tick happens when it
>> > runs, all running time of TG1->se will be accounted as idle time.
>> >
>> > Instead, we should update the share of a cfs_rq (in fact the weight of its
>> > group entity) only after having updated the load_avg of the group_entity.
>> >
>> > update_cfs_shares() now takes the sched_entity as parameter instead of the
>> > cfs_rq and the weight of the group_entity is updated only once its load_avg
>> > has been synced with current time.
>>
>> Urgh, brain hurt, also those names don't help; s/TG1/A/ s/TA/a/
>>
>> So the problem is that in our for_each_sched_entity(se) loop we end up
>> changing the next se before we get there.
>>
>>
>>   root
>> (cfs_rq)
>>      \
>>      (se)
>>        A
>>     (cfs_rq)
>>          \
>>          (se)
>>            a
>>
>>
>> Starting at a's se, we update_cfs_shares() on A's cfs_rq, which then
>> updates A's se, which is the next se in our iteration and mucks with
>> state before we get there.
>>
>> So you change update_cfs_shares() to go downward while we go upward,
>> ensuring we only update things that we've finished with.
>
> yes
>
>>
>> Makes sense..
>>
>> > kernel/sched/fair.c | 27 ++++++++++++++++-----------
>> > 1 file changed, 16 insertions(+), 11 deletions(-)
>> >
>> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> > index 18d9e75..19092fa 100644
>> > --- a/kernel/sched/fair.c
>> > +++ b/kernel/sched/fair.c
>> > @@ -2689,15 +2689,18 @@ static void reweight_entity(struct cfs_rq *cfs_rq, struct sched_entity *se,
>> >
>> > static inline int throttled_hierarchy(struct cfs_rq *cfs_rq);
>> >
>> > -static void update_cfs_shares(struct cfs_rq *cfs_rq)
>> > +static void update_cfs_shares(struct sched_entity *se)
>> > {
>> > struct task_group *tg;
>> > - struct sched_entity *se;
>> > + struct cfs_rq *cfs_rq = group_cfs_rq(se);
>> > long shares;
>>
>> please keep them ordered by length.
>
> Ok
>
>>
>> >
>> > + if (entity_is_task(se))
>>
>> can be: !cfs_rq, which is the same and we've already done that load.
>
> yes. My goal was to keep the meaning of the test more readable, and I
> was expecting the compiler to be smart enough to use the same load
> for both cfs_rq = group_cfs_rq(se) and entity_is_task(se).
>
> I can change it to !cfs_rq
>
>>
>> > + return;
>> > +
>> > tg = cfs_rq->tg;
>>
>> This load isn't needed here yet, can be moved down a bit.
>
> Indeed
>
>>
>> > - se = tg->se[cpu_of(rq_of(cfs_rq))];
>> > - if (!se || throttled_hierarchy(cfs_rq))
>> > +
>> > + if (throttled_hierarchy(cfs_rq))
>> > return;
>> > #ifndef CONFIG_SMP
>> > if (likely(se->load.weight == tg->shares))
>>
>>
>> > @@ -3583,9 +3588,9 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
>> > se->vruntime += cfs_rq->min_vruntime;
>> >
>> > update_load_avg(se, UPDATE_TG);
>> > + update_cfs_shares(se);
>> > enqueue_entity_load_avg(cfs_rq, se);
>> > account_entity_enqueue(cfs_rq, se);
>> > - update_cfs_shares(cfs_rq);
>> >
>> > if (flags & ENQUEUE_WAKEUP)
>> > place_entity(cfs_rq, se, 0);
>>
>> So here we need to update_cfs_shares() _before_ enqueue_entity, because
>> the update_cfs_shares() will affect this se's load, right?
>
> exactly

In fact, the only constraint is that update_cfs_shares() must be done
before account_entity_enqueue(). But there is no constraint with
enqueue_entity_load_avg(), so it's probably better to group the
manipulation of load together and the manipulation of weight together:

update_load_avg(se, UPDATE_TG);
enqueue_entity_load_avg(cfs_rq, se);
update_cfs_shares(se);
account_entity_enqueue(cfs_rq, se);
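
To illustrate the constraint, here is a toy model (not kernel code). All
toy_* names are made up for this example. It assumes that the reweight
step only fixes up the queue total when the entity is already marked
on_rq, and that on_rq is set only at the end of the enqueue; both are
assumptions of the sketch, not statements about the exact kernel code.

```c
#include <assert.h>

/* Hypothetical stand-ins for cfs_rq and sched_entity. */
struct toy_cfs_rq {
	unsigned long load_weight;	/* sum of accounted entity weights */
};

struct toy_se {
	unsigned long weight;		/* models se->load.weight */
	int on_rq;			/* models se->on_rq */
};

/* Models account_entity_enqueue(): add the weight to the queue total. */
void toy_account_enqueue(struct toy_cfs_rq *rq, struct toy_se *se)
{
	rq->load_weight += se->weight;
}

/* Models account_entity_dequeue(): remove the weight from the total. */
void toy_account_dequeue(struct toy_cfs_rq *rq, struct toy_se *se)
{
	assert(rq->load_weight >= se->weight);
	rq->load_weight -= se->weight;
}

/* Assumed reweight behaviour: the total is fixed up only if on_rq. */
void toy_reweight(struct toy_cfs_rq *rq, struct toy_se *se,
		  unsigned long weight)
{
	if (se->on_rq)
		toy_account_dequeue(rq, se);
	se->weight = weight;
	if (se->on_rq)
		toy_account_enqueue(rq, se);
}

/* Shares updated before accounting: the total sees the new weight. */
unsigned long toy_enqueue_shares_first(unsigned long old_w,
				       unsigned long new_w)
{
	struct toy_cfs_rq rq = { 0 };
	struct toy_se se = { old_w, 0 };

	toy_reweight(&rq, &se, new_w);
	toy_account_enqueue(&rq, &se);
	se.on_rq = 1;
	return rq.load_weight;
}

/* Shares updated after accounting: the total keeps the stale weight. */
unsigned long toy_enqueue_shares_last(unsigned long old_w,
				      unsigned long new_w)
{
	struct toy_cfs_rq rq = { 0 };
	struct toy_se se = { old_w, 0 };

	toy_account_enqueue(&rq, &se);
	toy_reweight(&rq, &se, new_w);	/* on_rq is still 0 here */
	se.on_rq = 1;
	return rq.load_weight;
}
```

With an old weight of 1024 and a new share of 512, the first ordering
leaves 512 in the queue total and the second leaves 1024, which no
longer matches the entity's weight.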

>
>>
>> > @@ -3681,7 +3686,7 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
>> > /* return excess runtime on last dequeue */
>> > return_cfs_rq_runtime(cfs_rq);
>> >
>> > - update_cfs_shares(cfs_rq);
>> > + update_cfs_shares(se);
>> >
>> > /*
>> > * Now advance min_vruntime if @se was the entity holding it back,
>>
>> But this one hurts my brain..
>>
>> It must be done after dequeue_entity_load_avg() such that we subtract
>> the load as was seen until now.
>
> update_cfs_shares(A's se) must be done after update_load_avg(A's se,
> UPDATE_TG), so that A's se->load_avg is updated for the previous time
> slot using the previous load.
>
> update_cfs_shares(A's se) could be done before or after
> dequeue_entity_load_avg(A's se) because the root's cfs_rq is kept in
> sync during the reweight of A's se. Nevertheless, I see one advantage
> in doing it after: reweight_entity will be faster because A's
> se->on_rq will have been cleared in the meantime
>
>>
>> Could we please add comments explaining this ordering, because I forever
>> need to think about this (both enqueue and dequeue).
>
> OK
>
>>
>> > @@ -3864,7 +3869,7 @@ entity_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr, int queued)
>> > * Ensure that runnable average is periodically updated.
>> > */
>> > update_load_avg(curr, UPDATE_TG);
>> > - update_cfs_shares(cfs_rq);
>> > + update_cfs_shares(curr);
>> >
>> > #ifdef CONFIG_SCHED_HRTICK
>> > /*
>> > @@ -4761,7 +4766,7 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>> > break;
>> >
>> > update_load_avg(se, UPDATE_TG);
>> > - update_cfs_shares(cfs_rq);
>> > + update_cfs_shares(se);
>> > }
>> >
>> > if (!se)
>> > @@ -4820,7 +4825,7 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>> > break;
>> >
>> > update_load_avg(se, UPDATE_TG);
>> > - update_cfs_shares(cfs_rq);
>> > + update_cfs_shares(se);
>> > }
>> >
>> > if (!se)
>>
>> This has a distinct pattern to it though; should we think about
>> something like: UPDATE_SHARES for update_load_avg() or does that confuse
>> things?
>
> IMHO, keeping update_cfs_shares separate from update_load_avg makes it
> clear when we update the shares and enables some optimizations,
> like the one for dequeue_entity
>
>>
>> > @@ -9316,7 +9321,7 @@ int sched_group_set_shares(struct task_group *tg, unsigned long shares)
>> > /* Possible calls to update_curr() need rq clock */
>> > update_rq_clock(rq);
>> > for_each_sched_entity(se)
>> > - update_cfs_shares(group_cfs_rq(se));
>> > + update_cfs_shares(se);
>>
>> Should we not also catch up with our load before we frob the shares?
>
> yes you're right, an update_load_avg is missing
>
>>
>> > raw_spin_unlock_irqrestore(&rq->lock, flags);
>> > }
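
For reference, the accounting problem described in the changelog quoted
at the top can be reproduced with a toy model (not kernel code). The
toy_* names are hypothetical, and the load is a plain weight * time sum
without the decay that the real load tracking applies; this is only a
sketch of the ordering issue.

```c
#include <assert.h>

/* Hypothetical stand-in for a group entity's load tracking state. */
struct toy_entity {
	unsigned long weight;			/* models se->load.weight */
	unsigned long long last_update;		/* time of last accounting */
	unsigned long long load_sum;		/* weight * time, no decay */
};

/* Account the time elapsed since last_update with the current weight. */
void toy_update_load(struct toy_entity *e, unsigned long long now)
{
	assert(now >= e->last_update);
	e->load_sum += (unsigned long long)e->weight * (now - e->last_update);
	e->last_update = now;
}

/* Old ordering: weight is zeroed before the last slot is accounted. */
unsigned long long toy_dequeue_reweight_first(unsigned long weight,
					      unsigned long long ran_for)
{
	struct toy_entity e = { weight, 0, 0 };

	e.weight = 0;			/* share update done too early */
	toy_update_load(&e, ran_for);	/* slot accounted as idle */
	return e.load_sum;
}

/* New ordering: the last slot is accounted, then the weight changes. */
unsigned long long toy_dequeue_update_first(unsigned long weight,
					    unsigned long long ran_for)
{
	struct toy_entity e = { weight, 0, 0 };

	toy_update_load(&e, ran_for);	/* slot accounted with real weight */
	e.weight = 0;
	return e.load_sum;
}
```

For a weight of 1024 and a run of 3 time units, the old ordering
accounts 0 and the new ordering accounts 3072, which is the running time
that was previously lost.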