Re: [PATCH] sched: Update task group load contributions during active load-balancing

From: bsegall
Date: Wed Sep 17 2014 - 14:06:29 EST


Morten Rasmussen <morten.rasmussen@xxxxxxx> writes:

> On Tue, Sep 16, 2014 at 06:49:14PM +0100, bsegall@xxxxxxxxxx wrote:
>> Morten Rasmussen <morten.rasmussen@xxxxxxx> writes:
>>
>> > Task group load-contributions are not updated when tasks belonging to
>> > task groups are migrated by active load-balancing. If no other task
>> > belonging to the same task group is already queued at the destination
>> > cpu the group sched_entity will be enqueued with load_avg_contrib=0.
>> > Hence, weighted_cpuload() won't reflect the newly added load.
>> >
>> > The load may remain invisible until the next tick, when the sched_entity
>> > load_avg_contrib and task group contributions are reevaluated.
>> >
>> > The enqueue loop
>> >
>> > for_each_entity(se) {
>> > enqueue_entity(cfs_rq,se)
>> > ...
>> > enqueue_entity_load_avg(cfs_rq,se)
>> > ...
>> > update_entity_load_avg(se)
>> > ...
>> > __update_entity_load_avg_contrib(se)
>> > ...
>> > ...
>> > update_cfs_rq_blocked_load(cfs_rq)
>> > ...
>> > __update_cfs_rq_tg_load_contrib(cfs_rq)
>> > ...
>> > }
>> >
>> > currently skips __update_entity_load_avg_contrib() and
>> > __update_cfs_rq_tg_load_contrib() for group entities for active
>> > load-balance migrations. The former updates the sched_entity
>> > load_avg_contrib, and the latter updates the task group contribution
>> > which is needed by the former. They must both be called to ensure that
>> > load doesn't temporarily disappear.
>> >
>> > cc: Paul Turner <pjt@xxxxxxxxxx>
>> > cc: Ben Segall <bsegall@xxxxxxxxxx>
>> >
>> > Signed-off-by: Morten Rasmussen <morten.rasmussen@xxxxxxx>
>> > ---
>> > kernel/sched/fair.c | 7 ++++++-
>> > 1 file changed, 6 insertions(+), 1 deletion(-)
>> >
>> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> > index be9e97b..2b6e2eb 100644
>> > --- a/kernel/sched/fair.c
>> > +++ b/kernel/sched/fair.c
>> > @@ -2521,7 +2521,8 @@ static inline void update_entity_load_avg(struct sched_entity *se,
>> > else
>> > now = cfs_rq_clock_task(group_cfs_rq(se));
>> >
>> > - if (!__update_entity_runnable_avg(now, &se->avg, se->on_rq))
>> > + if (!__update_entity_runnable_avg(now, &se->avg, se->on_rq) &&
>> > + entity_is_task(se))
>> > return;
>> >
>> > contrib_delta = __update_entity_load_avg_contrib(se);
>> > @@ -2609,6 +2610,10 @@ static inline void enqueue_entity_load_avg(struct cfs_rq *cfs_rq,
>> > cfs_rq->runnable_load_avg += se->avg.load_avg_contrib;
>> > /* we force update consideration on load-balancer moves */
>> > update_cfs_rq_blocked_load(cfs_rq, !wakeup);
>> > +
>> > + /* We force update group contributions on load-balancer moves */
>> > + if (wakeup && !entity_is_task(se))
>> > + __update_cfs_rq_tg_load_contrib(cfs_rq, 0);
>> > }
>> >
>> > /*
>>
>> It should probably be clearer that what this actually means is that we
>> always update tg_load_contrib for any group-of-group (which is weird),
>
> Would it not only update groups entities if they aren't already on a
> runqueue? I'm clearly missing something.

No, that's correct, but it will do it on normal sleep/wake too, not just
load-balancer moves.

>
>> so that we update the entire tree on load-balancer moves (which is
>> sensible and what we care about, because it determines each se.load_avg_contrib).
>>
>> Wouldn't we need this on dequeue as well?
>
> Yes. The problem is there too I think. It is not as visible though. If
> the last task of a particular group is migrated away from a cpu, the
> group entities will be dequeued and we should be fine. However, if there
> are other tasks left in the group or any parent group the load of that
> group will be off, I think. I need to verify that.

Yeah, the case where it comes up would be different, and possibly less
bad.

>
>>
>> I believe the reason for both of these ratelimits was performance on
>> sleep/wake (where blocked_load_avg means tg_load_contrib should not have
>> changed), it would be good to know how much this hurts. Given the
>> fast-exit in __update_cfs_rq_tg_load_contrib, it seems like it might be
>> fine, but the group-entity __update_entity_load_avg_contrib is less
>> free when div_u64 is expensive, and is currently only called 1/ms.
>
> Right. Is there a suitable group scheduling benchmark I can use to test
> this?

The most severe would probably be pipe test where you moved the threads
to be in separate cgroups. A more realistic one would probably just be
looping around nanosleep(100ns) on a bunch of cpus or something like that.

>
>>
>> It seems like it would be a bit of a mess to force an update of the
>> entire path in migrate, but it would certainly dodge the performance
>> concerns if they wind up being an issue. (I suppose we could have
>> ENQUEUE/DEQUEUE_MIGRATE or something...)
>
> Yes, I had the same thought. We might have to do something like that
> anyway to fix the dequeue path. It seems less straightforward to fix,
> but I need to take a closer look.
>
> Morten
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/