Re: [rfc patch] sched/fair: Use instantaneous load for fork/exec balancing

From: Dietmar Eggemann
Date: Mon Jul 11 2016 - 04:58:59 EST


On 04/07/16 16:04, Matt Fleming wrote:
> On Wed, 15 Jun, at 04:32:58PM, Dietmar Eggemann wrote:
>> On 14/06/16 17:40, Mike Galbraith wrote:
>>> On Tue, 2016-06-14 at 15:14 +0100, Dietmar Eggemann wrote:
>>>
>>>> IMHO, the hackbench performance "boost" w/o 0905f04eb21f is due to the
>>>> fact that a new task gets all its load decayed (making it a small task)
>>>> in the __update_load_avg() call in remove_entity_load_avg() because its
>>>> se->avg.last_update_time value is 0 which creates a huge time difference
>>>> comparing it to cfs_rq->avg.last_update_time. The patch 0905f04eb21f
>>>> avoids this and thus the task stays big (se->avg.load_avg = 1024).
>>>
>>> I don't care much at all about the hackbench "regression" in its own
>>> right, and what causes it, for me, bottom line is that there are cases
>>> where we need to be able to resolve, and can't, simply because we're
>>> looking at a fuzzy (rippling) reflection.
>>
>> Understood. I just thought it would be nice to know why 0905f04eb21f
>> makes this problem even more visible. But so far I wasn't able to figure
>> out why this diff in se->avg.load_avg [1024 versus 0] has this effect on
>> cfs_rq->runnable_load_avg making it even less suitable in find idlest*.
>> enqueue_entity_load_avg()'s cfs_rq->runnable_load_* += sa->load_* looks
>> suspicious though.
>
> In my testing without 0905f04eb21f I saw that se->avg.load_avg
> actually managed to skip being decayed at all before the task was
> dequeued, which meant that cfs_rq->runnable_load_avg was more likely
> to be zero after dequeue, for those workloads like hackbench that
> essentially are just a fork bomb.

Do you mean the first dequeue when the task is forked?

These are the PELT-related functions which are called when the task is
forked:

detach_entity_load_avg
attach_entity_load_avg
remove_entity_load_avg  <-- se->avg.load_avg is set to 0 w/o 0905f04eb21f,
                            stays 1024 w/ 0905f04eb21f (see the model below)
enqueue_entity_load_avg
attach_entity_load_avg (double attach is fixed on tip/sched/core)
dequeue_entity_load_avg
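
To make the w/o 0905f04eb21f case a little more concrete:
remove_entity_load_avg() ends up calling __update_load_avg() with a
delta of cfs_rq->avg.last_update_time minus a se->avg.last_update_time
of 0, i.e. the whole lifetime of the cfs_rq. The toy userspace model
below (not kernel code; the 5s cfs_rq age is an arbitrary assumption
and pow() only approximates the kernel's fixed-point decay table) shows
why that flattens the freshly forked task's load_avg of 1024 to ~0,
whereas the special-casing of se->avg.last_update_time == 0 in
0905f04eb21f leaves it at 1024:

#include <stdio.h>
#include <math.h>

/* PELT halves a contribution every 32 periods of 1024us */
static unsigned long decay_load(unsigned long val, unsigned long periods)
{
	return (unsigned long)(val * pow(0.5, periods / 32.0));
}

int main(void)
{
	unsigned long load_avg = 1024;			/* new task starts "big" */
	unsigned long long se_last_update_time = 0;	/* never updated yet */
	unsigned long long cfs_rq_last_update_time = 5000000000ULL; /* assume 5s, in ns */

	/* ns -> ~us (>>10, as the kernel does), then 1024us periods */
	unsigned long long delta_us = (cfs_rq_last_update_time - se_last_update_time) >> 10;
	unsigned long periods = delta_us >> 10;

	printf("w/o 0905f04eb21f: decay over %lu periods -> load_avg = %lu\n",
	       periods, decay_load(load_avg, periods));
	printf("w/  0905f04eb21f: last_update_time == 0 bails out -> load_avg = %lu\n",
	       load_avg);
	return 0;
}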

> se->avg.load_avg evaded decay because se->avg.period_contrib was being
> zero'd in __update_load_avg().

I don't see the relation to se->avg.period_contrib here. IMHO,
se->avg.period_contrib is purely there to manage the 3 different update
phases in __update_load_avg().
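
Very much simplified (no weight/frequency scaling, no util handling,
and reusing decay_load() from the model above), the structure I mean
looks roughly like this:

struct toy_avg {
	unsigned long load_sum;
	unsigned int period_contrib;	/* us accrued in the current 1024us period */
};

static void toy_update_load_avg(struct toy_avg *sa, unsigned long delta_us,
				int runnable)
{
	unsigned long delta_w = sa->period_contrib;

	if (delta_us + delta_w >= 1024) {
		/* phase 1: complete the partially accrued period */
		sa->period_contrib = 0;		/* the zeroing Matt refers to */
		delta_w = 1024 - delta_w;
		if (runnable)
			sa->load_sum += delta_w;
		delta_us -= delta_w;

		/* phase 2: decay across the full periods that have elapsed */
		unsigned long periods = delta_us / 1024;
		delta_us %= 1024;
		sa->load_sum = decay_load(sa->load_sum, periods + 1);
		/* (the kernel also adds the contribution of those full periods) */
	}

	/* phase 3: the remainder accrues against the newly started period */
	if (runnable)
		sa->load_sum += delta_us;
	sa->period_contrib += delta_us;
}

I.e. period_contrib only records how far into the current 1024us period
we are between two updates.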

This difference in the initial se->avg.load_avg value [0 or 1024] has an
influence on wake_affine() [weight = p->se.avg.load_avg;] for the wakeup
handling of the hackbench tasks in the 'send/receive data' phase.
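
To illustrate what I mean, here is a heavily stripped-down model of that
comparison (no cgroup effective_load(), no CPU capacity scaling, no
sync-wakeup handling; the inputs below are made up), where weight is
p->se.avg.load_avg:

/* returns 1 if the wakeup is treated as affine (pull to the waker's CPU) */
static int toy_wake_affine(unsigned long this_load, unsigned long prev_load,
			   unsigned long weight, unsigned int imbalance_pct)
{
	/* bias the waker's CPU with the waking task's load ... */
	unsigned long long this_eff_load = 100ULL * (this_load + weight);
	/* ... and give the previous CPU the imbalance_pct headroom */
	unsigned long long prev_eff_load =
		(100ULL + (imbalance_pct - 100) / 2) * prev_load;

	return this_eff_load <= prev_eff_load;
}

E.g. with this_load = prev_load = 512 and imbalance_pct = 125, weight = 0
gives 51200 <= 57344 (affine wakeup) while weight = 1024 gives
153600 > 57344 (no affine wakeup), so whether the task comes out of fork
with load_avg 0 or 1024 can flip where its wakeups land.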

There are a couple of patches on tip/sched/core which might change this
behaviour: the fork path, no double attach_entity_load_avg for a new
task, no remove_entity_load_avg for a new task, changes in effective_load ...

> With 0905f04eb21f applied, it's less likely (though not impossible)
> that ->period_contrib will be zero'd and so we usually end up with
> some residual load in cfs_rq->runnable_load_avg on dequeue, and hence,
>
> cfs_rq->runnable_load_avg > se->avg.load_avg
>
> even if 'se' is the only task on the runqueue.
>
> FYI, below is my quick and dirty hack that restored hackbench
> performance for the few machines I checked. I didn't try schbench with
> it.
>
> ---
>
> From 4e9856ea3dc56e356195ca035dab7302754ce59b Mon Sep 17 00:00:00 2001
> From: Matt Fleming <matt@xxxxxxxxxxxxxxxxxxx>
> Date: Thu, 9 Jun 2016 19:48:14 +0100
> Subject: [PATCH] sched/fair: Reset ::runnable_load_avg when dequeueing last
> entity
>
> The task and runqueue load averages maintained in p->se.avg.load_avg
> and cfs_rq->runnable_load_avg respectively, can decay at different
> wall clock rates, which means that enqueueing and then dequeueing a
> task on an otherwise empty runqueue doesn't always leave
> ::runnable_load_avg with its initial value.
>
> This can lead to the situation where cfs_rq->runnable_load_avg has a
> non-zero value even though there are no runnable entities on the
> runqueue. Assuming no entity is enqueued on this runqueue for some
> time this residual load average will decay gradually as the load
> averages are updated.
>
> But we can optimise the special case of dequeueing the last entity and
> reset ::runnable_load_avg early, which gives a performance improvement
> to workloads that trigger the load balancer, such as fork-heavy
> applications when SD_BALANCE_FORK is set, because it gives a more up
> to date view of how busy the cpu is.
>
> Signed-off-by: Matt Fleming <matt@xxxxxxxxxxxxxxxxxxx>
> ---
> kernel/sched/fair.c | 14 ++++++++++++--
> 1 file changed, 12 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index c6dd8bab010c..408ee90c7ea8 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -3007,10 +3007,20 @@ enqueue_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
>  static inline void
>  dequeue_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
>  {
> +	unsigned long load_avg = 0;
> +
>  	update_load_avg(se, 1);
>
> -	cfs_rq->runnable_load_avg =
> -		max_t(long, cfs_rq->runnable_load_avg - se->avg.load_avg, 0);
> +	/*
> +	 * If we're about to dequeue the last runnable entity we can
> +	 * reset the runnable load average to zero instead of waiting
> +	 * for it to decay naturally. This gives the load balancer a
> +	 * more timely and accurate view of how busy this cpu is.
> +	 */
> +	if (cfs_rq->nr_running > 1)
> +		load_avg = max_t(long, cfs_rq->runnable_load_avg - se->avg.load_avg, 0);
> +
> +	cfs_rq->runnable_load_avg = load_avg;
>  	cfs_rq->runnable_load_sum =
>  		max_t(s64, cfs_rq->runnable_load_sum - se->avg.load_sum, 0);
>  }
>