Re: [PATCH] sched/fair: Do not decay new task load on first enqueue

From: Dietmar Eggemann
Date: Wed Sep 28 2016 - 07:07:02 EST

On 28/09/16 11:14, Peter Zijlstra wrote:
> On Fri, Sep 23, 2016 at 12:58:08PM +0100, Matt Fleming wrote:
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 8fb4d1942c14..4a2d3ff772f8 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -3142,7 +3142,7 @@ enqueue_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
>>  	int migrated, decayed;
>>  	migrated = !sa->last_update_time;
>> -	if (!migrated) {
>> +	if (!migrated && se->sum_exec_runtime) {
>>  		__update_load_avg(now, cpu_of(rq_of(cfs_rq)), sa,
>>  			se->on_rq * scale_load_down(se->load.weight),
>>  			cfs_rq->curr == se, NULL);
> Hrmm,.. so I see the problem, but I think we're working around it.
> So the problem is that time moves between wake_up_new_task() doing
> post_init_entity_util_avg(), which attaches us to the cfs_rq, and
> activate_task() which enqueues us.
> Part of the problem is that we do not in fact seem to do
> update_rq_clock() before post_init_entity_util_avg(), which makes the
> delta larger than it should be.

Yes, this is what I see as well. I always thought that the update is
done in task_fork_fair(), so the delta is bounded, but as I now know, that
update is only for the waker. If the cpu was idle beforehand, the delta
can be pretty big.
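
To put a rough number on what such a delta does to a freshly attached
se, here is a userspace sketch (not kernel code). It assumes the usual
PELT parameters, i.e. 1024us periods and a half-life of 32 periods. The
name decay_load_approx() is made up for illustration, and it uses a
floating-point constant where the kernel uses a fixed-point table. It
also ignores the partial-period accumulation that __update_load_avg()
does, so treat the results as approximate.

```c
#include <stdint.h>

#define PELT_PERIOD_US	1024u	/* assumed PELT period length (~1ms) */
#define PELT_HALFLIFE	32u	/* assumed periods until a contribution halves */

/*
 * Hypothetical helper: approximate how much of 'load' is left after
 * 'delta_us' microseconds have passed without any new contribution.
 * Only whole periods are counted; anything shorter decays nothing.
 */
static unsigned long decay_load_approx(unsigned long load, uint64_t delta_us)
{
	const double y = 0.978572062;	/* approx. 2^(-1/32), so y^32 ~= 0.5 */
	uint64_t periods = delta_us / PELT_PERIOD_US;
	uint64_t halvings = periods / PELT_HALFLIFE;
	uint64_t rest = periods % PELT_HALFLIFE;
	double val;

	/* After this many half-lives nothing measurable is left. */
	if (halvings >= 63)
		return 0;

	/* Full half-lives are exact halvings. */
	load >>= halvings;

	/* The remaining (fewer than 32) periods each scale by y. */
	val = (double)load;
	while (rest--)
		val *= y;

	return (unsigned long)val;
}
```

With these assumptions, decay_load_approx(1024, 1024) gives 1002, so a
single period between attach and enqueue is already enough to turn the
initial 1024 into 1002, and a stale clock on a previously idle cpu only
makes it worse.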

> The other problem is that activate_task()->enqueue_task() does do
> update_rq_clock() (again, after fixing), creating the delta.

Not sure what you mean by 'after fixing', but the se is initialized with
a possibly stale 'now' value in post_init_entity_util_avg()->
attach_entity_load_avg() before the clock is updated in
> Which suggests we do something like the below (not compile tested or
> anything, also I ran out of tea again).

I'll give it a try. Plenty of coffee here ...

> While staring at this, I don't think we can still hit
> vruntime_normalized() with a new task, so I _think_ we can remove that
> !se->sum_exec_runtime clause there (and rejoice), no?

I'm afraid that with accurate timing we will still end up in the same
situation: we add and subtract the same amount of load (probably 1024 now
rather than 1002 or less) to/from cfs_rq->runnable_load_avg for the
initial (fork) hackbench run.
After all, it's 'runnable' based.