Re: [PATCH] sched/fair: Do not decay new task load on first enqueue

From: Peter Zijlstra
Date: Wed Sep 28 2016 - 07:19:37 EST


On Wed, Sep 28, 2016 at 12:06:43PM +0100, Dietmar Eggemann wrote:
> On 28/09/16 11:14, Peter Zijlstra wrote:
> > On Fri, Sep 23, 2016 at 12:58:08PM +0100, Matt Fleming wrote:
> >> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> >> index 8fb4d1942c14..4a2d3ff772f8 100644
> >> --- a/kernel/sched/fair.c
> >> +++ b/kernel/sched/fair.c
> >> @@ -3142,7 +3142,7 @@ enqueue_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
> >>  	int migrated, decayed;
> >>
> >>  	migrated = !sa->last_update_time;
> >> -	if (!migrated) {
> >> +	if (!migrated && se->sum_exec_runtime) {
> >>  		__update_load_avg(now, cpu_of(rq_of(cfs_rq)), sa,
> >>  			se->on_rq * scale_load_down(se->load.weight),
> >>  			cfs_rq->curr == se, NULL);
> >
> >
> > Hrmm,.. so I see the problem, but I think we're working around it.
> >
> > So the problem is that time moves between wake_up_new_task() doing
> > post_init_entity_util_avg(), which attaches us to the cfs_rq, and
> > activate_task() which enqueues us.
> >
> > Part of the problem is that we do not in fact seem to do
> > update_rq_clock() before post_init_entity_util_avg(), which makes the
> > delta larger than it should be.
>
> Yes, this is what I see as well. I always thought that the update is
> done in task_fork_fair() so it's bounded but as I know now, this update
> is only for the waker. In case the cpu was idle before, the delta can be
> pretty big.
>
> > The other problem is that activate_task()->enqueue_task() does do
> > update_rq_clock() (again, after fixing), creating the delta.
>
> Not sure what you mean by 'after fixing' but the se is initialized with
> a possibly stale 'now' value in post_init_entity_util_avg()->
> attach_entity_load_avg() before the clock is updated in
> activate_task()->enqueue_task().

I meant: after I fix the above issue of calling post_init with a
stale clock, i.e. the '+ update_rq_clock(rq)' in the patch.

> > Which suggests we do something like the below (not compile tested or
> > anything, also I ran out of tea again).
>
> I'll give it a try. Plenty of coffee here ...
>
> >
> > While staring at this, I don't think we can still hit
> > vruntime_normalized() with a new task, so I _think_ we can remove that
> > !se->sum_exec_runtime clause there (and rejoice), no?
>
> I'm afraid that with accurate timing we will get the same situation that
> we add and subtract the same amount of load (probably 1024 now and not
> 1002 (or less)) to/from cfs_rq->runnable_load_avg for the initial (fork)
> hackbench run.
> After all, it's 'runnable' based.

The idea was that since we now update rq clock before post_init and then
leave it be, both post_init and enqueue see the exact same timestamp,
and the delta is 0, resulting in no aging.

Or did I fail to make that happen?
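
To make the ordering argument concrete, here is a rough userspace sketch of
the two sequences. It is a toy model, not kernel code: every toy_* name, the
refresh_clock flag and the simplified decay (whole ~1ms periods only, with
y^32 = 0.5) are assumptions made for illustration. It only shows that
attaching with a stale clock and then refreshing the clock at enqueue ages
the new task's load, while refreshing once before the attach and leaving the
clock alone gives a delta of 0.

```c
#include <assert.h>
#include <stdint.h>

/* Toy model only: names echo the thread, but none of this is kernel code. */
#define PERIOD_NS (1024ULL * 1024ULL)	/* one decay period, roughly 1ms */

struct toy_rq {
	uint64_t hw_time;	/* "real" time, always ahead of or equal to clock */
	uint64_t clock;		/* rq clock: moves only when explicitly updated */
};

struct toy_se {
	uint64_t last_update_time;
	double load_avg;
};

/* Stands in for update_rq_clock(). */
static void toy_update_rq_clock(struct toy_rq *rq)
{
	rq->clock = rq->hw_time;
}

/* Stands in for post_init_entity_util_avg() -> attach_entity_load_avg(). */
static void toy_attach(struct toy_rq *rq, struct toy_se *se)
{
	se->last_update_time = rq->clock;
}

/* Decay by y for each whole period, with y chosen so that y^32 = 0.5. */
static double toy_decay(double load, uint64_t delta_ns)
{
	const double y = 0.97857206;
	uint64_t periods = delta_ns / PERIOD_NS;

	while (periods--)
		load *= y;
	return load;
}

/* Stands in for the aging done on the enqueue path. */
static void toy_enqueue(struct toy_rq *rq, struct toy_se *se, int refresh_clock)
{
	if (refresh_clock)
		toy_update_rq_clock(rq);
	assert(rq->clock >= se->last_update_time);
	se->load_avg = toy_decay(se->load_avg, rq->clock - se->last_update_time);
	se->last_update_time = rq->clock;
}

/*
 * Wake a new task on a CPU whose rq clock is idle_ns behind real time.
 * fixed == 0: attach with the stale clock, refresh the clock at enqueue.
 * fixed != 0: refresh the clock once before attach, then leave it be.
 * Returns the task's load_avg after the enqueue.
 */
static double toy_wake_up_new_task(uint64_t idle_ns, int fixed)
{
	struct toy_rq rq = { .clock = 1000 * PERIOD_NS };
	struct toy_se se = { .last_update_time = 0, .load_avg = 1024.0 };

	rq.hw_time = rq.clock + idle_ns;

	if (fixed)
		toy_update_rq_clock(&rq);
	toy_attach(&rq, &se);
	toy_enqueue(&rq, &se, !fixed);
	return se.load_avg;
}
```

With a CPU that was idle for 32 periods, toy_wake_up_new_task(32 * PERIOD_NS, 0)
comes back at roughly half of the initial 1024, whereas the fixed ordering
returns 1024 untouched because both steps see the same timestamp.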