Re: [PATCH] sched/fair: Do not decay new task load on first enqueue

From: Vincent Guittot
Date: Wed Sep 28 2016 - 09:13:25 EST


On Wednesday 28 Sep 2016 at 05:27:54 (-0700), Vincent Guittot wrote:
> On 28 September 2016 at 04:31, Dietmar Eggemann
> <dietmar.eggemann@xxxxxxx> wrote:
> > On 28/09/16 12:19, Peter Zijlstra wrote:
> >> On Wed, Sep 28, 2016 at 12:06:43PM +0100, Dietmar Eggemann wrote:
> >>> On 28/09/16 11:14, Peter Zijlstra wrote:
> >>>> On Fri, Sep 23, 2016 at 12:58:08PM +0100, Matt Fleming wrote:
> >
> > [...]
> >
> >>> Not sure what you mean by 'after fixing' but the se is initialized with
> >>> a possibly stale 'now' value in post_init_entity_util_avg()->
> >>> attach_entity_load_avg() before the clock is updated in
> >>> activate_task()->enqueue_task().
> >>
> >> I meant that after I fix the above issue of calling post_init with a
> >> stale clock. So the + update_rq_clock(rq) in the patch.
> >
> > OK.
> >
> > [...]
> >
> >>>> While staring at this, I don't think we can still hit
> >>>> vruntime_normalized() with a new task, so I _think_ we can remove that
> >>>> !se->sum_exec_runtime clause there (and rejoice), no?
> >>>
> >>> I'm afraid that with accurate timing we will get the same situation where
> >>> we add and subtract the same amount of load (probably 1024 now, and not
> >>> 1002 or less) to/from cfs_rq->runnable_load_avg for the initial (fork)
> >>> hackbench run.
> >>> After all, it's 'runnable' based.
> >>
> >> The idea was that since we now update rq clock before post_init and then
> >> leave it be, both post_init and enqueue see the exact same timestamp,
> >> and the delta is 0, resulting in no aging.
> >>
> >> Or did I fail to make that happen?
> >
> > No, but IMHO what Matt wants is ageing for the hackbench tasks at the end
> > of their fork phase, so there is a tiny amount of
> > cfs_rq->runnable_load_avg left on cpuX after the fork-related dequeue and
> > the (load-based) fork-balancer chooses cpuY for the next hackbench task.
> > That's why he wanted to avoid the __update_load_avg(se) on enqueue (thus
> > adding 1024 to cfs_rq->runnable_load_avg) and do the ageing only on
> > dequeue (removing <1024 from cfs_rq->runnable_load_avg).
>
> Wanting cfs_rq->runnable_load_avg to be non-zero when nothing is
> runnable on the cfs_rq seems a bit odd.
> We should instead take cfs_rq->avg.load_avg or
> cfs_rq->avg.util_avg into account in find_idlest_group() in this case.

IIUC the problem raised by Matt: he sees a regression because we now remove
on dequeue exactly the same load that we added on enqueue, so
cfs_rq->runnable_load_avg is zero and we may select a cfs_rq that already
has a lot of blocked hackbench threads.
Having runnable_load_avg at zero when the cfs_rq has no runnable task is
correct and we should keep that behaviour. But when we look for the idlest
group, we have to take the blocked threads into account.
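
For reference, here is a small standalone sketch (plain C, not kernel code)
of the decay arithmetic behind the 1024 vs ~1002 numbers discussed above,
assuming the usual PELT constants (y^32 = 0.5, 1024us periods): with a zero
delta between enqueue and dequeue nothing decays and we add and remove
exactly the same 1024, whereas one elapsed period leaves a small residue.

#include <math.h>
#include <stdio.h>

/*
 * Toy illustration of PELT-style decay: a contribution is scaled by y
 * per elapsed 1024us period, with y chosen so that y^32 = 0.5.
 */
int main(void)
{
	double y = pow(0.5, 1.0 / 32.0);	/* ~0.97857 */
	double load = 1024.0;			/* initial task load */

	/* Zero delta between enqueue and dequeue: nothing decays. */
	printf("0 periods: %.0f\n", load * pow(y, 0));	/* 1024 */

	/* One full period elapsed: the contribution decays to ~1002. */
	printf("1 period:  %.0f\n", load * pow(y, 1));	/* ~1002 */

	return 0;
}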

That's what I have tried to do below:


---
kernel/sched/fair.c | 30 +++++++++++++++++++++++-------
1 file changed, 23 insertions(+), 7 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 06b3c47..702915e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5353,7 +5353,8 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
 		  int this_cpu, int sd_flag)
 {
 	struct sched_group *idlest = NULL, *group = sd->groups;
-	unsigned long min_load = ULONG_MAX, this_load = 0;
+	unsigned long min_runnable_load = ULONG_MAX, this_load = 0;
+	unsigned long min_avg_load = ULONG_MAX;
 	int load_idx = sd->forkexec_idx;
 	int imbalance = 100 + (sd->imbalance_pct-100)/2;
 
@@ -5361,7 +5362,7 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
 		load_idx = sd->wake_idx;
 
 	do {
-		unsigned long load, avg_load;
+		unsigned long load, avg_load, runnable_load;
 		int local_group;
 		int i;
 
@@ -5375,6 +5376,7 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
 
 		/* Tally up the load of all CPUs in the group */
 		avg_load = 0;
+		runnable_load = 0;
 
 		for_each_cpu(i, sched_group_cpus(group)) {
 			/* Bias balancing toward cpus of our domain */
@@ -5383,21 +5385,35 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
 			else
 				load = target_load(i, load_idx);
 
-			avg_load += load;
+			runnable_load += load;
+
+			avg_load += cfs_rq_load_avg(&cpu_rq(i)->cfs);
 		}
 
 		/* Adjust by relative CPU capacity of the group */
 		avg_load = (avg_load * SCHED_CAPACITY_SCALE) / group->sgc->capacity;
+		runnable_load = (runnable_load * SCHED_CAPACITY_SCALE) / group->sgc->capacity;
 
 		if (local_group) {
-			this_load = avg_load;
-		} else if (avg_load < min_load) {
-			min_load = avg_load;
+			this_load = runnable_load;
+		} else if (runnable_load < min_runnable_load) {
+			min_runnable_load = runnable_load;
+			min_avg_load = avg_load;
+			idlest = group;
+		} else if ((runnable_load == min_runnable_load) && (avg_load < min_avg_load)) {
+			/*
+			 * In case of equal runnable load (especially zero
+			 * runnable load), select the group with the smallest
+			 * blocked load.
+			 */
+			min_avg_load = avg_load;
+			min_runnable_load = runnable_load;
 			idlest = group;
 		}
+
 	} while (group = group->next, group != sd->groups);
 
-	if (!idlest || 100*this_load < imbalance*min_load)
+	if (!idlest || 100*this_load < imbalance*min_runnable_load)
 		return NULL;
 	return idlest;
 }
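
As a quick illustration of the intended behaviour (a toy model in plain C
with made-up numbers, not kernel code): when two candidate groups both end
up with a zero runnable load after the fork-time dequeues, the one carrying
the smaller blocked load is picked.

#include <stdio.h>

/* Toy model of the selection rule above, with made-up per-group numbers. */
struct group_sample {
	const char *name;
	unsigned long runnable_load;	/* scaled runnable load of the group */
	unsigned long avg_load;		/* cfs_rq_load_avg(), includes blocked load */
};

int main(void)
{
	struct group_sample groups[] = {
		{ "groupX", 0, 800 },	/* nothing runnable, lots of blocked threads */
		{ "groupY", 0,  50 },	/* nothing runnable, little blocked load */
	};
	struct group_sample *idlest = NULL;
	unsigned long min_runnable_load = ~0UL, min_avg_load = ~0UL;
	int i;

	for (i = 0; i < 2; i++) {
		if (groups[i].runnable_load < min_runnable_load ||
		    (groups[i].runnable_load == min_runnable_load &&
		     groups[i].avg_load < min_avg_load)) {
			min_runnable_load = groups[i].runnable_load;
			min_avg_load = groups[i].avg_load;
			idlest = &groups[i];
		}
	}

	printf("idlest: %s\n", idlest->name);	/* prints "groupY" */
	return 0;
}

With these numbers the loop settles on groupY, which is what the hackbench
fork phase needs.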

