Re: [v4.8-rc1 Regression] sched/fair: Apply more PELT fixes

From: Joonwoo Park
Date: Tue Oct 18 2016 - 17:58:42 EST




On 10/18/2016 04:56 AM, Vincent Guittot wrote:
On Tuesday 18 Oct 2016 at 12:34:12 (+0200), Peter Zijlstra wrote:
On Tue, Oct 18, 2016 at 11:45:48AM +0200, Vincent Guittot wrote:
On 18 October 2016 at 11:07, Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
So aside from funny BIOSes, this should also show up when creating
cgroups when you have offlined a few CPUs, which is far more common I'd
think.

The problem is also that the load of the tg->se[cpu] that represents
the tg->cfs_rq[cpu] is initialized to 1024 in:
alloc_fair_sched_group()
    for_each_possible_cpu(i) {
        init_entity_runnable_average(se)
            sa->load_avg = scale_load_down(se->load.weight);

Initializing sa->load_avg to 1024 for a newly created task makes
sense, as we don't know yet what its real load will be, but I'm not sure
that we have to do the same for a se that represents a task group. This
load should be initialized to 0, and it will increase as tasks are
moved/attached into the task group.

Yes, I think that makes sense, not sure how horrible that is with the

That should not be too bad, because this initial value is only relevant for
the few dozen ms that follow the creation of the task group.


current state of things, but after your propagate patch, that
reinstates the interactivity hack that should work for sure.

The patch below fixes the issue on my platform:

Dietmar, Omer, can you confirm that this fixes the problem on your platform too?

I just noticed this thread after posting https://lkml.org/lkml/2016/10/18/719...
I noticed this bug a while ago and have had the patch for at least a week, but unfortunately didn't have time to post it.
I think Omer hit the same problem I was trying to fix, and I believe the patch I posted should address it.

Vincent, your version fixes my test case as well.
This is sched_stat output from the same test case I had in my changelog.
Note that dd-2030, which is in the root cgroup, has the same runtime as dd-2033, which is in a child cgroup.

dd (2030, #threads: 1)
-------------------------------------------------------------------
se.exec_start : 275700.024137
se.vruntime : 10589.114654
se.sum_exec_runtime : 1576.837993
se.nr_migrations : 0
nr_switches : 159
nr_voluntary_switches : 0
nr_involuntary_switches : 159
se.load.weight : 1048576
se.avg.load_sum : 48840575
se.avg.util_sum : 19741820
se.avg.load_avg : 1022
se.avg.util_avg : 413
se.avg.last_update_time : 275700024137
policy : 0
prio : 120
clock-delta : 34

dd (2033, #threads: 1)
-------------------------------------------------------------------
se.exec_start : 275710.037178
se.vruntime : 2383.802868
se.sum_exec_runtime : 1576.547591
se.nr_migrations : 0
nr_switches : 162
nr_voluntary_switches : 0
nr_involuntary_switches : 162
se.load.weight : 1048576
se.avg.load_sum : 48316646
se.avg.util_sum : 21235249
se.avg.load_avg : 1011
se.avg.util_avg : 444
se.avg.last_update_time : 275710037178
policy : 0
prio : 120
clock-delta : 36

Thanks,
Joonwoo


---
kernel/sched/fair.c | 9 ++++++++-
1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8b03fb5..89776ac 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -690,7 +690,14 @@ void init_entity_runnable_average(struct sched_entity *se)
 	 * will definitely be update (after enqueue).
 	 */
 	sa->period_contrib = 1023;
-	sa->load_avg = scale_load_down(se->load.weight);
+	/*
+	 * Tasks are initialized with full load so they are seen as heavy
+	 * tasks until they get a chance to stabilize to their real load
+	 * level. Group entities are initialized with null load to reflect
+	 * the fact that nothing has been attached to the task group yet.
+	 */
+	if (entity_is_task(se))
+		sa->load_avg = scale_load_down(se->load.weight);
 	sa->load_sum = sa->load_avg * LOAD_AVG_MAX;
 	/*
 	 * At this point, util_avg won't be used in select_task_rq_fair anyway