[PATCH] sched/fair: Do not decay new task load on first enqueue
From: Matt Fleming
Date: Fri Sep 23 2016 - 07:58:42 EST
Since commit 7dc603c9028e ("sched/fair: Fix PELT integrity for new
tasks") ::last_update_time will be set to a non-zero value in
post_init_entity_util_avg(), which leads to p->se.avg.load_avg being
decayed on enqueue before the task has even had a chance to run.
For a NICE_0 task, the sequence of events leading up to this, with
example load average values, might be:
  sched_fork()
    init_entity_runnable_average()
      p->se.avg.load_avg = scale_load_down(se->load.weight); // 1024

  wake_up_new_task()
    post_init_entity_util_avg()
      attach_entity_load_avg()
        p->se.last_update_time = cfs_rq->avg.last_update_time;

    activate_task()
      enqueue_task()
        ...
        enqueue_entity_load_avg()
          migrated = !sa->last_update_time // false
          if (!migrated)
            __update_load_avg()
              p->se.avg.load_avg = 1002
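
The drop from 1024 to 1002 corresponds to roughly one PELT period of
decay: a contribution halves every 32 periods of 1024us, so each
period scales it by y = 0.5^(1/32). A standalone userspace sketch of
that arithmetic (plain C with floating point, not the kernel's
fixed-point table lookups):

    #include <math.h>
    #include <stdio.h>

    /* Illustration only: PELT's per-period decay factor y satisfies
     * y^32 = 0.5, i.e. a load contribution halves every 32 periods
     * of 1024us. */
    int main(void)
    {
            double y = pow(0.5, 1.0 / 32.0); /* ~0.9785 */
            double load_avg = 1024.0;        /* new NICE_0 task */

            /* One period elapsing between attach_entity_load_avg()
             * and enqueue_entity_load_avg() is enough: */
            printf("%.0f\n", load_avg * y);  /* prints 1002 */
            return 0;
    }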
This causes a performance regression for fork-intensive workloads like
hackbench. When balancing on fork we can end up picking the same CPU
to enqueue on over and over. This leads to huge congestion when we try
to simultaneously wake up tasks that are all on the same runqueue, and
causes lots of migrations on wake up.
The behaviour since commit 7dc603c9028e essentially defeats the
scheduler's attempt to balance on fork(). Before, ::runnable_load_avg
likely had a non-zero value when the hackbench tasks were dequeued
(the fork()'d tasks immediately block reading on a pipe/socket), but
now the load balancer sees the CPU as having no runnable load.
Arguably the real problem is that balancing on fork doesn't look at
the blocked contribution of tasks, only the runnable load, and it's
possible for the two metrics to be wildly different on a relatively
idle system.
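
As a toy sketch of that divergence (invented types, not the kernel's
actual cfs_rq), a CPU whose freshly forked tasks have all just blocked
can look completely idle to fork balancing while still carrying a
large blocked contribution:

    #include <stdio.h>

    /* Toy model, not the kernel's cfs_rq: the two metrics tracked
     * separately. */
    struct toy_rq {
            unsigned long runnable_load_avg; /* entities on the rq */
            unsigned long blocked_load_avg;  /* recently blocked */
    };

    /* Fork balancing effectively keys on runnable load alone. */
    static unsigned long fork_balance_key(const struct toy_rq *rq)
    {
            return rq->runnable_load_avg;    /* blocked load invisible */
    }

    int main(void)
    {
            /* hackbench children just forked and immediately blocked: */
            struct toy_rq busy = { .runnable_load_avg = 0,
                                   .blocked_load_avg  = 2048 };
            struct toy_rq idle = { 0, 0 };

            /* both look equally attractive to balance-on-fork */
            printf("%lu %lu\n", fork_balance_key(&busy),
                                fork_balance_key(&idle));
            return 0;
    }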
But it still doesn't seem quite right to update a task's load_avg
before it runs for the first time.
Here are the results of running hackbench before 7dc603c9028e (old
behaviour), with 7dc603c9028e applied (existing behaviour), and after
7dc603c9028e with this patch on top (new behaviour):
hackbench-process-sockets

                         4.7.0-rc5           4.7.0-rc5           4.7.0-rc5
                            before        7dc603c9028e               after
Amean    1        0.0611 (  0.00%)    0.0693 (-13.32%)    0.0600 (  1.87%)
Amean    4        0.1777 (  0.00%)    0.1730 (  2.65%)    0.1790 ( -0.72%)
Amean    7        0.2771 (  0.00%)    0.2816 ( -1.60%)    0.2741 (  1.08%)
Amean    12       0.3851 (  0.00%)    0.4167 ( -8.20%)    0.3751 (  2.60%)
Cc: Peter Zijlstra <peterz@xxxxxxxxxxxxx>
Cc: Ingo Molnar <mingo@xxxxxxxxxx>
Cc: Mike Galbraith <umgwanakikbuti@xxxxxxxxx>
Cc: Yuyang Du <yuyang.du@xxxxxxxxx>
Cc: Vincent Guittot <vincent.guittot@xxxxxxxxxx>
Cc: Dietmar Eggemann <dietmar.eggemann@xxxxxxx>
Signed-off-by: Matt Fleming <matt@xxxxxxxxxxxxxxxxxxx>
---
kernel/sched/fair.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8fb4d1942c14..4a2d3ff772f8 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3142,7 +3142,7 @@ enqueue_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
 	int migrated, decayed;
 
 	migrated = !sa->last_update_time;
-	if (!migrated) {
+	if (!migrated && se->sum_exec_runtime) {
 		__update_load_avg(now, cpu_of(rq_of(cfs_rq)), sa,
 			  se->on_rq * scale_load_down(se->load.weight),
 			  cfs_rq->curr == se, NULL);
--
2.10.0