Re: [patch 1/2] sched: check for prev_cpu == this_cpu in wake_affine()

From: Mike Galbraith
Date: Fri Mar 05 2010 - 14:36:44 EST

On Fri, 2010-03-05 at 10:39 -0800, Suresh Siddha wrote:
> On a single cpu system with SMT, in the scenario of one SMT thread being
> idle with another SMT thread running a task and doing a non sync wakeup of
> another task, we see (from the traces) that the woken up task ends up running
> on the busy thread instead of the idle thread. Idle balancing that comes
> in a little bit later fixes the scenario.

Yup, wake_affine() fails for non sync wakeup when 1 task is running.
That's annoying, but making it succeed globally worries me. We need a
high quality hint, and avg_overlap ain't it unfortunately, because to
get accurate overlap info cross cpu, you have to double the clock and
update_curr() overhead. We need dirt cheap.

> But fixing this wake balance and running the woken up task directly on the
> idle SMT thread improved the performance (phoronix 7zip compression workload)
> by ~9% on an atom platform.

So there is profit to be had.

> During the process wakeup, select_task_rq_fair() and wake_affine() make
> the decision to wake up the task either on the previous cpu that the task
> ran on or on the cpu where the task is currently being woken up.
> select_task_rq_fair() also goes through to see if there are any idle siblings
> for the cpu that the task is woken up on. This is to ensure that we select
> any idle sibling rather than choose a busy cpu.

Yeah, but with the 1 task + non-sync wakeup scenario, we miss the boat
because select_idle_sibling() uses wake_affine() success as its
enabler. I did that because I couldn't think up something else which
did not harm multiple buddy pairs. You can globally say sibling is
idle, go for it, but that _does_ cause throughput loss during ramp up.

Best alternative I've found is to only check for an idle sibling/cache
when there is exactly one task on the current cpu (ie put some faith in
load balancing), then force idle sibling selection. Also not optimal.
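
To make the two gating policies above concrete, here is a rough user-space
sketch. It is a toy model only, not kernel code: struct toy_cpu and both
pick_*() helpers are hypothetical names invented for illustration, and
"stay on this_cpu" stands in for whatever the rest of the wakeup path would
choose.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical toy model of a cpu; none of this is kernel API. */
struct toy_cpu {
	int nr_running;	/* runnable tasks on this cpu */
	int sibling;	/* index of the SMT sibling, -1 if none */
};

/*
 * Policy 1: the idle sibling is only considered when wake_affine()
 * succeeded (passed in here as affine_ok).
 */
static int pick_gated_on_affine(const struct toy_cpu *cpus, int this_cpu,
				bool affine_ok)
{
	int sib;

	assert(this_cpu >= 0);
	sib = cpus[this_cpu].sibling;
	if (affine_ok && sib >= 0 && cpus[sib].nr_running == 0)
		return sib;
	return this_cpu;
}

/*
 * Policy 2: ignore wake_affine() and force idle sibling selection, but
 * only when exactly one task is running on the waking cpu.
 */
static int pick_gated_on_one_task(const struct toy_cpu *cpus, int this_cpu)
{
	int sib;

	assert(this_cpu >= 0);
	sib = cpus[this_cpu].sibling;
	if (cpus[this_cpu].nr_running == 1 && sib >= 0 &&
	    cpus[sib].nr_running == 0)
		return sib;
	return this_cpu;
}
```

With cpu 0 running one task and its sibling cpu 1 idle, policy 1 stays on
cpu 0 whenever wake_affine() fails (the non-sync case under discussion),
while policy 2 moves the wakeup to cpu 1. With two tasks on cpu 0, policy 2
leaves the decision to load balancing.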

> In the above load scenario, it so happens that the prev_cpu (that the
> task ran before) and this_cpu (where it is woken up currently) are the same. And
> in this case, it looks like wake_affine() returns 0, so we ultimately do not select
> the idle sibling chosen by select_idle_sibling() in select_task_rq_fair().
> Further down the path of select_task_rq_fair(), we ultimately select
> the currently running cpu (busy SMT thread instead of the idle SMT thread).
> Check for prev_cpu == this_cpu in wake_affine(); there is no need to do
> any fancy stuff (and ultimately make wrong decisions) in this case.

I have a slightly different patch for that in my tree. There's no need
to even call wake_affine() since the result is meaningless.

kernel/sched_fair.c | 10 ++++++++--
1 file changed, 8 insertions(+), 2 deletions(-)

Index: linux-2.6.34.git/kernel/sched_fair.c
--- linux-2.6.34.git.orig/kernel/sched_fair.c
+++ linux-2.6.34.git/kernel/sched_fair.c
@@ -1547,8 +1547,14 @@ static int select_task_rq_fair(struct ta

-	if (affine_sd && wake_affine(affine_sd, p, sync))
-		return cpu;
+	if (affine_sd) {
+		if (cpu == prev_cpu)
+			return cpu;
+		if (wake_affine(affine_sd, p, sync))
+			return cpu;
+		if (!(affine_sd->flags & SD_BALANCE_WAKE))
+			return prev_cpu;
+	}

 	while (sd) {
 		int load_idx = sd->forkexec_idx;
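
For anyone who wants to poke at the branch order outside the kernel, the
hunk above can be modelled as a small stand-alone function. This is only a
sketch under stated assumptions: wake_fast_path(), FALL_THROUGH and the
boolean parameters are made up for illustration, the SD_BALANCE_WAKE value
is arbitrary, and wake_affine() is reduced to a precomputed result.

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-in for the one sched-domain flag the hunk tests; value is arbitrary. */
#define SD_BALANCE_WAKE	0x1

/* Hypothetical marker: no decision made, the while (sd) loop would run. */
#define FALL_THROUGH	(-1)

/*
 * Model of the patched fast path.
 *   has_affine_sd : an affine domain was found (affine_sd != NULL)
 *   sd_flags      : stands in for affine_sd->flags
 *   affine_ok     : what wake_affine() would have returned
 */
static int wake_fast_path(int cpu, int prev_cpu, bool has_affine_sd,
			  unsigned int sd_flags, bool affine_ok)
{
	assert(cpu >= 0 && prev_cpu >= 0);

	if (!has_affine_sd)
		return FALL_THROUGH;
	/* Same cpu: wake_affine() has nothing to decide, so skip it. */
	if (cpu == prev_cpu)
		return cpu;
	if (affine_ok)
		return cpu;
	/* Affine wakeup refused and no wake balancing: stay on prev_cpu. */
	if (!(sd_flags & SD_BALANCE_WAKE))
		return prev_cpu;
	return FALL_THROUGH;
}
```

In this model the cpu == prev_cpu case returns before affine_ok is ever
consulted, which is the point of the change: wake_affine() is not called
when its answer would be meaningless.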
