[PATCH?] Livelock in pick_next_task_fair() / idle_balance()

From: Rabin Vincent
Date: Tue Jun 30 2015 - 10:31:42 EST


Hi,

We're seeing a livelock where two CPUs both loop with interrupts
disabled in pick_next_task_fair() / idle_balance() and continuously
fetch all tasks from each other. This has been observed on a 3.18
kernel but the relevant code paths appear to be the same in current
mainline.

The problem situation appears to be this:

cpu0 cpu1


pick_next_task
take rq0->lock
pick_next_task_fair
running=0
idle_balance()
drop rq0->lock


wakeup A,B on cpu0


pick_next_task
take rq1->lock
pick_next_task_fair
running=0
idle_balance()
drop r1->lock
load_balance()
busiest=rq0
take rq0->lock
detach A,B
drop rq0->lock
take rq1->lock
attach A,B
pulled_task = 2
drop rq1->lock

load_balance()
busiest=rq1
take rq1->lock
detach A,B
drop rq1->lock
take rq0->lock
attach A,B
pulled_task = 2
drop rq0->lock

running=0()
idle_balance()
busiest=rq0, pull A,B, etc.


running = 0
load_balance()
busiest=rq1, pull A,B, etc

And this goes on, with interrupts disabled on both CPUs, for at least a
100 ms (which is when a hardware watchdog hits).

The conditions needed, apart from the right timing, are:

- cgroups. One of the tasks, say A, needs to be in a CPU cgroup. When
the problem occurs, A's ->se has zero load_avg_contrib and
task_h_load(A) is zero. However, the se->parent->parent of A has a
(relatively) high load_avg_contrib. cpu0's cfs_rq has therefore a
relatively high runnable_load_avg. find_busiest_group() therefore
detects imbalance, and detach_tasks() detaches all tasks.

- PREEMPT=n. Otherwise, the code under #ifdef in detach_tasks() would
ensure that we'd only ever pull a maximum of one task during idle
balancing.

- cpu0 and cpu1 are threads on the same core (cpus_share_cache() ==
true). otherwise, cpu1 will not be able to wakeup tasks on cpu0
while cpu0 has interrupts disabled (since an IPI would be required).
Turning off the default TTWU_QUEUE feature would also provide the
same effect.

I see two simple ways to prevent the livelock. One is to just to remove
the #ifdef:

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 3d57cc0..74c94dc 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5937,7 +5937,6 @@ static int detach_tasks(struct lb_env *env)
detached++;
env->imbalance -= load;

-#ifdef CONFIG_PREEMPT
/*
* NEWIDLE balancing is a source of latency, so preemptible
* kernels will stop after the first task is detached to minimize
@@ -5945,7 +5944,6 @@ static int detach_tasks(struct lb_env *env)
*/
if (env->idle == CPU_NEWLY_IDLE)
break;
-#endif

/*
* We only want to steal up to the prescribed amount of

Or this should work too:

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 3d57cc0..13358cf 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7487,6 +7487,8 @@ static int idle_balance(struct rq *this_rq)
*/
if (this_rq->cfs.h_nr_running && !pulled_task)
pulled_task = 1;
+ else if (!this_rq->cfs.h_nr_running && pulled_task)
+ pulled_task = 0;

out:
/* Move the next balance forward */

But it seems that the real problem is that detach_tasks() can remove all
tasks, when cgroups are involved as described above?

Thanks.

/Rabin
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/