Re: [PATCH?] Livelock in pick_next_task_fair() / idle_balance()

From: Rabin Vincent
Date: Wed Jul 01 2015 - 10:56:09 EST


On Wed, Jul 01, 2015 at 07:36:35AM +0200, Mike Galbraith wrote:
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5897,7 +5897,7 @@ static int detach_tasks(struct lb_env *e
> {
> struct list_head *tasks = &env->src_rq->cfs_tasks;
> struct task_struct *p;
> - unsigned long load;
> + unsigned long load, d_load = 0, s_load = env->src_rq->load.weight;
> int detached = 0;
>
> lockdep_assert_held(&env->src_rq->lock);
> @@ -5936,6 +5936,11 @@ static int detach_tasks(struct lb_env *e
>
> detached++;
> env->imbalance -= load;
> + if (!load) {
> + load = min_t(unsigned long, env->imbalance, p->se.load.weight);
> + trace_printk("%s:%d is non-contributor - count as %ld\n", p->comm, p->pid, load);
> + }
> + d_load += load;
>
> #ifdef CONFIG_PREEMPT
> /*
> @@ -5954,6 +5959,18 @@ static int detach_tasks(struct lb_env *e
> if (env->imbalance <= 0)
> break;
>
> + /*
> + * We don't want to bleed busiest_rq dry either. Weighted load
> + * and/or imbalance may be dinky, load contribution can even be
> * zero, perhaps causing us to over balance if we had not assigned
> * it above.
> + */
> + if (env->src_rq->load.weight <= env->dst_rq->load.weight + d_load) {
> + trace_printk("OINK - imbal: %ld load: %ld run: %d det: %d sload_was: %ld sload_is: %ld dload: %ld\n",
> + env->imbalance, load, env->src_rq->nr_running, detached, s_load, env->src_rq->load.weight, env->dst_rq->load.weight+d_load);
> + break;
> + }
> +
> continue;
> next:
> list_move_tail(&p->se.group_node, tasks);
>

I've tried to analyse how your patch would affect the situation in one
of the crash dumps which I have of the problem.

In this dump, cpu0 is in the middle of dequeuing all tasks from cpu1.
rcu_sched has already been detached and there are two tasks left, one of
which is being processed by dequeue_entity_load_avg(), called from
dequeue_task(), at the time the watchdog hits. The lb_env is as follows;
imbalance is, as you can see, 60.

crash> struct lb_env 8054fd50
struct lb_env {
  sd = 0x8fc13e00,
  src_rq = 0x81297200,
  src_cpu = 1,
  dst_cpu = 0,
  dst_rq = 0x8128e200,
  dst_grpmask = 0x0,
  new_dst_cpu = 0,
  idle = CPU_NEWLY_IDLE,
  imbalance = 60,
  cpus = 0x8128d238,
  flags = 0,
  loop = 2,
  loop_break = 32,
  loop_max = 3,
  fbq_type = all,
  tasks = {
    next = 0x8fc4c6ec,
    prev = 0x8fc4c6ec
  }
}

Weights of the runqueues:

crash> struct rq.load.weight runqueues:0,1
[0]: 8128e200
load.weight = 0,
[1]: 81297200
load.weight = 1935,

The only running tasks on the system are these three:

crash> foreach RU ps
   PID    PPID  CPU   TASK    ST  %MEM     VSZ    RSS  COMM
>     0      0   0  8056d8b0  RU   0.0       0      0  [swapper/0]
>     0      0   1  8fc5cd18  RU   0.0       0      0  [swapper/1]
>     0      0   2  8fc5c6b0  RU   0.0       0      0  [swapper/2]
>     0      0   3  8fc5c048  RU   0.0       0      0  [swapper/3]
      7      2   0  8fc4c690  RU   0.0       0      0  [rcu_sched]
     30      2   1  8fd26108  RU   0.0       0      0  [kswapd0]
    413      1   1  8edda408  RU   0.6    1900    416  rngd

And the load.weight and load_avg_contribs for them and their parent SEs:

crash> foreach 7 30 413 load
PID: 7 TASK: 8fc4c690 CPU: 0 COMMAND: "rcu_sched"
task_h_load(): 325 [ = (load_avg_contrib { 5} * cfs_rq->h_load { 65}) / (cfs_rq->runnable_load_avg { 0} + 1) ]
SE: 8fc4c6d8 load_avg_contrib: 5 load.weight: 1024 PARENT: 00000000 GROUPNAME: (null)

PID: 30 TASK: 8fd26108 CPU: 1 COMMAND: "kswapd0"
task_h_load(): 10 [ = (load_avg_contrib { 10} * cfs_rq->h_load { 133}) / (cfs_rq->runnable_load_avg { 128} + 1) ]
SE: 8fd26150 load_avg_contrib: 10 load.weight: 1024 PARENT: 00000000 GROUPNAME: (null)

PID: 413 TASK: 8edda408 CPU: 1 COMMAND: "rngd"
task_h_load(): 0 [ = (load_avg_contrib { 0} * cfs_rq->h_load { 0}) / (cfs_rq->runnable_load_avg { 0} + 1) ]
SE: 8edda450 load_avg_contrib: 0 load.weight: 1024 PARENT: 8fffbd00 GROUPNAME: (null)
SE: 8fffbd00 load_avg_contrib: 0 load.weight: 2 PARENT: 8f531f80 GROUPNAME: rngd@xxxxxxxxxxxxx
SE: 8f531f80 load_avg_contrib: 0 load.weight: 1024 PARENT: 8f456e00 GROUPNAME: system-rngd.slice
SE: 8f456e00 load_avg_contrib: 118 load.weight: 911 PARENT: 00000000 GROUPNAME: system.slice
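
As a cross-check of the task_h_load() figures above, here is a minimal
user-space sketch of the formula shown in the brackets. It is not the
kernel's implementation: the names sketch_task_h_load() and
sketch_check_dump_values() and the plain integer parameters are made up
for illustration, and only the arithmetic and the input values are taken
from the dump.

```c
#include <assert.h>

/*
 * User-space sketch of the formula printed in the crash output:
 *   (load_avg_contrib * cfs_rq->h_load) / (cfs_rq->runnable_load_avg + 1)
 * Hypothetical helper, not kernel code.
 */
static unsigned long sketch_task_h_load(unsigned long load_avg_contrib,
					unsigned long h_load,
					unsigned long runnable_load_avg)
{
	/* The +1 in the divisor avoids dividing by zero. */
	return (load_avg_contrib * h_load) / (runnable_load_avg + 1);
}

/* Re-derive the three values from the dump (integer division). */
static void sketch_check_dump_values(void)
{
	assert(sketch_task_h_load(5, 65, 0) == 325);	/* rcu_sched */
	assert(sketch_task_h_load(10, 133, 128) == 10);	/* kswapd0 */
	assert(sketch_task_h_load(0, 0, 0) == 0);	/* rngd */
}
```

Calling sketch_check_dump_values() from a test main() passes silently,
which matches the 325, 10 and 0 printed for the three tasks.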

Given the above, we can see that with your patch:

- dst_rq->load.weight is 0 and will not change in this loop.

- src_rq->load.weight was 1935 + 1024 before the loop. It will go down
to 1935 (already has), 1024, and then 0.

- d_load will be 325*, 335, and then 395.

(* - probably not exactly, since rcu_sched has already had set_task_rq() called
[cfs_rq switched] on it, but I guess it's actually going to be much lower based
on the other dumps I see where rcu_sched hasn't been switched yet).
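
The three steps above can be replayed with a small user-space sketch of
the new break test. This is only an illustration under my estimates: the
struct, the function name and the table are hypothetical, and the
numbers are the ones listed above (including the uncertain 325), not
values read back from a patched kernel.

```c
#include <assert.h>
#include <stddef.h>

/* State after one detach step: source runqueue weight and running d_load. */
struct sketch_step {
	unsigned long src_weight;
	unsigned long d_load;
};

/*
 * Return the index of the first step at which the patch's test
 *   src_rq->load.weight <= dst_rq->load.weight + d_load
 * is true, or n if it is never true. Hypothetical helper.
 */
static size_t sketch_first_break(const struct sketch_step *steps, size_t n,
				 unsigned long dst_weight)
{
	size_t i;

	assert(steps != NULL);
	for (i = 0; i < n; i++) {
		if (steps[i].src_weight <= dst_weight + steps[i].d_load)
			return i;
	}
	return n;
}

/* Estimated values for this dump, in the order given in the list above. */
static const struct sketch_step sketch_dump_steps[] = {
	{ 1935, 325 },	/* after rcu_sched (already detached) */
	{ 1024, 335 },	/* after the second task */
	{    0, 395 },	/* after the last task */
};
```

With dst_rq->load.weight at 0, sketch_first_break(sketch_dump_steps, 3, 0)
returns 2, i.e. the test is first true only once the source runqueue is
already empty.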

So, we will not hit the "if (env->src_rq->load.weight <=
env->dst_rq->load.weight + d_load)" condition to break out of the loop until we
have actually moved all tasks. So the patch will not have any effect on this
case. Or am I missing something?

We'll set up a test anyway with the patch; the problem usually takes a
couple of days to reproduce.

/Rabin