[PATCH 3.4 42/92] sched/core: Fix TASK_DEAD race in finish_task_switch()
Date: Mon Apr 18 2016 - 06:49:30 EST
From: Peter Zijlstra <peterz@xxxxxxxxxxxxx>
3.4.112-rc1 review patch. If anyone has any objections, please let me know.
commit 95913d97914f44db2b81271c2e2ebd4d2ac2df83 upstream.
So the problem this patch is trying to address is as follows:
	CPU0				CPU1

	context_switch(A, B)
					ttwu(A)
					  LOCK A->pi_lock
					  A->on_cpu == 0
	finish_task_switch(A)
	  prev_state = A->state  <-.
	  WMB                      |
	  A->on_cpu = 0;           |
	  UNLOCK rq0->lock         |
				   |	context_switch(C, A)
				   `--	A->state = TASK_DEAD
	  prev_state == TASK_DEAD
	    put_task_struct(A)
					context_switch(A, C)
					finish_task_switch(A)
					  A->state == TASK_DEAD
					    put_task_struct(A)
The argument being that the WMB will allow the load of A->state on CPU0
to cross over and observe CPU1's store of A->state, which will then
result in a double-drop and use-after-free.
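To make the "double-drop" concrete, here is a minimal, hypothetical refcount sketch (plain C; task_alloc()/put_task() are illustrative stand-ins, not the kernel's get_task_struct()/put_task_struct()). If the race lets both CPUs conclude the task is TASK_DEAD, both drop the same final reference:

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical, simplified stand-in for task_struct. */
struct task {
	int usage;	/* reference count; the last put frees the task */
	int state;
};

static struct task *task_alloc(void)
{
	struct task *t = calloc(1, sizeof(*t));
	assert(t);
	t->usage = 1;			/* one reference for the scheduler */
	return t;
}

/* Drop one reference; returns 1 if this drop freed the task. */
static int put_task(struct task *t)
{
	assert(t->usage > 0);		/* a second drop of the final ref
					 * would land here: the freed memory
					 * is touched again (use-after-free) */
	if (--t->usage == 0) {
		free(t);
		return 1;
	}
	return 0;
}
```

In the race above, finish_task_switch() on both CPUs observes TASK_DEAD and both perform the put: the first frees the task, the second operates on freed memory.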
Now the comment states (and this was true once upon a long time ago)
that we need to observe A->state while holding rq->lock because that
will order us against the wakeup; however the wakeup will not in fact
acquire (that) rq->lock; it takes A->pi_lock these days.
We can obviously fix this by upgrading the WMB to an MB, but that is
expensive, so we'd rather avoid that.
The alternative this patch takes is: smp_store_release(&A->on_cpu, 0),
which avoids the MB on some archs, but not important ones like ARM.
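As a hedged illustration of the ordering the fix relies on, here is a minimal C11 sketch using plain stdatomic as a stand-in for the kernel primitives (finish_task_switch()/wake_up_task() below are simplified hypothetical models, not the real functions): the dying side's state load is ordered before the release store that clears on_cpu, and the waker does nothing until an acquire load sees on_cpu == 0.

```c
#include <stdatomic.h>

/* Hypothetical, simplified stand-ins for the relevant task_struct
 * fields; plain C11 atomics, not the kernel's barriers. */
struct task {
	int state;		/* written by whoever "owns" the task */
	atomic_int on_cpu;	/* 1 while still running on its old CPU */
};

enum { TASK_RUNNING = 0, TASK_DEAD = 64 };

/* Old-CPU side: the state load is ordered *before* the releasing
 * store that clears on_cpu, so it cannot observe a TASK_DEAD that a
 * waker writes only after seeing on_cpu == 0. */
static int finish_task_switch(struct task *prev)
{
	int prev_state = prev->state;
	atomic_store_explicit(&prev->on_cpu, 0, memory_order_release);
	return prev_state;	/* caller drops the reference if TASK_DEAD */
}

/* Waker side: the acquire load pairs with the release store above;
 * only after it reads on_cpu == 0 may the task run (and die) on
 * another CPU. */
static void wake_up_task(struct task *t, int new_state)
{
	while (atomic_load_explicit(&t->on_cpu, memory_order_acquire))
		;	/* spin until the old CPU lets go */
	t->state = new_state;
}
```

Under this pairing, the interleaving in the diagram, where CPU0's prev_state load returns a TASK_DEAD that CPU1 wrote only after seeing on_cpu == 0, is forbidden; this is what smp_store_release() (or, in this backport, a full smp_mb() before the plain store) provides and what smp_wmb() alone does not, since a write barrier does not order the earlier state *load* against the later on_cpu store.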
Reported-by: Oleg Nesterov <oleg@xxxxxxxxxx>
Signed-off-by: Peter Zijlstra (Intel) <peterz@xxxxxxxxxxxxx>
Acked-by: Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx>
Cc: Peter Zijlstra <peterz@xxxxxxxxxxxxx>
Cc: Thomas Gleixner <tglx@xxxxxxxxxxxxx>
Fixes: e4a52bcb9a18 ("sched: Remove rq->lock from the first half of ttwu()")
Signed-off-by: Ingo Molnar <mingo@xxxxxxxxxx>
[lizf: Backported to 3.4: use smp_mb() instead of smp_store_release(), which
is not defined in 3.4.y]
Signed-off-by: Zefan Li <lizefan@xxxxxxxxxx>
kernel/sched/core.c | 10 +++++-----
kernel/sched/sched.h | 4 +++-
2 files changed, 8 insertions(+), 6 deletions(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 15be435..609a226 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1949,11 +1949,11 @@ static void finish_task_switch(struct rq *rq, struct task_struct *prev)
 	 * If a task dies, then it sets TASK_DEAD in tsk->state and calls
 	 * schedule one last time. The schedule call will never return, and
 	 * the scheduled task must drop that reference.
-	 * The test for TASK_DEAD must occur while the runqueue locks are
-	 * still held, otherwise prev could be scheduled on another cpu, die
-	 * there before we look at prev->state, and then the reference would
-	 * be dropped twice.
-	 *		Manfred Spraul <manfred@xxxxxxxxxxxxxxxx>
+	 *
+	 * We must observe prev->state before clearing prev->on_cpu (in
+	 * finish_lock_switch), otherwise a concurrent wakeup can get prev
+	 * running on another CPU and we could observe its RUNNING -> DEAD
+	 * transition, resulting in a double drop.
 	 */
 	prev_state = prev->state;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 4a5e739..44f4058 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -702,8 +702,10 @@ static inline void finish_lock_switch(struct rq *rq, struct task_struct *prev)
 	 * After ->on_cpu is cleared, the task can be moved to a different CPU.
 	 * We must ensure this doesn't happen until the switch is completely
 	 * finished.
+	 *
+	 * Pairs with the control dependency and rmb in try_to_wake_up().
 	 */
-	smp_wmb();
+	smp_mb();
 	prev->on_cpu = 0;