[ 125/127] sched: Fix ancient race in do_exit()

From: Greg Kroah-Hartman
Date: Fri Sep 28 2012 - 17:08:00 EST


3.0-stable review patch. If anyone has any objections, please let me know.

------------------

From: Yasunori Goto <y-goto@xxxxxxxxxxxxxx>

commit b5740f4b2cb3503b436925eb2242bc3d75cd3dfe upstream.

try_to_wake_up() has a problem which may change status from TASK_DEAD to
TASK_RUNNING in race condition with SMI or guest environment of virtual
machine. As a result, exited task is scheduled() again and panic occurs.

Here is the sequence how it occurs:

----------------------------------+-----------------------------
|
CPU A | CPU B
----------------------------------+-----------------------------

TASK A calls exit()....

do_exit()

exit_mm()
down_read(mm->mmap_sem);

rwsem_down_failed_common()

set TASK_UNINTERRUPTIBLE
set waiter.task <= task A
list_add to sem->wait_list
:
raw_spin_unlock_irq()
(I/O interruption occured)

__rwsem_do_wake(mmap_sem)

list_del(&waiter->list);
waiter->task = NULL
wake_up_process(task A)
try_to_wake_up()
(task is still
TASK_UNINTERRUPTIBLE)
p->on_rq is still 1.)

ttwu_do_wakeup()
(*A)
:
(I/O interruption handler finished)

if (!waiter.task)
schedule() is not called
due to waiter.task is NULL.

tsk->state = TASK_RUNNING

:
check_preempt_curr();
:
task->state = TASK_DEAD
(*B)
<--- set TASK_RUNNING (*C)

schedule()
(exit task is running again)
BUG_ON() is called!
--------------------------------------------------------

The execution time between (*A) and (*B) is usually very short,
because the interruption is disabled, and setting TASK_RUNNING at (*C)
must be executed before setting TASK_DEAD.

HOWEVER, if SMI is interrupted between (*A) and (*B),
(*C) is able to execute AFTER setting TASK_DEAD!
Then, exited task is scheduled again, and BUG_ON() is called....

If the system works on guest system of virtual machine, the time
between (*A) and (*B) may be also long due to scheduling of hypervisor,
and same phenomenon can occur.

By this patch, do_exit() waits for releasing task->pi_lock which is used
in try_to_wake_up(). It guarantees the task becomes TASK_DEAD after
waking up.

Signed-off-by: Yasunori Goto <y-goto@xxxxxxxxxxxxxx>
Acked-by: Oleg Nesterov <oleg@xxxxxxxxxx>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@xxxxxxxxx>
Cc: Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx>
Cc: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
Link: http://lkml.kernel.org/r/20120117174031.3118.E1E9C6FF@xxxxxxxxxxxxxx
Signed-off-by: Ingo Molnar <mingo@xxxxxxx>
Cc: Michal Hocko <mhocko@xxxxxxx>
Signed-off-by: Greg Kroah-Hartman <gregkh@xxxxxxxxxxxxxxxxxxx>

---
kernel/exit.c | 16 ++++++++++++++++
1 file changed, 16 insertions(+)

--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -1049,6 +1049,22 @@ NORET_TYPE void do_exit(long code)

preempt_disable();
exit_rcu();
+
+ /*
+ * The setting of TASK_RUNNING by try_to_wake_up() may be delayed
+ * when the following two conditions become true.
+ * - There is race condition of mmap_sem (It is acquired by
+ * exit_mm()), and
+ * - SMI occurs before setting TASK_RUNINNG.
+ * (or hypervisor of virtual machine switches to other guest)
+ * As a result, we may become TASK_RUNNING after becoming TASK_DEAD
+ *
+ * To avoid it, we have to wait for releasing tsk->pi_lock which
+ * is held by try_to_wake_up()
+ */
+ smp_mb();
+ raw_spin_unlock_wait(&tsk->pi_lock);
+
/* causes final put_task_struct in finish_task_switch(). */
tsk->state = TASK_DEAD;
schedule();


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/