[RFC] wait*() induced tasklist_lock starvation

From: David Rientjes
Date: Sun Jan 26 2014 - 18:04:38 EST

Next message: Dmitry Torokhov: "Re: [PATCH] max8925_power: Use "IS_ENABLED(CONFIG_OF)" for DT code."
Previous message: Dmitry Eremin-Solenikov: "Re: [PATCH] max8925_power: Use "IS_ENABLED(CONFIG_OF)" for DT code."
Next in thread: Oleg Nesterov: "Re: [RFC] wait*() induced tasklist_lock starvation"

Hi Oleg,

We've found that it's pretty easy to cause NMI watchdog timeouts due to
tasklist_lock starvation by repeatedly calling wait4(), waitid(), or
waitpid(). These syscalls take the readside of the lock, and cascading
calls from multiple processes will starve anything in the fork() or
exit() path that is waiting on the writeside with irqs disabled.
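
For illustration only, the kind of load described above might be sketched
in userspace as follows. This is not the testcase referred to in this
mail; hammer_wait() and its parameters are hypothetical names, and the
sketch assumes that a waitpid(-1, ...) call from a process with no
children still reaches the read_lock() in do_wait() before failing with
ECHILD.

```c
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper (not the original testcase): fork nprocs children
 * that each call waitpid() in a tight loop, then reap them.
 * Returns the number of children reaped, or -1 if a fork() failed.
 */
static int hammer_wait(int nprocs, long iterations)
{
	int forked = 0, reaped = 0;

	for (int i = 0; i < nprocs; i++) {
		pid_t pid = fork();

		if (pid < 0)
			break;
		if (pid == 0) {
			/* The child has no children, so every call fails. */
			for (long n = 0; n < iterations; n++)
				waitpid(-1, NULL, WNOHANG);
			_exit(0);
		}
		forked++;
	}

	/* Reap every child that was successfully forked. */
	while (reaped < forked && wait(NULL) > 0)
		reaped++;

	return forked == nprocs ? reaped : -1;
}
```

Whether a load like this actually trips the watchdog will depend on the
number of CPUs, the iteration count, and the kernel's rwlock
implementation; the bounded loop is only there to keep the sketch safe to
run.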

The only way I've been able to remedy this problem is by serializing the
taking of the readside of this lock with a spinlock specifically for these
syscalls. Otherwise, my testcase will panic any machine that panics on
these NMI watchdog timeouts, which ours do.

Is there a less expensive way to do this? Or is it just another case of
tasklist_lock problems that needs a major overhaul?
---
kernel/exit.c | 14 ++++++++++++++
1 file changed, 14 insertions(+)

diff --git a/kernel/exit.c b/kernel/exit.c
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -59,6 +59,14 @@
 #include <asm/pgtable.h>
 #include <asm/mmu_context.h>

+/*
+ * Ensures the wait family of syscalls -- wait4(), waitid(), and waitpid() --
+ * don't cascade taking readside of tasklist_lock which will starve processes
+ * doing fork() or exit() and cause NMI watchdog timeouts with interrupts
+ * disabled.
+ */
+static DEFINE_SPINLOCK(wait_lock);
+
 static void exit_mm(struct task_struct * tsk);

 static void __unhash_process(struct task_struct *p, bool group_dead)
@@ -1028,6 +1036,7 @@ static int wait_task_zombie(struct wait_opts *wo, struct task_struct *p)

 		get_task_struct(p);
 		read_unlock(&tasklist_lock);
+		spin_unlock(&wait_lock);
 		if ((exit_code & 0x7f) == 0) {
 			why = CLD_EXITED;
 			status = exit_code >> 8;
@@ -1112,6 +1121,7 @@ static int wait_task_zombie(struct wait_opts *wo, struct task_struct *p)
 	 * thread can reap it because we set its state to EXIT_DEAD.
 	 */
 	read_unlock(&tasklist_lock);
+	spin_unlock(&wait_lock);

 	retval = wo->wo_rusage
 		? getrusage(p, RUSAGE_BOTH, wo->wo_rusage) : 0;
@@ -1246,6 +1256,7 @@ unlock_sig:
 	pid = task_pid_vnr(p);
 	why = ptrace ? CLD_TRAPPED : CLD_STOPPED;
 	read_unlock(&tasklist_lock);
+	spin_unlock(&wait_lock);

 	if (unlikely(wo->wo_flags & WNOWAIT))
 		return wait_noreap_copyout(wo, p, pid, uid, why, exit_code);
@@ -1308,6 +1319,7 @@ static int wait_task_continued(struct wait_opts *wo, struct task_struct *p)
 	pid = task_pid_vnr(p);
 	get_task_struct(p);
 	read_unlock(&tasklist_lock);
+	spin_unlock(&wait_lock);

 	if (!wo->wo_info) {
 		retval = wo->wo_rusage
@@ -1523,6 +1535,7 @@ repeat:
 		goto notask;

 	set_current_state(TASK_INTERRUPTIBLE);
+	spin_lock(&wait_lock);
 	read_lock(&tasklist_lock);
 	tsk = current;
 	do {
@@ -1538,6 +1551,7 @@ repeat:
 			break;
 	} while_each_thread(current, tsk);
 	read_unlock(&tasklist_lock);
+	spin_unlock(&wait_lock);

 notask:
 	retval = wo->notask_error;