Re: Bisected rcu hang (kernel/sched.c): was 2.6.33rc4 RCU hang mmspin_lock deadlock(?) after running libvirtd - reproducible.

From: Michael Breuer
Date: Sun Jan 24 2010 - 01:32:38 EST

Next message: AmÃrico Wang: "Re: [2.6.33-rc5 regression] NULL pointer dereference invlan_skb_recv - probably introduced by commit9793241fe92f7d9303fb221e43fc598eb065f267"
Previous message: Jaswinder Singh Rajput: "Re: [lm-sensors] Status of Andigilog asc7621 driver submitted by George Joseph on 2008-05-29"
In reply to: Mike Galbraith: "Re: Bisected rcu hang (kernel/sched.c): was 2.6.33rc4 RCU hang mmspin_lock deadlock(?) after running libvirtd - reproducible."
Next in thread: Mike Galbraith: "Re: Bisected rcu hang (kernel/sched.c): was 2.6.33rc4 RCU hang mmspin_lock deadlock(?) after running libvirtd - reproducible."
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On 1/24/2010 12:59 AM, Mike Galbraith wrote:

On Sat, 2010-01-23 at 21:49 -0500, Michael Breuer wrote:

On 01/13/2010 01:43 PM, Michael Breuer wrote:

I can now recreate this simply by "service start libvirtd" on an F12
box. My earlier report that suggested this had something to do with
the sky2 driver was incorrect. Interestingly, it's always CPU1
whenever I start libvirtd.
Attaching two of the traces (I've got about ten, but they're all
pretty much the same). Looks pretty consistent - libvirtd in CPU1 is
hung forking. Not sure why yet - perhaps someone who knows this better
than I can jump in.
Summary of hang appears to be libvirtd forks - two threads show with
same pid deadlocked on a spin_lock

Then if looking at the stack traces doesn't locate the offending loop,
bisection might help.

It would, however it's going to be really difficult as I wasn't able
to get this far with rc1& rc2 :(

Thanx, Paul

I was finally able to bisect this to commit:
3802290628348674985d14914f9bfee7b9084548 (see below)

I suspect something went wrong during bisection, however...

Jan 13 12:59:25 mail kernel: [<ffffffff810447be>] ? set_cpus_allowed_ptr+0x22/0x14b
Jan 13 12:59:25 mail kernel: [<ffffffff810f2a7b>] ? spin_lock+0xe/0x10
Jan 13 12:59:25 mail kernel: [<ffffffff81087d79>] cpuset_attach_task+0x27/0x9b
Jan 13 12:59:25 mail kernel: [<ffffffff81087e77>] cpuset_attach+0x8a/0x133
Jan 13 12:59:25 mail kernel: [<ffffffff81042d2c>] ? sched_move_task+0x104/0x110
Jan 13 12:59:25 mail kernel: [<ffffffff81085dbd>] cgroup_attach_task+0x4e1/0x53f
Jan 13 12:59:25 mail kernel: [<ffffffff81084f48>] ? cgroup_populate_dir+0x77/0xff
Jan 13 12:59:25 mail kernel: [<ffffffff81086073>] cgroup_clone+0x258/0x2ac
Jan 13 12:59:25 mail kernel: [<ffffffff81088d04>] ns_cgroup_clone+0x58/0x75
Jan 13 12:59:25 mail kernel: [<ffffffff81048f3d>] copy_process+0xcef/0x13af
Jan 13 12:59:25 mail kernel: [<ffffffff810d963c>] ? handle_mm_fault+0x355/0x7ff
Jan 13 12:59:25 mail kernel: [<ffffffff81049768>] do_fork+0x16b/0x309
Jan 13 12:59:25 mail kernel: [<ffffffff81252ab2>] ? __up_read+0x8e/0x97
Jan 13 12:59:25 mail kernel: [<ffffffff81068c92>] ? up_read+0xe/0x10
Jan 13 12:59:25 mail kernel: [<ffffffff8145a779>] ? do_page_fault+0x280/0x2cc
Jan 13 12:59:25 mail kernel: [<ffffffff81010f2e>] sys_clone+0x28/0x2a
Jan 13 12:59:25 mail kernel: [<ffffffff81009f33>] stub_clone+0x13/0x20
Jan 13 12:59:25 mail kernel: [<ffffffff81009bf2>] ? system_call_fastpath+0x16/0x1b

...that looks like a bug which has already been fixed in tip, but not
yet propagated. Your trace looks like relax forever scenario.

commit fabf318e5e4bda0aca2b0d617b191884fda62703
Author: Peter Zijlstra<a.p.zijlstra@xxxxxxxxx>
Date: Thu Jan 21 21:04:57 2010 +0100

sched: Fix fork vs hotplug vs cpuset namespaces

There are a number of issues:

1) TASK_WAKING vs cgroup_clone (cpusets)

copy_process():

sched_fork()
child->state = TASK_WAKING; /* waiting for wake_up_new_task() */
if (current->nsproxy != p->nsproxy)
ns_cgroup_clone()
cgroup_clone()
mutex_lock(inode->i_mutex)
mutex_lock(cgroup_mutex)
cgroup_attach_task()
ss->can_attach()
ss->attach() [ -> cpuset_attach() ]
cpuset_attach_task()
set_cpus_allowed_ptr();
while (child->state == TASK_WAKING)
cpu_relax();
will deadlock the system.

2) cgroup_clone (cpusets) vs copy_process

So even if the above would work we still have:

copy_process():

if (current->nsproxy != p->nsproxy)
ns_cgroup_clone()
cgroup_clone()
mutex_lock(inode->i_mutex)
mutex_lock(cgroup_mutex)
cgroup_attach_task()
ss->can_attach()
ss->attach() [ -> cpuset_attach() ]
cpuset_attach_task()
set_cpus_allowed_ptr();
...

p->cpus_allowed = current->cpus_allowed

over-writing the modified cpus_allowed.

3) fork() vs hotplug

if we unplug the child's cpu after the sanity check when the child
gets attached to the task_list but before wake_up_new_task() shit
will meet with fan.

Solve all these issues by moving fork cpu selection into
wake_up_new_task().

Reported-by: Serge E. Hallyn<serue@xxxxxxxxxx>
Tested-by: Serge E. Hallyn<serue@xxxxxxxxxx>
Signed-off-by: Peter Zijlstra<a.p.zijlstra@xxxxxxxxx>
LKML-Reference:<1264106190.4283.1314.camel@laptop>
Signed-off-by: Thomas Gleixner<tglx@xxxxxxxxxxxxx>

diff --git a/kernel/fork.c b/kernel/fork.c
index 5b2959b..f88bd98 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1241,21 +1241,6 @@ static struct task_struct *copy_process(unsigned long clone_flags,
/* Need tasklist lock for parent etc handling! */
write_lock_irq(&tasklist_lock);

- /*
- * The task hasn't been attached yet, so its cpus_allowed mask will
- * not be changed, nor will its assigned CPU.
- *
- * The cpus_allowed mask of the parent may have changed after it was
- * copied first time - so re-copy it here, then check the child's CPU
- * to ensure it is on a valid CPU (and if not, just force it back to
- * parent's CPU). This avoids alot of nasty races.
- */
- p->cpus_allowed = current->cpus_allowed;
- p->rt.nr_cpus_allowed = current->rt.nr_cpus_allowed;
- if (unlikely(!cpu_isset(task_cpu(p), p->cpus_allowed) ||
- !cpu_online(task_cpu(p))))
- set_task_cpu(p, smp_processor_id());
-
/* CLONE_PARENT re-uses the old parent */
if (clone_flags& (CLONE_PARENT|CLONE_THREAD)) {
p->real_parent = current->real_parent;
diff --git a/kernel/sched.c b/kernel/sched.c
index 4508fe7..3a8fb30 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -2320,14 +2320,12 @@ static int select_fallback_rq(int cpu, struct task_struct *p)
}

/*
- * Called from:
+ * Gets called from 3 sites (exec, fork, wakeup), since it is called without
+ * holding rq->lock we need to ensure ->cpus_allowed is stable, this is done
+ * by:
*
- * - fork, @p is stable because it isn't on the tasklist yet
- *
- * - exec, @p is unstable, retry loop
- *
- * - wake-up, we serialize ->cpus_allowed against TASK_WAKING so
- * we should be good.
+ * exec: is unstable, retry loop
+ * fork& wake-up: serialize ->cpus_allowed against TASK_WAKING
*/
static inline
int select_task_rq(struct task_struct *p, int sd_flags, int wake_flags)
@@ -2620,9 +2618,6 @@ void sched_fork(struct task_struct *p, int clone_flags)
if (p->sched_class->task_fork)
p->sched_class->task_fork(p);

-#ifdef CONFIG_SMP
- cpu = select_task_rq(p, SD_BALANCE_FORK, 0);
-#endif
set_task_cpu(p, cpu);

#if defined(CONFIG_SCHEDSTATS) || defined(CONFIG_TASK_DELAY_ACCT)
@@ -2652,6 +2647,21 @@ void wake_up_new_task(struct task_struct *p, unsigned long clone_flags)
{
unsigned long flags;
struct rq *rq;
+ int cpu = get_cpu();
+
+#ifdef CONFIG_SMP
+ /*
+ * Fork balancing, do it here and not earlier because:
+ * - cpus_allowed can change in the fork path
+ * - any previously selected cpu might disappear through hotplug
+ *
+ * We still have TASK_WAKING but PF_STARTING is gone now, meaning
+ * ->cpus_allowed is stable, we have preemption disabled, meaning
+ * cpu_online_mask is stable.
+ */
+ cpu = select_task_rq(p, SD_BALANCE_FORK, 0);
+ set_task_cpu(p, cpu);
+#endif

rq = task_rq_lock(p,&flags);
BUG_ON(p->state != TASK_WAKING);
@@ -2665,6 +2675,7 @@ void wake_up_new_task(struct task_struct *p, unsigned long clone_flags)
p->sched_class->task_woken(rq, p);
#endif
task_rq_unlock(rq,&flags);
+ put_cpu();
}

#ifdef CONFIG_PREEMPT_NOTIFIERS
@@ -7139,14 +7150,18 @@ int set_cpus_allowed_ptr(struct task_struct *p, const struct cpumask *new_mask)
* the ->cpus_allowed mask from under waking tasks, which would be
* possible when we change rq->lock in ttwu(), so synchronize against
* TASK_WAKING to avoid that.
+ *
+ * Make an exception for freshly cloned tasks, since cpuset namespaces
+ * might move the task about, we have to validate the target in
+ * wake_up_new_task() anyway since the cpu might have gone away.
*/
again:
- while (p->state == TASK_WAKING)
+ while (p->state == TASK_WAKING&& !(p->flags& PF_STARTING))
cpu_relax();

rq = task_rq_lock(p,&flags);

- if (p->state == TASK_WAKING) {
+ if (p->state == TASK_WAKING&& !(p->flags& PF_STARTING)) {
task_rq_unlock(rq,&flags);
goto again;
}

That commit solves my crash. Was my first time bisecting... thought I got it right. With the referenced commit, the system crashed when libvirtd was started, at the previous commit, it didn't crash. Regardless, the commit in tip fixes the issue. Hopefully it'll solve some of the other reported RCU hangs as well. Looks like the change for freshly cloned tasks is key.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Next message: AmÃrico Wang: "Re: [2.6.33-rc5 regression] NULL pointer dereference invlan_skb_recv - probably introduced by commit9793241fe92f7d9303fb221e43fc598eb065f267"
Previous message: Jaswinder Singh Rajput: "Re: [lm-sensors] Status of Andigilog asc7621 driver submitted by George Joseph on 2008-05-29"
In reply to: Mike Galbraith: "Re: Bisected rcu hang (kernel/sched.c): was 2.6.33rc4 RCU hang mmspin_lock deadlock(?) after running libvirtd - reproducible."
Next in thread: Mike Galbraith: "Re: Bisected rcu hang (kernel/sched.c): was 2.6.33rc4 RCU hang mmspin_lock deadlock(?) after running libvirtd - reproducible."
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]