[BUG] circular lock dependency in tip
From: Steven Rostedt
Date: Wed Mar 18 2009 - 17:33:59 EST
I hit this bug in latest tip:
bxrhel51 login: [ 2773.995862]
[ 2773.995867] =======================================================
[ 2773.998502] [ INFO: possible circular locking dependency detected ]
[ 2773.998502] 2.6.29-rc8-tip #760
[ 2773.998502] -------------------------------------------------------
[ 2773.998502] grep/15590 is trying to acquire lock:
[ 2773.998502] (&q->lock){-.-.-.}, at: [<ffffffff80246961>] __wake_up+0x31/0x75
[ 2773.998502]
[ 2773.998502] but task is already holding lock:
[ 2773.998502] (&rq->lock){-.-.-.}, at: [<ffffffff80eaa1b6>] schedule+0x147/0x770
[ 2773.998502]
[ 2773.998502] which lock already depends on the new lock.
[ 2773.998502]
[ 2773.998502]
[ 2773.998502] the existing dependency chain (in reverse order) is:
[ 2773.998502]
[ 2773.998502] -> #1 (&rq->lock){-.-.-.}:
[ 2773.998502] [<ffffffff8028274e>] __lock_acquire+0xa43/0xc0a
[ 2773.998502] [<ffffffff8028392a>] lock_acquire+0x11e/0x15e
[ 2773.998502] [<ffffffff80eacd07>] _spin_lock+0x4b/0x94
[ 2773.998502] [<ffffffff80246bf2>] task_rq_lock+0x5f/0xa7
[ 2773.998502] [<ffffffff8024f97b>] try_to_wake_up+0x40/0x2c1
[ 2773.998502] [<ffffffff8024fc1c>] default_wake_function+0x20/0x36
[ 2773.998502] [<ffffffff80244d2a>] __wake_up_common+0x55/0x9b
[ 2773.998502] [<ffffffff8024688e>] complete+0x47/0x71
[ 2773.998502] [<ffffffff80270727>] kthreadd+0x165/0x196
[ 2773.998502] [<ffffffff8020cd1a>] child_rip+0xa/0x20
[ 2773.998502] [<ffffffffffffffff>] 0xffffffffffffffff
[ 2773.998502]
[ 2773.998502] -> #0 (&q->lock){-.-.-.}:
[ 2773.998502] [<ffffffff80282630>] __lock_acquire+0x925/0xc0a
[ 2773.998502] [<ffffffff8028392a>] lock_acquire+0x11e/0x15e
[ 2773.998502] [<ffffffff80ead75a>] _spin_lock_irqsave+0x63/0xb1
[ 2773.998502] [<ffffffff80246961>] __wake_up+0x31/0x75
[ 2773.998502] [<ffffffff802d4076>] wakeup_kswapd+0x78/0x96
[ 2773.998502] [<ffffffff802cb7b7>] __alloc_pages_internal+0x110/0x407
[ 2773.998502] [<ffffffff802faf8d>] alloc_pages_node+0x47/0x63
[ 2773.998502] [<ffffffff802fb0f3>] alloc_slab_page+0x3e/0x54
[ 2773.998502] [<ffffffff802fb160>] new_slab+0x57/0x20a
[ 2773.998502] [<ffffffff802fb944>] __slab_alloc+0x20c/0x394
[ 2773.998502] [<ffffffff802fbb5b>] kmem_cache_alloc_node_notrace+0x8f/0x106
[ 2773.998502] [<ffffffff8068323a>] alloc_cpumask_var_node+0x48/0xec
[ 2773.998502] [<ffffffff80683302>] alloc_cpumask_var+0x24/0x3d
[ 2773.998502] [<ffffffff80eaa25a>] schedule+0x1eb/0x770
[ 2773.998502] [<ffffffff80eaa8e8>] io_schedule+0x3e/0x65
[ 2773.998502] [<ffffffff802c3c9f>] sync_page+0x52/0x6a
[ 2773.998502] [<ffffffff80eaad61>] __wait_on_bit_lock+0x51/0xb0
[ 2773.998502] [<ffffffff802c3bc1>] __lock_page+0x71/0x8c
It seems the issue is that we have:
static void idle_balance(int this_cpu, struct rq *this_rq)
{
	struct sched_domain *sd;
	int pulled_task = 0;
	unsigned long next_balance = jiffies + HZ;
	cpumask_var_t tmpmask;

	if (!alloc_cpumask_var(&tmpmask, GFP_ATOMIC))
		return;
[...]
and inside the scheduler we have:
	if (unlikely(!rq->nr_running))
		idle_balance(cpu, rq);
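So schedule() ends up calling into the page allocator while rq->lock is held;
under memory pressure the GFP_ATOMIC allocation hits wakeup_kswapd(), which
takes kswapd's wait-queue lock (&q->lock). That is the reverse of the normal
q->lock -> rq->lock order that complete()/try_to_wake_up() takes in chain #1
above, hence the circular dependency report.

One way to avoid it might be to never allocate under rq->lock at all and hand
idle_balance() a mask that was preallocated with GFP_KERNEL at boot. Something
roughly like this (completely untested sketch, the variable and function names
are made up):

/*
 * Illustrative only: preallocate a per-cpu scratch mask with no locks
 * held, so idle_balance() never has to enter the page allocator while
 * holding rq->lock.  In practice this would probably want to be set up
 * from sched_init() rather than an initcall.
 */
static DEFINE_PER_CPU(cpumask_var_t, idle_balance_tmpmask);

static int __init idle_balance_mask_init(void)
{
	int cpu;

	for_each_possible_cpu(cpu)
		alloc_cpumask_var(&per_cpu(idle_balance_tmpmask, cpu),
				  GFP_KERNEL);
	return 0;
}
core_initcall(idle_balance_mask_init);

static void idle_balance(int this_cpu, struct rq *this_rq)
{
	struct sched_domain *sd;
	int pulled_task = 0;
	unsigned long next_balance = jiffies + HZ;
	/* use the preallocated mask, no GFP_ATOMIC allocation under rq->lock */
	struct cpumask *tmpmask = per_cpu(idle_balance_tmpmask, this_cpu);
[...]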
Attached is the full log.
-- Steve
Attachment:
log2.gz
Description: Binary data