[PATCH] Re: [2.6.20] BUG: workqueue leaked lock

From: Jarek Poplawski
Date: Tue Mar 20 2007 - 05:33:42 EST


On 19-03-2007 07:24, Neil Brown wrote:
> On Friday March 16, akpm@xxxxxxxxxxxxxxxxxxxx wrote:
>> OK. That's not necessarily a bug: one could envisage a (weird) piece of
>> code which takes a lock then releases it on a later workqueue invocation.
>> But I'm not sure that nfs4_laundromat() is actually supposed to be doing
>> anything like that.
>>
>> Then again, maybe it is: it seems to be waddling through a directory under
>> the control of a little state machine, with timeouts.
>>
>> Neil: help?
>
> I'm quite certain that laundromat_main does *not* leave client_mutex
> locked, as the last thing it does is call nfs4_unlock_state, which is
> mutex_unlock(&client_mutex);
> To me, that raises some doubt about whether the lock leak check is
> working properly...
> It is somewhat harder to track locking of i_mutex, but it seems to me
> that every time it is taken, it is released again shortly afterwards.
>
> So I think this must be a problem with leak detection, not with NFSd.
>
> NeilBrown
>
>
>> On Fri, 16 Mar 2007 09:41:20 +0100 Peter Zijlstra <a.p.zijlstra@xxxxxxxxx> wrote:
>>
>>> On Thu, 2007-03-15 at 11:06 -0800, Andrew Morton wrote:
>>>>> On Tue, 13 Mar 2007 17:50:14 +0100 Folkert van Heusden <folkert@xxxxxxxxxxxxxx> wrote:
>>>>> ...
>>>>> [ 1756.728209] BUG: workqueue leaked lock or atomic: nfsd4/0x00000000/3577
>>>>> [ 1756.728271] last function: laundromat_main+0x0/0x69 [nfsd]
>>>>> [ 1756.728392] 2 locks held by nfsd4/3577:
>>>>> [ 1756.728435] #0: (client_mutex){--..}, at: [<c1205b88>] mutex_lock+0x8/0xa
>>>>> [ 1756.728679] #1: (&inode->i_mutex){--..}, at: [<c1205b88>] mutex_lock+0x8/0xa
>>>>> [ 1756.728923] [<c1003d57>] show_trace_log_lvl+0x1a/0x30
>>>>> [ 1756.729015] [<c1003d7f>] show_trace+0x12/0x14
>>>>> [ 1756.729103] [<c1003e79>] dump_stack+0x16/0x18
>>>>> [ 1756.729187] [<c102c2e8>] run_workqueue+0x167/0x170
>>>>> [ 1756.729276] [<c102c437>] worker_thread+0x146/0x165
>>>>> [ 1756.729368] [<c102f797>] kthread+0x97/0xc4
>>>>> [ 1756.729456] [<c1003bdb>] kernel_thread_helper+0x7/0x10
>>>>> [ 1756.729547] =======================
...
>>> I think I'm responsible for this message (commit
>>> d5abe669172f20a4129a711de0f250a4e07db298); what it says is that the
>>> function executed by the workqueue (here laundromat_main) leaked an
>>> atomic context or is still holding locks (2 locks in this case).
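
For reference, the check Peter refers to (commit d5abe669) sits at the end
of run_workqueue()'s loop. Reconstructed from the hunk below and the
backtrace above (so a sketch, not verbatim source), it currently looks
roughly like this:

	f(work);

	/*
	 * Complain if the work function returned in atomic context or
	 * with a non-zero (absolute) lockdep depth.
	 */
	if (unlikely(in_atomic() || lockdep_depth(current) > 0)) {
		printk(KERN_ERR "BUG: workqueue leaked lock or atomic: "
				"%s/0x%08x/%d\n",
				current->comm, preempt_count(), current->pid);
		printk(KERN_ERR "    last function: ");
		print_symbol("%s\n", (unsigned long)f);
		debug_show_held_locks(current);
		dump_stack();
	}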

This check is valid with keventd, but it looks like nfsd runs its own
kthread here (a dedicated workqueue), and I'm not sure it is illegal
for such a thread to hold locks across work functions. So, if I'm not
wrong about this, here is my patch proposal (for testing) to take this
into consideration: instead of complaining whenever any locks are held
after the work function returns, it compares the lockdep depth before
and after the call and complains only if the depth changed.
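
If I read fs/nfsd/nfs4state.c right (quoting from memory, so the details
are approximate), the nfs4 state code creates its own singlethread
workqueue for this - which would match the nfsd4 thread in the report
above - roughly:

	static struct workqueue_struct *laundry_wq;
	static void laundromat_main(struct work_struct *);
	static DECLARE_DELAYED_WORK(laundromat_work, laundromat_main);

	/* at nfs4 state init: a dedicated worker thread, not keventd */
	laundry_wq = create_singlethread_workqueue("nfsd4");

	/* laundromat_main() re-arms itself with the next timeout */
	queue_delayed_work(laundry_wq, &laundromat_work, t * HZ);

so laundromat_main runs on nfsd's own worker thread rather than on one
of the shared keventd threads.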

Reported-by: Folkert van Heusden <folkert@xxxxxxxxxxxxxx>
Signed-off-by: Jarek Poplawski <jarkao2@xxxxx>

---

diff -Nurp 2.6.21-rc4-git4-/kernel/workqueue.c 2.6.21-rc4-git4/kernel/workqueue.c
--- 2.6.21-rc4-git4-/kernel/workqueue.c	2007-02-04 19:44:54.000000000 +0100
+++ 2.6.21-rc4-git4/kernel/workqueue.c	2007-03-20 09:30:46.000000000 +0100
@@ -316,6 +316,7 @@ static void run_workqueue(struct cpu_wor
 		struct work_struct *work = list_entry(cwq->worklist.next,
 						struct work_struct, entry);
 		work_func_t f = work->func;
+		int ld;
 
 		list_del_init(cwq->worklist.next);
 		spin_unlock_irqrestore(&cwq->lock, flags);
@@ -323,13 +324,15 @@ static void run_workqueue(struct cpu_wor
 		BUG_ON(get_wq_data(work) != cwq);
 		if (!test_bit(WORK_STRUCT_NOAUTOREL, work_data_bits(work)))
 			work_release(work);
+		ld = lockdep_depth(current);
+
 		f(work);
 
-		if (unlikely(in_atomic() || lockdep_depth(current) > 0)) {
+		if (unlikely(in_atomic() || (ld -= lockdep_depth(current)))) {
 			printk(KERN_ERR "BUG: workqueue leaked lock or atomic: "
-					"%s/0x%08x/%d\n",
+					"%s/0x%08x/%d/%d\n",
 					current->comm, preempt_count(),
-					current->pid);
+					current->pid, ld);
 			printk(KERN_ERR "    last function: ");
 			print_symbol("%s\n", (unsigned long)f);
 			debug_show_held_locks(current);