Re: Reiserfs deadlock in 2.6.36

From: Frederic Weisbecker
Date: Thu Dec 02 2010 - 12:43:42 EST


On Fri, Nov 26, 2010 at 05:57:05PM +0100, Bastien ROUCARIES wrote:
> Dear frederic,
> > Hi Bastien,
> >
> > This really looks like a hung task detector report.
> > Several tasks are stuck in queue_log_writer(), waiting
> > to be woken up on the "journal->j_join_wait" event and
> > that never happens because the waker is also stuck.
> > The problem is that your report doesn't show where the waker
> > is stuck. The hung task detector does report it, just
> > before or after the chunk you've posted.
> >
> > If you could provide me the entire report, I could fix this
> > easily.
>
> I have managed to reproduce it after six hours of stress. Unfortunately locked was
> disabled due to a known non-bug in the init sequence. I have used sysrq -t in
> order to get more information to you.
>
> Do I need to try to reproduce it with a newer kernel? Or is this sufficient?

> Nov 26 16:27:56 portablebastien kernel: [27960.775903] kded4 D 00000001006907a6 0 2852 1 0x00000000
> Nov 26 16:27:56 portablebastien kernel: [27960.777842] ffff8800d8a97b28 0000000000000046 ffff880000000000 ffff880100000000
> Nov 26 16:27:56 portablebastien kernel: [27960.779768] ffff8800d8a96010 ffff8800d8a97fd8 ffff8800379f4f60 ffff8800379f5230
> Nov 26 16:27:56 portablebastien kernel: [27960.781694] ffff8800379f5228 0000000000014d80 0000000000014d80 ffff8800d8a97fd8
> Nov 26 16:27:56 portablebastien kernel: [27960.783594] Call Trace:
> Nov 26 16:27:56 portablebastien kernel: [27960.785483] [<ffffffffa01b8454>] queue_log_writer+0x7e/0xaf [reiserfs]
> Nov 26 16:27:56 portablebastien kernel: [27960.787344] [<ffffffff81044423>] ? default_wake_function+0x0/0xf
> Nov 26 16:27:56 portablebastien kernel: [27960.789253] [<ffffffffa01bc402>] do_journal_begin_r+0x1ee/0x2d8 [reiserfs]
> Nov 26 16:27:56 portablebastien kernel: [27960.791142] [<ffffffffa01bc5ae>] journal_begin+0xc2/0x103 [reiserfs]
> Nov 26 16:27:56 portablebastien kernel: [27960.793070] [<ffffffffa019ebb6>] reiserfs_create+0x105/0x233 [reiserfs]
> Nov 26 16:27:56 portablebastien kernel: [27960.794960] [<ffffffff8110b57d>] ? generic_permission+0x17/0x9a
> Nov 26 16:27:56 portablebastien kernel: [27960.796854] [<ffffffff81171e65>] ? security_inode_permission+0x1c/0x1e
> Nov 26 16:27:56 portablebastien kernel: [27960.798714] [<ffffffff8110c423>] vfs_create+0x6b/0x8d
> Nov 26 16:27:56 portablebastien kernel: [27960.800570] [<ffffffff8110cdee>] do_last+0x26c/0x532
> Nov 26 16:27:56 portablebastien kernel: [27960.802377] [<ffffffff8110eb96>] do_filp_open+0x203/0x599
> Nov 26 16:27:56 portablebastien kernel: [27960.804232] [<ffffffff8134bd2b>] ? _raw_spin_unlock+0x26/0x2a
> Nov 26 16:27:56 portablebastien kernel: [27960.806058] [<ffffffff811184a0>] ? alloc_fd+0x170/0x182
> Nov 26 16:27:56 portablebastien kernel: [27960.807911] [<ffffffff81101366>] do_sys_open+0x5b/0xf7
> Nov 26 16:27:56 portablebastien kernel: [27960.809790] [<ffffffff8134b48e>] ? trace_hardirqs_on_thunk+0x3a/0x3f
> Nov 26 16:27:56 portablebastien kernel: [27960.811646] [<ffffffff8110142b>] sys_open+0x1b/0x1d
> Nov 26 16:27:56 portablebastien kernel: [27960.813506] [<ffffffff81009ac2>] system_call_fastpath+0x16/0x1b

OK, this time I don't think a deadlock between the reiserfs lock and
another lock is involved.

We entered queue_log_writer() and then waited for someone to call do_journal_end()
to signal that it had finished its work with the journal.

But somehow that didn't happen. Or maybe we called queue_log_writer() when we shouldn't have,
thinking there was already a writer when there wasn't. Or there is a crazy race somewhere.
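
To make the suspected failure mode concrete, here is a minimal userspace sketch in Python. It is not reiserfs code and only models the pattern described above: a task sleeps in something like queue_log_writer() until something like do_journal_end() wakes it. JournalModel, writer_active, journal_end() and run_waiter() are hypothetical names made up for the illustration, and the timeout exists only so the demo terminates. The task in the trace is in D state, so in the kernel it would simply sleep forever.

```python
import threading


class JournalModel:
    """Hypothetical userspace stand-in for the journal's join wait queue."""

    def __init__(self):
        self.join_wait = threading.Condition()
        # Assumption: the waiter believes a writer currently owns the transaction.
        self.writer_active = True

    def queue_log_writer(self, timeout):
        """Sleep until the writer is done. Returns False if nobody woke us."""
        with self.join_wait:
            return self.join_wait.wait_for(
                lambda: not self.writer_active, timeout=timeout
            )

    def journal_end(self):
        """Stand-in for the waker: mark the writer done and wake all waiters."""
        with self.join_wait:
            self.writer_active = False
            self.join_wait.notify_all()


def run_waiter(journal, waker_runs, timeout=0.5):
    """Start one waiter thread and optionally run the waker. Returns the waiter's result."""
    result = []
    waiter = threading.Thread(
        target=lambda: result.append(journal.queue_log_writer(timeout))
    )
    waiter.start()
    if waker_runs:
        journal.journal_end()
    waiter.join()
    return result[0]


if __name__ == "__main__":
    print("waker runs:   ", run_waiter(JournalModel(), waker_runs=True))
    print("waker missing:", run_waiter(JournalModel(), waker_runs=False))
```

In this model both hypotheses look identical from the waiter's side: whether the waker is stuck elsewhere or never existed, the waiter just never returns. That is why the trace of the waiter alone cannot distinguish them.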

On which kernel do you see this? Do you know of a kernel on which you've never seen it?
Were you running something specific to trigger this deadlock?

Thanks!