Re: Reiserfs deadlock in 2.6.36

From: Frederic Weisbecker
Date: Tue Mar 08 2011 - 09:06:03 EST


On Tue, Mar 08, 2011 at 09:41:15AM +0100, Bastien ROUCARIES wrote:
> On Mon, Mar 7, 2011 at 8:00 PM, Frederic Weisbecker <fweisbec@xxxxxxxxx> wrote:
> > Hi Bastien,
>
> Cc: Ingo Molnar, because he works a lot on soft lockups and could have
> an idea how to debug this
> Cc: Andrew Morton, who is also tracking "File/memory corruption in 2.6.37"

About the corruption, I'm not sure it's the same problem. It's hard to
tell yet.

> >> It took me more than two days of testing to reproduce this bug with tracing enabled. My filesystem was quite slow and this bug seems
> >> to be timing related.
> >>
> >> One pattern that triggers this bug is git. Doing a lot of git work on my desktop crashes my machine.
> >>
> >> Moreover, trying to reproduce this bug leads to data loss. I have rebuilt my / partition twice using --rebuild-tree, and restored
> >> my home partition three times from backups.
> >>
> >> My log is here.
> >>
> >> Do you need more information?
> >
> > Yeah, do you have CONFIG_REISERFS_CHECK enabled? I would just
> > like to ensure we are not missing this important source of
> > information.
>
> Yes, I have it enabled.

Ok.

> > I'm puzzled because, given the traces, your openings and closings of the journal are
> > well balanced.
> >
> > You have a writer queued and stuck, but I see no sign of it in the trace stream.
> > I only see well balanced journal operations, including journal closings that would have
> > woken your queued writer.
> >
> > One theory is that your queued writer was waiting for someone to close the journal,
> > which finally happened, but only several minutes later, after many
> > journal openings/closings had overwritten the old trace containing the queueing of
> > the stuck writer.
>
> Running "while true; do sync && sleep 1; done" helps a lot

Which kernel are you running by the way?

> >
> > I don't know what to do yet. I need to think more about it.
> >
>
> Could we do what I suggested at first? Use lockdep to track
> journal open/close using a fake lock?

I don't think it's a suitable test. Lockdep is useful for detecting lock inversion
scenarios, but it's not very useful for detecting a lock that takes too long
to be released. For that we have the hung task detector, whose report we already
have.
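
For reference, here is a minimal sketch of how the hung task detector's
timeout can be inspected from userspace. It assumes a kernel built with
CONFIG_DETECT_HUNG_TASK; the HUNG_TASK_KNOB variable and the
show_hung_task_timeout function are hypothetical names used only for
illustration.

```shell
# Sysctl file exposing the hung task timeout in seconds. It only exists
# when the kernel was built with CONFIG_DETECT_HUNG_TASK.
# HUNG_TASK_KNOB is a hypothetical override, handy for testing.
HUNG_TASK_KNOB="${HUNG_TASK_KNOB:-/proc/sys/kernel/hung_task_timeout_secs}"

# Print the current timeout, or a notice if the detector is not available.
show_hung_task_timeout() {
    if [ -r "$HUNG_TASK_KNOB" ]; then
        echo "hung task timeout: $(cat "$HUNG_TASK_KNOB")s"
    else
        echo "hung task detector not available"
    fi
}

show_hung_task_timeout

# As root, and with SysRq enabled, the stacks of all blocked
# (uninterruptible) tasks can also be dumped to the kernel log on demand:
#   echo w > /proc/sysrq-trigger
```

A task is reported once it has stayed in uninterruptible sleep for longer
than this timeout, which matches the "lock held too long" symptom better
than lockdep's ordering checks do.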

> BTW, it seems that someone is experiencing this condition on ext3. I could
> do more testing if you want, and I will run xfstests to see
> if I can reproduce it more quickly

I'm not sure the file corruption and the deadlock are linked. But
maybe xfstests can provoke the deadlock (or the file corruption)
more quickly. It's pretty good at stressing file systems.