Re: frequent lockups in 3.18rc4

From: Chris Mason
Date: Fri Dec 05 2014 - 14:04:54 EST

On Fri, Dec 5, 2014 at 1:38 PM, Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> wrote:
> On Fri, Dec 5, 2014 at 9:15 AM, Dave Jones <davej@xxxxxxxxxx> wrote:
>
> > A bisect later, and I landed on a kernel that ran for a day, before
> > spewing NMI messages, recovering, and then..
>
> I have to admit I'm seeing absolutely nothing sensible in there.
>
> Call it bad, and see if bisection ends up slowly - oh so slowly -
> pointing to some direction. Because I don't think it's the hardware,
> considering that apparently 3.16 is solid. And the spews themselves
> are so incomprehensible that I'm not seeing any pattern what-so-ever.

I went back through all of the traces Dave has posted in this thread. This one looks like vm debugging is on:

Another had a function call from CONFIG_DEBUG_PAGEALLOC:

So one idea is that, with these debug options on, our allocation/freeing of pages is dramatically more expensive and we're hitting a strange edge condition. Maybe we're even faulting on a read-only page from a horrible place?
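
To check which of these debug options a given build actually has enabled, something like the following could be run against the kernel's .config text. This is only a rough sketch: the helper name and the option list are assumptions for illustration (CONFIG_DEBUG_VM is a guess at what "vm debugging" refers to), not something taken from Dave's setup.

```python
# Options of interest; CONFIG_DEBUG_VM is an assumed reading of "vm debugging".
DEBUG_OPTIONS = ("CONFIG_DEBUG_PAGEALLOC", "CONFIG_DEBUG_VM")


def enabled_debug_options(config_text, options=DEBUG_OPTIONS):
    """Return the names from `options` that are set to 'y' in .config text."""
    enabled = []
    for line in config_text.splitlines():
        line = line.strip()
        # Disabled options look like "# CONFIG_FOO is not set"; skip those
        # along with blank lines and other comments.
        if line.startswith("#") or "=" not in line:
            continue
        name, value = line.split("=", 1)
        if name in options and value == "y":
            enabled.append(name)
    return enabled


if __name__ == "__main__":
    sample = (
        "CONFIG_DEBUG_PAGEALLOC=y\n"
        "# CONFIG_DEBUG_VM is not set\n"
        "CONFIG_EXT4_FS=m\n"
    )
    print(enabled_debug_options(sample))
```

The config text would have to come from wherever the build keeps it (the .config in the build tree, for example); that part is left out here on purpose.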

[83246.925234] end_request: I/O error, dev sda, sector 0

Ext3/4 shouldn't be doing IO to sector zero. Something is stomping on RAM?
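
To see whether sector-zero errors show up in the other logs as well, a small scan along these lines might help. It is a sketch only: the regular expression is modeled on the single end_request line quoted above, and the function name is made up for illustration.

```python
import re

# Modeled on the one line quoted in this mail:
# "[83246.925234] end_request: I/O error, dev sda, sector 0"
IO_ERROR_RE = re.compile(
    r"^\[\s*(?P<ts>\d+\.\d+)\]\s+end_request: I/O error, "
    r"dev (?P<dev>[^,\s]+), sector (?P<sector>\d+)"
)


def sector_zero_errors(log_text):
    """Return (timestamp, device) pairs for I/O errors reported at sector 0."""
    hits = []
    for line in log_text.splitlines():
        match = IO_ERROR_RE.match(line.strip())
        # Keep only the errors that target sector 0; others are ignored.
        if match and int(match.group("sector")) == 0:
            hits.append((float(match.group("ts")), match.group("dev")))
    return hits


if __name__ == "__main__":
    log = (
        "[83246.925234] end_request: I/O error, dev sda, sector 0\n"
        "[83247.000001] end_request: I/O error, dev sdb, sector 2048\n"
    )
    print(sector_zero_errors(log))
```

Lines that don't match the pattern are silently skipped, so a log with a different message format would simply produce an empty list.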


