Re: [2.4] NMI WD detected lockup during page alloc

From: Oleg Drokin
Date: Tue Apr 06 2004 - 02:05:45 EST


Hello!

On Tue, Apr 06, 2004 at 12:12:55AM +0200, Andrea Arcangeli wrote:
> > In addition to what I have compiled in:
> > # lsmod
> > Module Size Used by Not tainted
> > ppp_deflate 4568 1 (autoclean)
> you may want to disable compression, this sounds like mm corruption and
> compression isn't trivial to handle in kernel skbs (though I doubt this
> is the problem but it's easy to disable).

Ok.

> > ipt_state 1016 4 (autoclean)
> the hang while unloading this module may also be a sign of a bug in the
> module so it would be nice if you could reproduce also w/o the above
> ipt_state.

Unfortunately this is not as easy to do, though I believe there is just some
sort of race on unload that is not hit until the module is unloaded, and
therefore it is completely unrelated.

> If this still doesn't help then you can try to go UP again, SMP is
> harder at stressing the memory bus and see if it stabilizes. Other thing
> you can do is to remove half of the ram and see if it stabilizes to try
> to identify buggy ram slots.

I have ECC RAM there, and it passed 14 days of memtest (yes, I know memtest uses
only 1 CPU), so I do not think I have memory problems, though of course this is
no absolute guarantee.
Also, running in UP mode for weeks is not all that fun and still proves nothing,
as I have no way to reproduce the problem within a predictable amount of time.

> Overall it's unlikely the oops is useful unfortunately since that piece
> of the kernel is the most stressed ever, and it just signals random mm
> corruption. I assume this is the first time you've got the nmi watchdog
> oops, if you could get it again it would be more interesting, I'd expect
> next time you would get it in another place.

Well, I had a hang before this oops, and that was the main reason I enabled the
NMI watchdog. At that first hang nothing got to the serial console, so I guessed
it was a similar spinlock deadlock.
We'll see what I get the next time the NMI watchdog fires. I am running
with spinlock debugging this time, so hopefully, if the spinlock is really just
corrupted, its magic will be corrupted as well and I will get a clear warning
about that.

Thank you.

Bye,
Oleg