Re: 2.6.23 spinlock hang in kswapd under heavy disk write loads
From: Nick Piggin
Date: Wed Oct 10 2007 - 23:24:00 EST
On Friday 12 October 2007 10:56, Berkley Shands wrote:
> 100% reproducible on the two motherboards in question.
> Does not happen on any other motherboard I have in my possession
> (not tyan, not uniwide, not socket 940...)
>
> No errors, no dmesg, nothing with debug_spinlock set.
> <sysrq> shows lots (when it works), but by then too many things are
> locked up to be of much use. I can get into KDB and look around
> (2.6.22 for kdb - it hangs there too). Even access to the local disk is
> blocked.
> Processes in core and running remain there (iostat, top, ...).
> I personally think the bios are suspect on the PCIe, as symptoms change
> with the bios rev. I did a major paper on SAS performance with one H8DMi,
> but it got a bios rev, and now crashes. Missed interrupt? APIC sending an
> interrupt to multiple cpus? I don't know.
>
> Tell me what to look at, and I can get you the info. It usually takes 20
> seconds to go bang, using either the LSI8888ELP or the rocket raid 2340.
> Other controllers are too slow. 2.6.20 does not lock up. It is also
> 200MB/Sec slower in writing :-)
>
> thanks for the response.
OK, it does sound suspiciously like a hardware bug, or some
unrelated software bug that is causing memory scribbles...
A few things you could do.
One is that you could verify that it indeed is the kswapd_wait
spinlock that it is spinning on, and then when you see the lockup,
you could verify that no other tasks are holding the lock. (it is
quite an inner lock, so you shouldn't have to wade through call
chains...). That would confirm corruption. Dumping the lock
contents and the fields in the structure around the lock might
give a clue.
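To illustrate the kind of dump meant here, a minimal userspace
sketch follows. It is not kernel code: struct fake_waitq and
dump_marked() are hypothetical names made up for this example, and
the field layout does not match the real structure. It only shows
the idea of printing the bytes around a lock with the lock word
marked, so that a scribble that overlaps its neighbours stands out.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for the structure that embeds the lock.
 * The fields are illustrative only, not the kernel's layout. */
struct fake_waitq {
    uint32_t before;     /* field preceding the lock */
    uint32_t lock_word;  /* stand-in for the raw spinlock value */
    uint32_t after;      /* field following the lock */
};

/* Format len bytes starting at p as hex into out, with square brackets
 * around the bytes in [mark_off, mark_off + mark_len).
 * Returns the number of characters written (excluding the NUL),
 * or -1 if out is too small. */
static int dump_marked(const void *p, size_t len,
                       size_t mark_off, size_t mark_len,
                       char *out, size_t outsz)
{
    const unsigned char *b = p;
    size_t used = 0;

    if (outsz == 0)
        return -1;
    out[0] = '\0';

    for (size_t i = 0; i < len; i++) {
        int marked = i >= mark_off && i < mark_off + mark_len;
        int n = snprintf(out + used, outsz - used,
                         marked ? "[%02x]" : " %02x ", b[i]);
        if (n < 0 || (size_t)n >= outsz - used)
            return -1;   /* output would have been truncated */
        used += (size_t)n;
    }
    return (int)used;
}
```

With a struct fake_waitq q, the call would be dump_marked(&q,
sizeof q, offsetof(struct fake_waitq, lock_word), sizeof
q.lock_word, buf, sizeof buf). In the kernel the same bytes would
go out through printk instead of a buffer.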
You could put the spinlock somewhere else and see what happens
(move it around in the structure, or get even more creative...).
Or do something like have two spinlocks, and when you encounter
the lockup, verify whether or not they agree.
(It sounds like you're pretty capable, but if you want me to have
a look at doing a patch or two to help, let me know.)
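The two-spinlock idea can be sketched in userspace with C11
atomics. This is only a model under stated assumptions: struct
paired_lock, paired_acquire(), paired_release() and paired_agree()
are hypothetical names, and a real patch would use the kernel's
own spinlock primitives. The point is that both words are always
taken and released together, so if they disagree while the system
is wedged, one of them was written by something other than the
locking code.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical pair of lock words that must always change together. */
struct paired_lock {
    atomic_int primary;
    char pad[64];        /* keep the shadow away from the primary word */
    atomic_int shadow;
};

static void paired_init(struct paired_lock *l)
{
    atomic_init(&l->primary, 0);
    atomic_init(&l->shadow, 0);
}

/* Spin until the word goes from 0 (free) to 1 (held). */
static void spin_acquire(atomic_int *w)
{
    int expected = 0;
    while (!atomic_compare_exchange_weak(w, &expected, 1))
        expected = 0;    /* a failed CAS overwrites expected; reset it */
}

static void paired_acquire(struct paired_lock *l)
{
    spin_acquire(&l->primary);
    spin_acquire(&l->shadow);
}

static void paired_release(struct paired_lock *l)
{
    /* release in reverse order of acquisition */
    atomic_store(&l->shadow, 0);
    atomic_store(&l->primary, 0);
}

/* Compare the two words. Only meaningful when nothing is in the middle
 * of acquiring or releasing, e.g. when inspecting a hung system. */
static bool paired_agree(struct paired_lock *l)
{
    return atomic_load(&l->primary) == atomic_load(&l->shadow);
}
```

The check is racy while the lock is being taken or dropped, so it
is intended for inspection after the hang, not for the hot path.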
Another is to bisect the problem. However, as you say, the older
kernel is slower at writing, so you may just bisect to the point
where it is sustaining enough load to trigger the bug, so this may
not be worth your time just yet.
You could _try_ turning on slab debugging. If there is random
corruption, it might get caught. Maybe it will just change
things enough to hide the problem though.
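For a rough picture of why slab debugging can catch this kind of
corruption, here is a simplified userspace sketch of the poisoning
idea: a freed object is filled with a known byte pattern, and the
pattern is checked before the object is handed out again. The
helper names are hypothetical, and the kernel's actual checks
(red zones, owner tracking, and so on) are more involved than this.

```c
#include <stddef.h>
#include <string.h>

/* Pattern written over freed objects. 0x6b is the value the kernel
 * uses for POISON_FREE; everything else here is illustrative. */
#define POISON_BYTE 0x6b

/* Fill a freed object with the poison pattern. */
static void poison_object(unsigned char *obj, size_t size)
{
    memset(obj, POISON_BYTE, size);
}

/* Return the offset of the first byte that no longer holds the poison
 * pattern, or -1 if the object is intact. A non-negative result means
 * something wrote to the object after it was freed. */
static long first_scribble(const unsigned char *obj, size_t size)
{
    for (size_t i = 0; i < size; i++) {
        if (obj[i] != POISON_BYTE)
            return (long)i;
    }
    return -1;
}
```

A stray write only shows up if it lands in a poisoned object, which
is why this may or may not catch the problem.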
Thanks for reporting!