Re: Processes stuck on D state on Dual Opteron

From: Nick Piggin
Date: Mon Apr 11 2005 - 04:57:26 EST


Claudio Martins wrote:
On Sunday 10 April 2005 03:47, Andrew Morton wrote:

Suggest you boot with `nmi_watchdog=0' to prevent the nmi watchdog from
cutting in during long sysrq traces.

Also, capture the `sysrq-m' output so we can see if the thing is out of
memory.


Hi Andrew,

Thanks for the tip. I booted with nmi_watchdog=0 and was able to get a full sysrq-t as well as a sysrq-m. Since it might be a little too big for the list, I've put it on a text file at:

http://193.136.132.235/dl145/dump1-2.6.12-rc2.txt


OK, you _may_ be out of memory here (depending on what the lower zone
protection for DMA ends up as), however you are well above all the
"emergency watermarks" in ZONE_NORMAL. Also:

I also made a run with the mempool-can-fail patch from Nick Piggin. With this I got some nice memory allocation errors from the md threads when the trouble started. The dump (with sysrq-t and sysrq-m included) is at:

http://193.136.132.235/dl145/dump2-2.6.12-rc2-nick1.txt


This one shows plenty of memory. The allocation failure messages are
actually a good thing, and show that my patch is sort of working. I
have reworked it a bit so they won't show up though.

So probably not your common or garden memory deadlock.

The common theme seems to be: try_to_free_pages, swap_writepage,
mempool_alloc, down/down_failed in .text.lock.md. Next I would suspect
md/raid1 - maybe some deadlock in an uncommon memory allocation
failure path?

I'll see if I can reproduce it here.

--
SUSE Labs, Novell Inc.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/