Re: Machine dies under heavy I/O or network-access ..?

From: Andrew Morton
Date: Wed Sep 07 2005 - 04:20:43 EST


"Christiaan den Besten" <chris@xxxxxxxxxxx> wrote:
>
> We have recently installed two new Usenet feeders, but are having issue's keeping them alive under 'heavy' load. Both are SuperMicro
> based models with onboard Intel GB NICS and have a Areca 16 ports SATA-II controller. Both are Dual Xeon 2.8 with 4G ram, swap
> disabled. Filesystems are tmpfs, ext3 and xfs.
>
> Using plain vanilla 2.6.12+ these machines will usually crash within an hour or so. With -mm patched kernels this timeframe
> increases somewhat (seen 5 days). ( Crash means: none of the processes are responding, but machines does reply ping requests ... ).
> I have not yet been able to connect a serial for logging, and the screen does not come out of screen blanking on keyboard access...
>
> These machines handle aprox 200mbit on eth0 and 400mbit on eth1, the Areca driver will write +/- 20Mb/s to the disks and sometimes
> have to read upto 60Mb/s as well (just to give an impression of what the load is .. )
>
> Today I noticed the following assertion in dmesg:
>
> ---
> e1000: eth0: e1000_watchdog_task: NIC Link is Up 1000 Mbps Full Duplex
> KERNEL: assertion (!sk->sk_forward_alloc) failed at net/core/stream.c (279)
> KERNEL: assertion (!sk->sk_forward_alloc) failed at net/ipv4/af_inet.c (148)
> ---

Add some swapspace.



Before it hangs, do

echo 1 > /proc/sys/kernel/sysrq

and test that alt-sysrq-T produces output.

When it hangs, do alt-sysrq-p a few times and record the result. Then do
alt-sysrq-t a few times and record the result.

You'll probably need a serial console to record the results.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/