Re: kernel panic at load average of 24 is it acceptable ?

From: Vikas Kedia
Date: Mon Jul 17 2006 - 04:06:32 EST


> Read up on MCE debugging methods on Linux or so, that should hopefully help.

Here is the output of mcelog:
root@srv1:~# less /var/log/mcelog
MCE 0
CPU 0 0 data cache TSC 6988ae18046
ADDR f87f5ec0
Data cache ECC error (syndrome ce)
bit46 = corrected ecc error
bus error 'local node origin, request didn't time out
data read mem transaction
memory access, level generic'
STATUS 9467400000000833 MCGSTATUS 0
MCE 0
CPU 0 0 data cache TSC 723b38a3633
ADDR 3d9fc0
Data cache ECC error (syndrome ce)
bit46 = corrected ecc error
bit62 = error overflow (multiple errors)
bus error 'local node origin, request didn't time out
data read mem transaction
memory access, level generic'
STATUS d467400000000833 MCGSTATUS 0
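
As a cross-check on the mcelog decode above, the two STATUS words can be
unpacked by hand. The following is a minimal Python sketch, not mcelog's own
code. The flag bits 63 to 57 and the bus error layout of the low 16 bits
follow the architectural machine-check format. Treating bit 46 as
"corrected ECC" and bits 54:47 as the ECC syndrome is an assumption taken
from this log (it does reproduce the "syndrome ce" line). The function and
field names are made up for illustration.

```python
# Sketch only: the meaning of bit 46 and the syndrome position (bits 54:47)
# are assumptions taken from the mcelog output quoted in this thread.

# Architectural flag bits in the upper part of an MCi_STATUS register.
FLAGS = {63: "VAL", 62: "OVER", 61: "UC", 60: "EN",
         59: "MISCV", 58: "ADDRV", 57: "PCC"}

# Field tables for a bus error code of the form 0000 1PPT RRRR IILL.
PP = ["local node origin", "local node response",
      "local node observed", "generic"]
RRRR = ["generic", "generic read", "generic write", "data read",
        "data write", "instruction fetch", "prefetch", "evict", "snoop"]
II = ["memory access", "reserved", "I/O", "other"]
LL = ["level 0", "level 1", "level 2", "level generic"]


def decode_status(status):
    """Split a 64-bit STATUS value into flags, syndrome and error code."""
    result = {
        "flags": [name for bit, name in FLAGS.items() if (status >> bit) & 1],
        "corrected_ecc": bool((status >> 46) & 1),   # assumed, see above
        "syndrome": (status >> 47) & 0xFF,           # assumed, see above
        "error_code": status & 0xFFFF,
    }
    code = result["error_code"]
    if (code >> 11) == 1:  # top five bits are 00001: bus error format
        rrrr = (code >> 4) & 0xF
        result["bus"] = {
            "participation": PP[(code >> 9) & 0x3],
            "timeout": bool((code >> 8) & 1),
            "request": RRRR[rrrr] if rrrr < len(RRRR) else "unknown",
            "memory_or_io": II[(code >> 2) & 0x3],
            "level": LL[code & 0x3],
        }
    return result


if __name__ == "__main__":
    # The two STATUS values from the log above.
    for value in (0x9467400000000833, 0xD467400000000833):
        print(hex(value), decode_status(value))
```

Run against both records, this shows UC clear in each case, i.e. both were
logged as corrected errors, and OVER set only in the second record.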

Since it shows an ECC error, is the hypothesis correct that it's a RAM
problem, and that replacing the RAM should solve it?

Regards,

Vikas

On 7/17/06, Andreas Mohr <andi@xxxxxxxxxxxxxxxxxxxxxxx> wrote:
> Hi,
>
> On Mon, Jul 17, 2006 at 12:08:41AM -0700, Vikas Kedia wrote:
> > The memtest ran fine for 8 hours:
> > http://www.visitlab.com/styles/basic/images/memtest.JPG
> >
> > My questions are:
> > 1. Kernel panic at load average of 24 is it acceptable ?
>
> Kernel panic is _NEVER_ acceptable.
> I've seen loads in the couple hundreds with no problem.
>
> However you actually have a mce_panic() crash here.
> Make sure to figure out why this Machine Check Exception got raised,
> otherwise you might hose the box if you continue without investigation.
> It could easily be due to a malfunctioning CPU fan etc., especially since it
> happened exactly while you stress-tested the machine.
>
> > 2. If not how do I go about debugging this kernel panic ?
>
> Read up on MCE debugging methods on Linux or so, that should hopefully help.
>
> Good luck!
>
> Andreas Mohr
