Re: [HW PROBLEM] Intel I7 MCE. Erratum or not?

From: Robert Hancock
Date: Sat Dec 06 2008 - 22:26:19 EST


Giangiacomo Mariotti wrote:
On Sat, Dec 6, 2008 at 10:47 PM, Robert Hancock <hancockr@xxxxxxx> wrote:
Giangiacomo Mariotti wrote:
On Sat, Dec 6, 2008 at 9:58 PM, Robert Hancock <hancockr@xxxxxxx> wrote:
Giangiacomo Mariotti wrote:
Hi everyone,
mcelog just logged this on my new Intel i7 920 (on Linux 2.6.27.8):
MCE 0
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 0 BANK 6 MISC 202d ADDR ffeef740
MCG status:
MCi status:
Error overflow
Uncorrected error
MCi_MISC register valid
MCi_ADDR register valid
Processor context corrupt
MCA: Generic CACHE Level-2 Data-Write Error
STATUS ee0000000100014a MCGSTATUS 0

I'm reporting this here because I found in the Intel i7 Technical
Specification November 2008 update that something very similar is in
fact an erratum. So my question is: is there any way for me to verify
that my problem is due to one of those errata, rather than to broken
hardware (if we don't want to count the errata themselves as broken
hardware)? I'm also reporting this because, if the problem really is
due to those errata, it may be useful to know that they occur in
practice, and to find workarounds in the kernel so as not to scare
poor Linux users to death!
Which erratum are you talking about? I don't see one in that document
that would match this case.

Well, the first one seems very similar, even if it talks about a DTLB
error instead of a cache error. But sure, being similar doesn't mean
much. Number 52 seems similar too. I guess I should just give up and
admit that my hardware is broken!

The first one is just indicating that if a DTLB error occurs the overflow
bit may be set incorrectly. It's not a false error though. The AAJ52 erratum
would only occur immediately after powerup or wake from sleep states.

The MCE actually got logged once, immediately after powerup, and never
again. Is that reasonable? A cache error which happens just once after
boot?

The erratum refers to an internal parity error, not an L2 cache write error.

If it only happened once then who knows, it could be a cosmic ray or something. But if it happens again, it sounds like you likely have a bad CPU.
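
For what it's worth, the STATUS value in the log can be cross-checked by hand against what mcelog printed. Below is a minimal sketch in Python, assuming the architectural IA32_MCi_STATUS layout from the Intel SDM (flag bits 63..57, MCA error code in bits 15:0, cache-hierarchy codes of the form 000F 0001 RRRR TTLL). The function and table names are made up for illustration and are not taken from mcelog or the kernel.

```python
# Flag bits of IA32_MCi_STATUS (architectural layout, per the Intel SDM).
STATUS_FLAGS = [
    (63, "VAL"),    # register contents valid
    (62, "OVER"),   # error overflow
    (61, "UC"),     # uncorrected error
    (60, "EN"),     # error reporting enabled
    (59, "MISCV"),  # MCi_MISC register valid
    (58, "ADDRV"),  # MCi_ADDR register valid
    (57, "PCC"),    # processor context corrupt
]

# Sub-fields of a cache-hierarchy MCA error code: 000F 0001 RRRR TTLL.
REQUEST = {0: "ERR", 1: "RD", 2: "WR", 3: "DRD", 4: "DWR",
           5: "IRD", 6: "PREFETCH", 7: "EVICT", 8: "SNOOP"}
TYPE = {0: "Instruction", 1: "Data", 2: "Generic"}
LEVEL = {0: "Level-0", 1: "Level-1", 2: "Level-2", 3: "Generic"}


def decode_mci_status(status):
    """Return (names of the set flag bits, cache error tuple or None)."""
    flags = [name for bit, name in STATUS_FLAGS if (status >> bit) & 1]
    code = status & 0xFFFF  # MCA error code, bits 15:0
    cache = None
    # Bit 12 is the F (filtering) bit, so mask it out before matching
    # the 000F 0001 prefix of a cache-hierarchy error.
    if (code & 0xEF00) == 0x0100:
        cache = (TYPE.get((code >> 2) & 3, "?"),      # TT
                 LEVEL[code & 3],                     # LL
                 REQUEST.get((code >> 4) & 0xF, "?")) # RRRR
    return flags, cache


if __name__ == "__main__":
    print(decode_mci_status(0xEE0000000100014A))
```

For ee0000000100014a this gives VAL, OVER, UC, MISCV, ADDRV and PCC set (EN is clear) and a Generic / Level-2 / DWR (data write) cache code, which agrees with the mcelog decoding quoted in the report.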