Re: [HW PROBLEM] Intel I7 MCE. Erratum or not?

From: Giangiacomo Mariotti
Date: Mon Dec 08 2008 - 03:04:47 EST


On Mon, Dec 8, 2008 at 8:42 AM, Hidetoshi Seto
<seto.hidetoshi@xxxxxxxxxxxxxx> wrote:
> Giangiacomo Mariotti wrote:
>> I noticed something else, which though may be due to my inexperience
>> with mce messages. In my directory /sys/devices/system/machinecheck
>> there are machinecheck0-7(one for each logical cpu of my system I
>> presume). Having received the MCE log always for cpu 0, I went to look
>> inside dir machinecheck0 and I found bank0-5ctl. So now my question
>> is, why do I receive MCE logs about bank 6, if my cpus don't have a
>> bank 6? Does that count start from 1? Or am I missing something else?
>
> Answer would be in the following commit:
>
>> commit 8edc5cc5ec880c96de8e6686fb0d7a5231e91c05
>> Author: Venki Pallipadi <venkatesh.pallipadi@xxxxxxxxx>
>> Date: Mon May 12 15:43:34 2008 +0200
>>
>> x86: remove 6 bank limitation in 64 bit MCE reporting code
> (snip)
>> The patch below does not create sysfs control (bankNctl) for banks
>> higher than 6 as well. That needs some pre-cleanup in /sysfs mce layout,
>> removal of per cpu /sysfs entries for bankctl as they are really global
>> system level control today. That change will follow. This basic change
>> is critical to report the detailed errors on banks higher than 6.
>
> So there are 6 sysfs control(bank0-5ctl) even if your cpu have more banks.
>
> Old kernel with bank limitation will say:
> "MCE: warning: using only %d banks\n"
> And it seems that old kernel will ignore records in banks higher than 6.
>
> Thanks,
> H.Seto
>
>
I see, thanks for the info.
I still don't quite understand the logic behind this exception. It
always happens exactly once per boot, shortly after booting and always
at [ 301.7320xx], which clearly means that it's always triggered by the
same instruction(s). It reports a "Generic CACHE Level-2 Data-Write
Error", yet after that moment it never happens again until the next
boot, at the same relative time. The cache has a hardware problem and
the process context is corrupted, but after that single message I
don't have any problems: my system works normally, even under very
heavy CPU and memory load. Is this normal? Should I try limiting the
number of CPUs in use to just one (cpu0) in the BIOS and disabling
hyperthreading? That way I'd have a single physical and logical CPU, so
presumably, if its cache really has a hardware problem, the sky would
fall?
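
For reference, here is a minimal sketch of how a machine-check status
word breaks down into the flags and the cache error text discussed
above. The bit positions and the compound error code layout
(0000 0001 RRRR TTLL) are taken from the Intel SDM machine-check
chapter; the label strings are my own approximation of what the decoder
prints, and the example status value is hypothetical, not the one from
my log.

```python
# Sketch: decode a few architectural fields of an IA32_MCi_STATUS value.
# Bit positions follow the Intel SDM; label strings are approximations.

FLAG_BITS = {
    "VAL": 63,    # register holds a valid error record
    "OVER": 62,   # an earlier error was overwritten
    "UC": 61,     # error was not corrected by hardware
    "EN": 60,     # reporting was enabled for this error
    "MISCV": 59,  # MCi_MISC holds additional information
    "ADDRV": 58,  # MCi_ADDR holds an address
    "PCC": 57,    # processor context may be corrupted
}

TRANSACTION = {0: "Instruction", 1: "Data", 2: "Generic"}           # TT
LEVEL = {0: "Level-0", 1: "Level-1", 2: "Level-2", 3: "Generic"}    # LL
REQUEST = {                                                         # RRRR
    0: "Generic", 1: "Read", 2: "Write", 3: "Data-Read",
    4: "Data-Write", 5: "Instruction-Fetch", 6: "Prefetch",
    7: "Eviction", 8: "Snoop",
}


def decode_status(status):
    """Return (list of set flag names, cache error text or None)."""
    flags = [name for name, bit in FLAG_BITS.items()
             if (status >> bit) & 1]

    code = status & 0xFFFF  # MCA error code lives in bits 15:0
    text = None
    # Cache hierarchy errors match 0000 0001 RRRR TTLL; bit 12 carries
    # a separate flag on some processors, so it is ignored in the match.
    if (code & 0xEF00) == 0x0100:
        rrrr = (code >> 4) & 0xF
        tt = (code >> 2) & 0x3
        ll = code & 0x3
        text = "%s CACHE %s %s Error" % (
            TRANSACTION.get(tt, "Reserved"),
            LEVEL[ll],
            REQUEST.get(rrrr, "Reserved"),
        )
    return flags, text


if __name__ == "__main__":
    # Hypothetical status word: VAL, UC, EN and PCC set, error code 0x014A.
    example = (1 << 63) | (1 << 61) | (1 << 60) | (1 << 57) | 0x014A
    print(decode_status(example))
```

With the hypothetical value above, the PCC flag is set and the error
code decodes to the same kind of "Generic CACHE Level-2 Data-Write
Error" text that shows up in my log.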