Re: AMD A10: MCE Instruction Cache Error

From: Borislav Petkov
Date: Sat Nov 03 2012 - 00:49:30 EST


On Fri, Nov 02, 2012 at 02:53:45PM +0100, Alexander Holler wrote:
> On 02.11.2012 11:50, Alexander Holler wrote:
> >Hello,
> >
> >I've just got the following on an AMD A10 5800K:
> >
> >------
> >[ 8395.999581] [Hardware Error]: CPU:0 MC1_STATUS[-|CE|MiscV|-|AddrV|-|-]: 0x8c00002000010151
> >[ 8395.999586] [Hardware Error]: MC1_ADDR: 0x0000ffffa00e1203
> >[ 8395.999588] [Hardware Error]: Instruction Cache Error: Parity error during data load from IC.
> >[ 8395.999590] [Hardware Error]: cache level: L1, tx: INSN, mem-tx: IRD
> >------
> >
> >Kernel is 3.6.5, MB is an Asus F2A85-M with BIOS 5103 (the latest).
> >
> >Can someone enlighten me about what might be wrong with my (new) system
> >(memtest didn't show any errors)?
> >
> >What IC is meant? As far as I know, this processor doesn't support ECC,
> >so I wonder where that parity error comes from.
>
> I assume IC means Instruction Cache. ;)

It says so earlier in the sentence: "Instruction Cache Error" :)

> As the kernel didn't reboot or halt, this seems to have been a
> correctable error.

Yes, it is (the "CE" bit in MC1_STATUS). Btw, I have reworked this code
to spit human-readable information first. It also says what the error
severity is now.
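
For reference, the status value from the report can be decoded by hand. The following is only a rough sketch, not the kernel's decoder: the flag bit positions are assumed from the architectural x86 MCi_STATUS layout, the lookup tables assume the low 16 bits use the memory error format (0000 0001 RRRR TTLL), and decode_status is a hypothetical helper name.

```python
# Flag bit positions in MCi_STATUS, assumed from the architectural x86 MCA layout.
VAL, OVER, UC, MISCV, ADDRV, PCC = 63, 62, 61, 59, 58, 57

# Lookup tables for the low 16-bit memory error code (format 0000 0001 RRRR TTLL).
LL = ["RESV", "L1", "L2", "L3/GEN"]
TT = ["INSN", "DATA", "GEN", "RESV"]
RRRR = ["GEN", "RD", "WR", "DRD", "DWR", "IRD", "PRF", "EV", "SNP"]


def decode_status(status):
    """Hypothetical helper: split a raw status value into flags and error-code fields."""
    def bit(n):
        return bool((status >> n) & 1)

    code = status & 0xFFFF
    rrrr = (code >> 4) & 0xF
    return {
        "valid": bit(VAL),
        "overflow": bit(OVER),
        # There is no dedicated "CE" bit: an error is reported as corrected
        # when the UC (uncorrected) bit is clear.
        "severity": "UE" if bit(UC) else "CE",
        "miscv": bit(MISCV),
        "addrv": bit(ADDRV),
        "pcc": bit(PCC),
        "level": LL[code & 0x3],
        "tx": TT[(code >> 2) & 0x3],
        "mem_tx": RRRR[rrrr] if rrrr < len(RRRR) else "RESV",
    }


info = decode_status(0x8C00002000010151)
print(info["severity"], info["level"], info["tx"], info["mem_tx"])
```

For the value above this prints "CE L1 INSN IRD", which agrees with the decoded lines in the report (cache level: L1, tx: INSN, mem-tx: IRD).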

> Which leads me to another question. I have mcelog running, but it
> doesn't seem to receive the error. With my previous Intel HW and an
> older kernel, mcelog received MCE errors (trip temperature). But
> since the kernel now decodes those messages itself, that doesn't
> seem to happen anymore. mcelog is silent, but now I've seen the
> above message on all my consoles.

Yes, on AMD the errors get decoded in the kernel instead of going
through mcelog.

> So now I have two questions:
>
> - First, if the error is something I should ask AMD about,

Not really, it is a single bit flip which got corrected, simply watch
out if you get more of those.
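
One way to keep an eye on this is to count how often each CPU/bank pair shows up in the kernel log. This is just a sketch under assumptions: it expects the log text (e.g. the output of dmesg) in the same line format as the report above, with the CPU number and the MCx_STATUS dump on one line, and count_mce_events is a made-up helper name.

```python
import re

# Matches the "[Hardware Error]" line that carries the CPU number and the
# status register dump, and captures CPU, bank and the raw status value.
STATUS_RE = re.compile(
    r"\[Hardware Error\]: CPU:(\d+)\s+MC(\d+)_STATUS\[[^\]]*\]: (0x[0-9a-fA-F]+)"
)


def count_mce_events(log_text):
    """Hypothetical helper: return {(cpu, bank): number of logged errors}."""
    counts = {}
    for m in STATUS_RE.finditer(log_text):
        key = (int(m.group(1)), int(m.group(2)))
        counts[key] = counts.get(key, 0) + 1
    return counts


sample = (
    "[ 8395.999581] [Hardware Error]: CPU:0\t"
    "MC1_STATUS[-|CE|MiscV|-|AddrV|-|-]: 0x8c00002000010151\n"
    "[ 8395.999586] [Hardware Error]: MC1_ADDR: 0x0000ffffa00e1203\n"
)
print(count_mce_events(sample))
```

If the count for the same CPU and bank keeps growing over time, that is the case to worry about; a single event like the one above is not.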

> - Second, if the kernel could mention that it is a recoverable
> error. And if so, and if such errors aren't something to panic about
> (e.g. it isn't unusual to receive them), if the kernel could output
> that message with another priority.

As I said above, it got corrected. If it were critical, it would've
either panicked or you wouldn't have seen it at all (probably only after
reboot).

HTH.

--
Regards/Gruss,
Boris.