Re: PROBLEM: mce: [Hardware Error] from dmesg -l emerg

From: Luck, Tony
Date: Mon May 21 2018 - 12:03:23 EST


On Mon, May 21, 2018 at 05:31:52PM +0530, Jeffrin Thalakkottoor wrote:
> > Ok, but please do not top-post.
>
> Ok
>
> > Looks like mcelog has trouble decoding this. Have you updated mcelog to
> > the latest version in your distro?
> .
> mcelog 153+dfsg-1

So this is looking like another case where an error is
logged during BIOS bringup, and Linux finds the error
when it scans all machine check banks during boot.

The earlier logs you sent showed a value of ee0000000040110b
in the machine check bank status register. Not sure why
mcelog had trouble with this(*).

Upper bits say: VALID OVER UC MISCV ADDRV PCC

Low 16 (MCACOD) bits say: FILTER CACHE ERR GENERIC LEVEL=3

So BIOS did something to trigger some issues in the L3
cache (more than once since the overflow and filter bits
are both set).
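
For anyone who wants to check the decode by hand, here is a minimal
sketch. It assumes the architectural IA32_MCi_STATUS layout (flag bits
63..57, MCACOD in the low 16 bits) and the compound "cache hierarchy"
encoding 000F 0001 RRRR TTLL. The function name and the dictionary keys
are made up for illustration; this is not how mcelog does it. With these
bit positions, bit 57 (PCC) also decodes as set.

```python
# Flag bits of IA32_MCi_STATUS (architectural positions, assumed here).
STATUS_FLAGS = [
    (63, "VALID"),
    (62, "OVER"),
    (61, "UC"),
    (60, "EN"),
    (59, "MISCV"),
    (58, "ADDRV"),
    (57, "PCC"),
]


def decode_status(status):
    """Hypothetical helper: split a bank status value into flags and MCACOD fields."""
    flags = [name for bit, name in STATUS_FLAGS if (status >> bit) & 1]
    mcacod = status & 0xFFFF
    result = {"flags": flags, "mcacod": mcacod}
    # Cache hierarchy errors match 000F 0001 RRRR TTLL (F = filter bit 12).
    if (mcacod & 0xEF00) == 0x0100:
        result["filter"] = bool(mcacod & 0x1000)
        result["request"] = (mcacod >> 4) & 0xF   # RRRR, 0 = generic ERR
        result["type"] = (mcacod >> 2) & 0x3      # TT, 2 = generic
        result["level"] = mcacod & 0x3            # LL, raw value
    return result


info = decode_status(0xee0000000040110b)
print(info["flags"], hex(info["mcacod"]), info.get("level"))
```

The raw LL value is printed as a number; how a given tool names that
value is left out on purpose.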

I think (but am not 100% sure because I don't have an
internal decoder that knows about this specific CPU model)
that the error was a write-back to MMIO (this matches other
cases where we've seen BIOS trigger some error and leave the
logs for Linux to find at boot). It's not quite the same
because the address logged for you is 0x160000080, where the
previous cases had addresses below 4GB. But some platforms
include MMIO above 4GB, so this is still plausible.
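
The arithmetic behind "above 4GB" is quick to check. This is only a
sketch, assuming the logged address is hexadecimal; the variable names
are illustrative.

```python
FOUR_GB = 1 << 32            # 0x100000000

addr = 0x160000080           # address from the log, read as hex (assumption)
above_4gb = addr >= FOUR_GB  # True when the address is outside the low 4GB
addr_gib = addr / (1 << 30)  # same address expressed in GiB

print(above_4gb, addr_gib)
```

So the address sits roughly 1.5GB past the 4GB boundary.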

Advice we have given before is to attempt to log a bug
against the BIOS with the vendor of your system. But the
last person to try this reported no success.

Or, you could ignore it. It appears to not have any
side effects.

-Tony

(*) Can you send a snip from the raw dmesg output that starts
a couple of lines before:


... [Hardware Error]: CPU 0: Machine Check: 0 Bank: 5 ...

and continues a couple of lines past

... [Hardware Error]: PROCESSOR 0:306d4 ...

and I'll take a look at why mcelog choked.
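
If cutting that by hand is a nuisance, something like the following
sketch would do it. The helper name `snip` and the way the log gets
loaded are assumptions; the two marker strings are the ones quoted
above.

```python
def snip(lines, start_marker, end_marker, context=2):
    """Return the lines from `context` lines before the first line containing
    start_marker through `context` lines after the next line containing
    end_marker. Returns [] if either marker is missing."""
    start = next((i for i, l in enumerate(lines) if start_marker in l), None)
    if start is None:
        return []
    end = next((i for i in range(start, len(lines)) if end_marker in lines[i]), None)
    if end is None:
        return []
    return lines[max(0, start - context):end + context + 1]


# Usage sketch (dmesg.txt is a hypothetical file holding saved dmesg output):
#   lines = open("dmesg.txt").read().splitlines()
#   print("\n".join(snip(lines, "Machine Check: 0 Bank: 5", "PROCESSOR 0:306d4")))
```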