RE: [PATCH v3] x86/mce: Try printing all machine check banks known before panic

From: Luck, Tony
Date: Fri Nov 21 2014 - 17:00:30 EST


>> That means there were no VALID=1, EN=1, S=1 errors anywhere. But there
>> might be some other things logged that would help us understand.
>
> By "other things" you mean other MCEs?

Logs with EN=0 and/or S=0. They may have interesting information, and have
a good chance of being useful (especially if they are from some functional
unit that isn't part of the buggy behavior). Bad data flowing through multiple
functional units can leave a trail of logged entries (perhaps as many as four
units may see and log a single error). Only one of them should signal the machine
check (to avoid a shutdown caused by a nested machine check).
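
To make the distinction concrete, here is a rough user-space sketch (not
the patch itself) of how banks could be sorted into "signaled" versus
"logged only". The bit positions are the architectural ones for
IA32_MCi_STATUS; the enum and the helpers classify_bank() and
count_logged_only() are hypothetical names made up for illustration, and
the S bit is assumed to be meaningful (i.e. the CPU advertises software
error recovery).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Architectural bit positions in IA32_MCi_STATUS. */
#define MCI_STATUS_VAL (1ULL << 63)	/* bank holds a valid log */
#define MCI_STATUS_EN  (1ULL << 60)	/* error reporting was enabled */
#define MCI_STATUS_S   (1ULL << 56)	/* this error signaled the #MC */

/* Hypothetical classification, for illustration only. */
enum bank_class {
	BANK_EMPTY,		/* VAL=0: nothing logged */
	BANK_LOGGED_ONLY,	/* VAL=1 but EN=0 and/or S=0 */
	BANK_SIGNALED,		/* VAL=1, EN=1, S=1 */
};

static enum bank_class classify_bank(uint64_t status)
{
	if (!(status & MCI_STATUS_VAL))
		return BANK_EMPTY;
	if ((status & MCI_STATUS_EN) && (status & MCI_STATUS_S))
		return BANK_SIGNALED;
	return BANK_LOGGED_ONLY;
}

/* Count banks that logged something without signaling the machine check. */
static size_t count_logged_only(const uint64_t *status, size_t nbanks)
{
	size_t i, n = 0;

	assert(status != NULL || nbanks == 0);
	for (i = 0; i < nbanks; i++)
		if (classify_bank(status[i]) == BANK_LOGGED_ONLY)
			n++;
	return n;
}
```

With this split, the "no VALID=1, EN=1, S=1 errors anywhere" case is the
one where no bank comes back as BANK_SIGNALED, and the BANK_LOGGED_ONLY
banks are the ones still worth printing before the panic.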

> Oh, cpu errata. So this would mean that we can't even rely on the
> contents of the MCA banks, can we?
>
> In any case, is any of the information in the MCA banks in such cases
> even usable then? Because if not, we're definitely barking up the wrong
> tree...

See above - I think even if a bug in the core means the right bits aren't
being set in the MCi_STATUS register - we could still get good data from
devices out in the uncore.

-Tony