RE: [PATCH v3] x86/mce: Try printing all machine check banks known before panic

From: Luck, Tony
Date: Wed Nov 19 2014 - 18:34:17 EST


>> No information besides that it is a machine check. This happens in two cases:
>> 1) The CPU logs the error with the MCi_STATUS.EN bit set to zero, and Linux
>> ignores EN=0 entries (as it should).

> Well, I guess we shouldn't anymore. Apparently hw forgets to set the
> bit when raising an MCE so then we should ignore it too in mce-severity
> and delete that piece or grade it as higher severity based on, I dunno,
> b0rked hardware family/model/stepping or whatever bit we set...
>
>	MCESEV(
>		NO, "Not enabled",
>		BITCLR(MCI_STATUS_EN)
>	),

The SDM has this to say about EN=0 (in section 15.10.4.1 of volume 3B):

When the EN flag is zero but the VAL and UC flags are one in
the IA32_MCi_STATUS register, the reported uncorrected error
in this bank is not enabled. As uncorrected errors with the
EN flag = 0 are not the source of machine check exceptions,
the MCE handler should log and clear non-enabled errors when
the S bit is set and should continue searching for enabled
errors from the other IA32_MCi_STATUS registers. Note that
when IA32_MCG_CAP [24] is 0, any uncorrected error condition
(VAL =1 and UC=1) including the one with the EN flag cleared
are fatal and the handler must signal the operating system to
reset the system. For the errors that do not generate machine
check exceptions, the EN flag has no meaning.

Note the "should log and clear". We currently only clear, so we just need to
shuffle some code in mce.c to add the logging.
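
As a rough illustration, the per-bank decision described in that SDM paragraph
could be sketched as below. This is standalone C, not code from mce.c:
classify_bank() and enum bank_action are hypothetical names made up for the
example, and the EN=0/S=0 branch is an assumption because the quoted text does
not say what to do there. The bit positions are the architectural ones for
IA32_MCi_STATUS and IA32_MCG_CAP.

```c
#include <stdint.h>

/* IA32_MCi_STATUS bits (architectural positions) */
#define MCI_STATUS_VAL	(1ULL << 63)	/* register contents are valid */
#define MCI_STATUS_UC	(1ULL << 61)	/* uncorrected error */
#define MCI_STATUS_EN	(1ULL << 60)	/* error reporting was enabled */
#define MCI_STATUS_S	(1ULL << 56)	/* signaled; defined when MCG_CAP[24]=1 */

/* IA32_MCG_CAP[24]: software error recovery supported */
#define MCG_SER_P	(1ULL << 24)

/* Hypothetical result type, for illustration only. */
enum bank_action {
	BANK_SKIP,		/* nothing to do, look at the next bank */
	BANK_LOG_AND_CLEAR,	/* non-enabled error: log, clear, keep searching */
	BANK_GRADE,		/* hand to the normal severity grading */
	BANK_FATAL,		/* signal the OS to reset the system */
};

static enum bank_action classify_bank(uint64_t mcg_cap, uint64_t status)
{
	if (!(status & MCI_STATUS_VAL))
		return BANK_SKIP;

	/* The EN=0 rules quoted above apply to uncorrected errors only. */
	if (!(status & MCI_STATUS_UC))
		return BANK_GRADE;

	/* MCG_CAP[24]=0: any VAL=1, UC=1 entry is fatal, EN set or not. */
	if (!(mcg_cap & MCG_SER_P))
		return BANK_FATAL;

	if (status & MCI_STATUS_EN)
		return BANK_GRADE;

	/* EN=0: this bank is not the source of the machine check. */
	if (status & MCI_STATUS_S)
		return BANK_LOG_AND_CLEAR;

	/* EN=0 and S=0: not covered by the quoted text; assume skip. */
	return BANK_SKIP;
}
```

The point of the BANK_LOG_AND_CLEAR case is that the entry gets logged before
it is cleared, instead of being silently dropped.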

But we still need something like Rui's patch - calling mcelog() doesn't ensure that
we see something on the console about the possible cause of the problem.

-Tony