RE: [PATCH v3 1/2] x86, mce, severity: extend the mce_severity mechanism to handle UCNA/DEFERRED error

From: Luck, Tony
Date: Tue Nov 11 2014 - 13:44:27 EST


>> The bank 7 error reported as severity 0 because EN=0 ... so we took no action for it.
>
> How come EN is 0? Bank7 error reporting is not enabled? Why? Or the
> error injection thing doesn't do it?

The "EN" bit is poorly named, and not well documented. Here's a clip from the SDM:

One of the bullets in 15.10.4.1, Machine-Check Exception Handler for Error Recovery:

When the EN flag is zero but the VAL and UC flags are one in the
IA32_MCi_STATUS register, the reported uncorrected error in this bank
is not enabled. As uncorrected errors with the EN flag = 0 are not the
source of machine check exceptions, the MCE handler should log and clear
non-enabled errors when the S bit is set and should continue searching
for enabled errors from the other IA32_MCi_STATUS registers. Note that
when IA32_MCG_CAP [24] is 0, any uncorrected error condition (VAL =1
and UC=1) including the one with the EN flag cleared are fatal and the
handler must signal the operating system to reset the system. For the
errors that do not generate machine check exceptions, the EN flag has
no meaning. See Chapter 19: Table 19-15 to find the errors that do not
generate machine check exceptions.

Unfortunately the reference to chapter 19 is stale (that is now all about
performance monitoring - I'll log a bug with the SDM editor to find the
right reference and fix this).

What this is trying to say is that the "EN" bit enables signaling of
machine checks, so it only has meaning when checking banks from the
machine check handler. Errors that are logged but not signaled, or signaled
as CMCI, will have MCi_STATUS.EN=0.
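
To make the SDM paragraph concrete, here is a rough sketch of the per-bank
decision it describes for the machine check handler. The bit positions are
the architectural ones (VAL=63, UC=61, EN=60, S=56, MCG_CAP[24] for software
error recovery); the enum, the macro names and check_bank() are made up for
illustration and are not what the kernel uses. The quoted text doesn't say
what to do with an EN=0 error when S is clear, so skipping it here is my
assumption.

```c
#include <stdint.h>

/* Architectural bit positions; the names are illustrative only. */
#define MCG_SER_P (1ULL << 24)  /* IA32_MCG_CAP[24]: software error recovery */
#define ST_VAL    (1ULL << 63)  /* IA32_MCi_STATUS.VAL */
#define ST_UC     (1ULL << 61)  /* IA32_MCi_STATUS.UC  */
#define ST_EN     (1ULL << 60)  /* IA32_MCi_STATUS.EN  */
#define ST_S      (1ULL << 56)  /* IA32_MCi_STATUS.S   */

/* Hypothetical result codes for one bank, as seen from the MCE handler. */
enum bank_action {
	BANK_SKIP,           /* nothing to handle here, keep scanning banks */
	BANK_LOG_AND_CLEAR,  /* non-enabled UC error: log, clear, keep scanning */
	BANK_FATAL,          /* no recovery support: any valid UC error is fatal */
	BANK_ENABLED_UC,     /* enabled UC error: candidate source of this #MC */
};

static enum bank_action check_bank(uint64_t mcg_cap, uint64_t status)
{
	/* Only valid, uncorrected errors are of interest. */
	if ((status & (ST_VAL | ST_UC)) != (ST_VAL | ST_UC))
		return BANK_SKIP;

	/* MCG_CAP[24] == 0: fatal regardless of the EN flag. */
	if (!(mcg_cap & MCG_SER_P))
		return BANK_FATAL;

	/* EN == 0: this bank did not raise the machine check. */
	if (!(status & ST_EN))
		return (status & ST_S) ? BANK_LOG_AND_CLEAR : BANK_SKIP;

	return BANK_ENABLED_UC;
}
```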


>> The bank 3 error got past that hurdle, then through the next BIT(8) set indicates a
>> cache error. Fell at the last check because ADDRV=0.
>
> I guess you could tweak the injection path to write in a default address
> so that that check gets bypassed...

I don't think this is an injection artifact. I think on this processor the mid-level-cache
just isn't providing an address in this case. It doesn't help to make one up - our whole
game plan is to offline a page with a UC error - and we must have an address to know
which page to offline.
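
In other words, something like the following, where the page frame number
only exists if the bank gave us a valid address. This is just a sketch:
error_pfn() is a hypothetical helper, 4 KiB pages are assumed, and a real
implementation would also have to mask off address bits that the bank marks
as not valid.

```c
#include <stdbool.h>
#include <stdint.h>

#define ST_ADDRV        (1ULL << 58)  /* IA32_MCi_STATUS.ADDRV */
#define SKETCH_PAGE_SHIFT 12          /* assumption: 4 KiB pages */

/*
 * Hypothetical helper: derive the page frame number to offline.
 * Returns false when the bank supplied no address, in which case
 * there is no page we can point at.
 */
static bool error_pfn(uint64_t status, uint64_t addr, uint64_t *pfn)
{
	if (!(status & ST_ADDRV))
		return false;

	*pfn = addr >> SKETCH_PAGE_SHIFT;
	return true;
}
```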

Perhaps the severity table entries for UCNA and DEFERRED errors should look to see
if ADDRV is set - if not, don't report this as UCNA/DEFERRED?
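
Roughly what I have in mind, as a standalone mask/result table rather than
the real severity table: ADDRV is added to both the mask and the expected
result of the UCNA and DEFERRED rules, so a record without an address falls
through to a lower grade. The struct, the enum and grade() are invented for
this sketch; the bit positions are the architectural ones (bit 44 being the
AMD deferred-error bit).

```c
#include <stddef.h>
#include <stdint.h>

#define ST_VAL      (1ULL << 63)
#define ST_UC       (1ULL << 61)
#define ST_ADDRV    (1ULL << 58)
#define ST_S        (1ULL << 56)
#define ST_AR       (1ULL << 55)
#define ST_DEFERRED (1ULL << 44)

/* Hypothetical grades; SEV_KEEP means "log it, take no further action". */
enum sev { SEV_NO, SEV_KEEP, SEV_UCNA, SEV_DEFERRED };

/* A rule matches when (status & mask) == result. First match wins. */
struct sev_rule {
	uint64_t mask;
	uint64_t result;
	enum sev sev;
};

static const struct sev_rule rules[] = {
	/* Not valid: nothing to report. */
	{ ST_VAL, 0, SEV_NO },
	/* Deferred: UC=0, DEFERRED=1, and only if an address is present. */
	{ ST_UC | ST_DEFERRED | ST_ADDRV, ST_DEFERRED | ST_ADDRV, SEV_DEFERRED },
	/* UCNA: UC=1, S=0, AR=0, and only if an address is present. */
	{ ST_UC | ST_S | ST_AR | ST_ADDRV, ST_UC | ST_ADDRV, SEV_UCNA },
	/* Catch-all for everything else. */
	{ 0, 0, SEV_KEEP },
};

static enum sev grade(uint64_t status)
{
	size_t i;

	for (i = 0; i < sizeof(rules) / sizeof(rules[0]); i++) {
		if ((status & rules[i].mask) == rules[i].result)
			return rules[i].sev;
	}
	return SEV_KEEP; /* not reached: the last rule always matches */
}
```

With this, a bank 3 style record (VAL=1, UC=1, ADDRV=0) grades as SEV_KEEP
instead of SEV_UCNA, so the caller never tries to offline a page it cannot
identify.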

-Tony