On Thu, Oct 09, 2014 at 11:53:39AM -0500, Aravind Gopalakrishnan wrote:
> How do you mean "last error"?
> The interrupt is only fired upon overflow..

And? Think about it, what is causing the overflow? A CE, right?
There was even a call to machine_check_poll() there which we removed,
but for another reason. In any case, you should have the error signature
in the MCA banks of the last error causing the overflow, right?
This is what I mean with last error.

> That's right. Might as well remove it.

However(!),...

> CE error if collected through polling gives proper decoding info. So,
> why should this be any different for the same CE error for which an
> interrupt is generated on crossing a threshold?

... we're currently using a special signature to signal the overflow
with the K8_MCE_THRESHOLD_BASE thing. You simply report a special bank
and this way you can tell userspace that this is an overflow error. I
think that was the reason behind the software-defined banks.

Now, we can also drop that and simply log a normal error but make sure
MASK_OVERFLOW_HI is passed onto userspace so that it can see that the
error is an overflow error. I.e., something like this:

	mce_setup(&m);
	// rdmsrl(MSR_IA32_MCG_STATUS, m.mcgstatus); - not sure about this one - we're not looking at MCGSTATUS for CEs
	// rdmsrl(address, m.misc); - this MSR can be saved too as we're reading
	// the MISC register already.
	rdmsrl(MSR_IA32_MCx_STATUS(bank), m.status);
	m.bank = bank;
	mce_log(&m);

so in the end it'll be something like this:

	mce_setup(&m);
	/* widen 'high' before shifting, otherwise the upper half is lost */
	m.misc = ((u64)high << 32) | low;
	rdmsrl(MSR_IA32_MCx_STATUS(bank), m.status);
	m.bank = bank;
	mce_log(&m);

so I'm still on the fence about what we want to do and am expecting
arguments. I like the last one more because it is simpler and tools
don't need to know about the software-defined banks.
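
To make the second variant a bit more concrete from the tools' side, here
is a rough userspace sketch of how an overflow record could be told apart
from an ordinary one using only the logged MISC value. Everything in it is
an assumption for illustration: struct mce_record is a hypothetical,
trimmed-down stand-in for struct mce, the helper names are made up, and
the mask value is what I believe MASK_OVERFLOW_HI is defined as in
mce_amd.c (bit 16 of the high half), so check it against the tree.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumption: mirrors MASK_OVERFLOW_HI from mce_amd.c (high half, bit 16). */
#define MASK_OVERFLOW_HI 0x00010000u

/* Hypothetical stand-in for the few struct mce fields used in this sketch. */
struct mce_record {
	uint64_t status;
	uint64_t misc;
	uint8_t  bank;
};

/* Join the two 32-bit MSR halves; 'high' is widened so the shift is 64-bit. */
static uint64_t misc_from_halves(uint32_t high, uint32_t low)
{
	return ((uint64_t)high << 32) | low;
}

/* True if the logged MISC value has the threshold overflow bit set. */
static bool is_threshold_overflow(const struct mce_record *m)
{
	return ((uint32_t)(m->misc >> 32) & MASK_OVERFLOW_HI) != 0;
}
```

With something along these lines a tool only has to test one bit in misc
and can decode the rest of the record like any other CE, without knowing
about the software-defined banks.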