RE: [PATCH v2 6/6] x86/mce: Dynamically register default MCE handler

From: Ghannam, Yazen
Date: Tue Jan 07 2020 - 23:24:41 EST

> -----Original Message-----
> From: Borislav Petkov <bp@xxxxxxxxx>
> Sent: Friday, January 3, 2020 5:03 PM
> To: Jan H. Schönherr <jschoenh@xxxxxxxxx>
> Cc: Ghannam, Yazen <Yazen.Ghannam@xxxxxxx>; linux-kernel@xxxxxxxxxxxxxxx; linux-edac@xxxxxxxxxxxxxxx; Tony Luck
> <tony.luck@xxxxxxxxx>; Thomas Gleixner <tglx@xxxxxxxxxxxxx>; Ingo Molnar <mingo@xxxxxxxxxx>; H. Peter Anvin <hpa@xxxxxxxxx>;
> x86@xxxxxxxxxx
> Subject: Re: [PATCH v2 6/6] x86/mce: Dynamically register default MCE handler
> On Fri, Jan 03, 2020 at 04:07:22PM +0100, Jan H. Schönherr wrote:
> > On the other hand, I'm starting to question the whole logic to "only print
> > the MCE if nothing else in the kernel has a handler registered". Is that
> > really how it should be?
> Yes, it should be this way: if there are no consumers, all error
> information should go to dmesg so that it gets printed at least.
> > For example, there are handlers that filter for a specific subset of
> > MCEs. If one of those is registered, we're losing all information for
> > MCEs that don't match.
> Probably but I don't think there's an example of an actual system where
> there are no other MCE consumers registered. Not if its users care about
> RAS. This default fallback was added for the hypothetical case anyway.
> > A possible solution to the latter would be to have a "handled" or "printed"
> > flag within "struct mce" and print the MCE based on that in the default
> > handler. What do you think?
> Before we go and fix whatever, we need to define what exactly we're
> fixing. Is there an actual system you're seeing this on or is this
> something that would never happen in reality? Because if the latter, I
> don't really care TBH. As in, there's more important stuff to take care
> of first.
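
For illustration only, the "handled"/"printed" flag idea quoted above could
be modeled roughly as below. This is a user-space sketch, not kernel code and
not a proposed patch: struct mce_model, its "handled" field, bank4_consumer(),
default_consumer() and deliver_mce() are all hypothetical names, and the real
notifier chain is replaced by a plain array of function pointers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-in for struct mce; only the fields this sketch needs. */
struct mce_model {
	uint64_t status;
	int bank;
	bool handled;	/* hypothetical flag: set by a consumer that took the record */
};

typedef void (*mce_consumer_fn)(struct mce_model *m);

/* Example filtering consumer: only claims records from bank 4. */
static void bank4_consumer(struct mce_model *m)
{
	if (m->bank != 4)
		return;		/* not ours: leave "handled" untouched */
	m->handled = true;
}

/*
 * Default consumer: meant to run last. It prints only the records that no
 * earlier consumer claimed, and returns true if it printed.
 */
static bool default_consumer(struct mce_model *m)
{
	if (m->handled)
		return false;

	printf("[Hardware Error]: bank %d status 0x%016llx\n",
	       m->bank, (unsigned long long)m->status);
	m->handled = true;
	return true;
}

/* Run every registered consumer in order, then the default fallback. */
static bool deliver_mce(struct mce_model *m, mce_consumer_fn *consumers, size_t n)
{
	assert(m != NULL);

	for (size_t i = 0; i < n; i++)
		consumers[i](m);

	return default_consumer(m);
}
```

In this model the fallback depends on the flag rather than on whether any
other consumer is registered, so a record that every filter skips would still
be printed. A consumer that merely queues records for a reader that may not
exist would presumably have to leave the flag unset for that to help.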

I've encountered an issue where EDAC didn't load (either due to a bug or
missing hardware enablement) and the MCE got swallowed by the mcelog notifier.
The mcelog utility wasn't in use, so there was no record of the MCE. This could
be considered a system configuration issue, though, one that can be resolved with a
bit of tweaking. But maybe we can find a solution to prevent something like