Re: [PATCH v2 1/6] x86/mce: Take action on UCNA/Deferred errors again

From: Borislav Petkov
Date: Sat Jan 11 2020 - 08:18:31 EST


On Fri, Jan 10, 2020 at 10:45:33AM -0800, Luck, Tony wrote:
> I totally agree that counting notifiers is clumsy. Also less than
> ideal is the concept that any notifier on the chain can declare:
> "I fixed it"
> and prevent any other notifiers from even seeing it. Well the concept
> is good, but it is overused.

But why can't we use it?

Don't get me wrong: I'm simply following my KISS approach and going for
the simplest scheme that does the job. So, do you see a use case where
the whole error handling chain would need more sophisticated handling?

> I think we may do better with a field in the "struct mce" that is being
> passed to each where notifiers can wiggle some bits (semantics to be
> defined later) which can tell subsequent notifiers what sort of actions
> have been taken.
> E.g. the SRAO/UCNA notifier can say "I took this page offline"
> the dev_mcelog one can say "I think I handed it to a process that has /dev/mcelog open"
> EDAC drivers can say "I decoded the address and printed something"
> CEC can say: "I silently counted this corrected error", or "error exceeded
> threshold and I took the page offline".
>
> The default notifier can print to console if nobody set a bit to say
> that the error had been somehow logged.

That idea is good and I'll gladly take patches for it so if you wanna do
it...
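
Purely as a strawman for the bits idea quoted above, here is a small
userspace sketch. Nothing in it is existing kernel API: struct
mce_sketch, the SK_* flag names and sk_run_chain() are all made up for
illustration, and which bits count as "somehow logged" is an assumption,
since the semantics are still to be defined.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical flag bits, one per kind of action a notifier took. */
#define SK_HANDLED_OFFLINE (1u << 0) /* SRAO/UCNA: page taken offline */
#define SK_HANDLED_MCELOG  (1u << 1) /* handed to a /dev/mcelog reader */
#define SK_HANDLED_EDAC    (1u << 2) /* EDAC decoded and printed it */

/* Assumption: only these bits mean "the error was logged somewhere". */
#define SK_LOGGED_MASK (SK_HANDLED_MCELOG | SK_HANDLED_EDAC)

/* Stand-in for struct mce, reduced to what the sketch needs. */
struct mce_sketch {
	uint64_t addr;    /* address of the error */
	uint32_t handled; /* bits set by the notifiers */
};

typedef void (*sk_notifier_fn)(struct mce_sketch *m);

/* Each notifier only records what it did; none of them stops the chain. */
static void sk_offline(struct mce_sketch *m) { m->handled |= SK_HANDLED_OFFLINE; }
static void sk_mcelog(struct mce_sketch *m)  { m->handled |= SK_HANDLED_MCELOG; }
static void sk_edac(struct mce_sketch *m)    { m->handled |= SK_HANDLED_EDAC; }

/*
 * Call every notifier, then act as the default notifier: print to the
 * console only if nobody set a "logged" bit. Returns 1 if it printed.
 */
static int sk_run_chain(struct mce_sketch *m, sk_notifier_fn *chain, size_t n)
{
	size_t i;

	assert(m != NULL);

	for (i = 0; i < n; i++)
		chain[i](m);

	if (m->handled & SK_LOGGED_MASK)
		return 0;

	printf("mce: unlogged error at 0x%llx\n", (unsigned long long)m->addr);
	return 1;
}
```

With a chain of only sk_offline(), the default step still prints because
taking the page offline does not log anything; adding sk_edac() or
sk_mcelog() to the chain suppresses the print.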

Thx.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette