Re: [PATCH 4/5] x86/mce: Fix all mce notifiers to update the mce->handled bitmask

From: Andy Lutomirski
Date: Thu Feb 13 2020 - 17:27:46 EST


On Thu, Feb 13, 2020 at 2:19 PM Luck, Tony <tony.luck@xxxxxxxxx> wrote:
>
> On Thu, Feb 13, 2020 at 06:03:08PM +0100, Borislav Petkov wrote:
> > On Wed, Feb 12, 2020 at 12:46:51PM -0800, Tony Luck wrote:
> > > If the handler took any action to log or deal with the error, set
> > > a bit in mce->handled so that the default handler at the end of
> > > the machine check chain can see what has been done.
> > >
> > > [!!! What to do about NOTIFY_STOP ... any handler that returns this
> > > value short-circuits calling subsequent entries on the chain. In
> > > some cases this may be the right thing to do ... but in others we
> > > really want to keep calling other functions on the chain]
> >
> > Yes, we can kill that NOTIFY_STOP thing in the mce code since it is
> > nasty.
>
> Well, there are places where we want to keep NOTIFY_STOP.

I very very strongly disagree.

>
> 1) Default case for CEC. We want it to "hide" the corrected error.
> That was one of the main goals for CEC. We've discussed cases
> where CEC shouldn't hide (when internal threshold exceeded and
> it tries to take a page offline ... probably something related to
> CMCI storms ... though we didn't really come to any conclusion)

Then put this logic in do_machine_check() or in some sensible place
that it calls via some ops structure or directly. Don't hide it in
some incomprehensible, possibly nondeterministic place in a notifier
chain.
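
To make that concrete, here is a rough userspace sketch of the "ops structure" shape. Every name in it (mce_sketch, mce_policy_ops, cec_absorb, handle_mce_sketch, SKETCH_HANDLED_CEC) is made up for illustration and is not the actual mce code. It only shows that the "hide this corrected error" decision can be one explicit, visible call instead of a side effect of chain ordering:

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct mce; only the fields this sketch needs. */
struct mce_sketch {
	bool corrected;        /* true for a corrected error */
	unsigned int handled;  /* bitmask of actions already taken */
};

#define SKETCH_HANDLED_CEC 0x1u  /* made-up bit value */

/* Hypothetical ops structure: one explicit hook, not a chain entry. */
struct mce_policy_ops {
	/* Return true if the error was absorbed and should not be reported. */
	bool (*absorb_corrected)(struct mce_sketch *m);
};

/* Stand-in for the CEC default case: absorb every corrected error. */
static bool cec_absorb(struct mce_sketch *m)
{
	if (!m->corrected)
		return false;
	m->handled |= SKETCH_HANDLED_CEC;
	return true;
}

static const struct mce_policy_ops sketch_ops = {
	.absorb_corrected = cec_absorb,
};

/*
 * The policy decision is made here, in one obvious place.
 * Returns true if the error should go on to the notifiers.
 */
static bool handle_mce_sketch(const struct mce_policy_ops *ops,
			      struct mce_sketch *m)
{
	if (ops && ops->absorb_corrected && ops->absorb_corrected(m))
		return false;
	return true;
}
```

With this shape, a corrected error passed to handle_mce_sketch() with sketch_ops comes back as "do not report" with the CEC bit set in handled, and anyone reading the handler can see why.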

> 2) Errata. Perhaps a vendor/platform specific function at the head
> of the notify chain that weeds out errors that should never have
> been reported.

No, do this before the notifier chain please.
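
Roughly like this, as a userspace sketch. The names (mce_rec, is_spurious_sketch, report_mce_sketch) and the bank 0xff quirk are invented for illustration; the real filter would be whatever the vendor code needs. The point is that the errata check runs first, and the loop over notifiers has no early exit, so no entry can hide an error from the ones after it:

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct mce. */
struct mce_rec {
	unsigned int bank;
	unsigned int handled;  /* bitmask set by notifiers */
};

typedef void (*mce_notifier_fn)(struct mce_rec *m);

/* Made-up erratum: pretend bank 0xff reports errors that never happened. */
static bool is_spurious_sketch(const struct mce_rec *m)
{
	return m->bank == 0xff;
}

/* Two example notifiers; each records what it did in the bitmask. */
static void log_notifier_sketch(struct mce_rec *m)
{
	m->handled |= 0x1u;
}

static void edac_notifier_sketch(struct mce_rec *m)
{
	m->handled |= 0x2u;
}

/*
 * Filter errata first, then call every notifier unconditionally.
 * Returns the number of notifiers that were called.
 */
static size_t report_mce_sketch(struct mce_rec *m,
				const mce_notifier_fn *chain, size_t n)
{
	size_t i;

	if (is_spurious_sketch(m))
		return 0;  /* dropped before the chain ever sees it */

	for (i = 0; i < n; i++)
		chain[i](m);  /* no NOTIFY_STOP equivalent */

	return n;
}
```

A record from the quirky bank never reaches the notifiers and its handled mask stays untouched; any other record is seen by both, and the mask shows exactly which ones acted on it.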

AMA Capital Management, LLC