Re: [RFC PATCH 0/5] New way to track mce notifier chain actions

From: Andy Lutomirski
Date: Thu Feb 13 2020 - 00:52:53 EST


On Wed, Feb 12, 2020 at 3:08 PM Luck, Tony <tony.luck@xxxxxxxxx> wrote:
>
> On Wed, Feb 12, 2020 at 12:46:47PM -0800, Tony Luck wrote:
> > Part 4 is where things are interesting and need a great deal more
> > thought. A bunch of things on the chain return NOTIFY_STOP which
> > prevents anything else on the chain from being run. For the moment
> > I ignored that semantic and added code everywhere to set the BIT
> > even though nobody else will see it. This is because I think at
> > least some of them should NOT be NOTIFY_STOP.
>
> NOTIFY_STOP is just one mechanism for preventing every function
> on the mce chain from reporting an error.
>
> The other bit I'd like to reconsider is edac_get_report_status().
> Back in the day we seemed to be paranoid about reporting the same
> error more than once via all the different reporting mechanisms.
>
> Since then I've had to track down numerous "Why didn't this error
> get reported?" questions that frequently resolved to "It was reported,
> but not in the place that you expected".
>
> So now my attitude is "Let's just log it everywhere so that
> whatever log the user is checking, they'll find the error"

I HATE notifier chains for exceptions, and I REALLY HATE NOTIFY_STOP.
I don't suppose we could rig something up so that they are simply
notifiers (for MCE and, eventually, for everything) and just outright
prevent them from modifying the processing?
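
To make that concrete, here is a rough userspace sketch of what I mean
by "simply notifiers". Every name in it is made up for illustration;
none of this is the existing notifier API. The callbacks get a const
view of the event and return void, so there is nothing like NOTIFY_STOP
for them to return and every observer always runs:

```c
#include <stddef.h>

/* Hypothetical event record; observers only get a const view of it. */
struct mce_event {
	unsigned long long status;
	unsigned long long addr;
};

/* Observers return void, so they cannot stop or alter processing. */
typedef void (*mce_observer_fn)(const struct mce_event *ev);

#define MAX_OBSERVERS 8

static mce_observer_fn observers[MAX_OBSERVERS];
static size_t nr_observers;

/* Register an observer; returns 0 on success, -1 if the table is full. */
static int mce_observer_register(mce_observer_fn fn)
{
	if (nr_observers == MAX_OBSERVERS)
		return -1;
	observers[nr_observers++] = fn;
	return 0;
}

/* Call every observer unconditionally; returns how many were called. */
static size_t mce_notify_observers(const struct mce_event *ev)
{
	size_t i;

	for (i = 0; i < nr_observers; i++)
		observers[i](ev);
	return nr_observers;
}

/* Two hypothetical loggers that only count what they have seen. */
static unsigned int edac_seen;
static unsigned int mcelog_seen;

static void edac_log(const struct mce_event *ev)
{
	(void)ev;
	edac_seen++;
}

static void mcelog_log(const struct mce_event *ev)
{
	(void)ev;
	mcelog_seen++;
}
```

With both loggers registered, one call to mce_notify_observers() reaches
both of them, which is the "log it everywhere" behavior Tony describes.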

As an example that particularly bothers me, do_debug():

	if (notify_die(DIE_DEBUG, "debug", regs, (long)&dr6, error_code,
		       SIGTRAP) == NOTIFY_STOP)
		goto exit;

There are all kinds of garbage hidden in there, and it's mostly
somewhere between slightly buggy and violently buggy. All this crap
should be open-coded.
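
By "open-coded" I mean something shaped roughly like the sketch below:
the handler names each consumer and the order in which it gets a look,
so the control flow is visible at the call site. The consumers and the
policy are invented for illustration and are not what do_debug() really
does; the only real facts used are that DR6 bit 14 is BS (single-step)
and bits 0-3 are B0-B3.

```c
#include <stdbool.h>

/* Hypothetical snapshot of the state a debug trap handler looks at. */
struct debug_state {
	unsigned long dr6;
};

#define DR6_BP_MASK	0xfUL		/* B0-B3: a debug register matched */
#define DR6_BS		(1UL << 14)	/* BS: single-step trap */

enum debug_outcome {
	DEBUG_HANDLED_BY_PROBE,
	DEBUG_HANDLED_BY_BREAKPOINT,
	DEBUG_SEND_SIGTRAP,
};

/* Hypothetical consumer: claims single-step traps. */
static bool singlestep_probe_claims(const struct debug_state *s)
{
	return (s->dr6 & DR6_BS) != 0;
}

/* Hypothetical consumer: claims hardware breakpoint hits. */
static bool hw_breakpoint_claims(const struct debug_state *s)
{
	return (s->dr6 & DR6_BP_MASK) != 0;
}

/*
 * Open-coded dispatch: who runs, in what order, and what ends
 * processing can all be read directly from this function.
 */
static enum debug_outcome handle_debug(const struct debug_state *s)
{
	if (singlestep_probe_claims(s))
		return DEBUG_HANDLED_BY_PROBE;
	if (hw_breakpoint_claims(s))
		return DEBUG_HANDLED_BY_BREAKPOINT;
	return DEBUG_SEND_SIGTRAP;
}
```

Anything that wants to short-circuit the SIGTRAP has to show up by name
in handle_debug(), instead of hiding behind a chain return value.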