Re: [PATCH 07/14] mce3: pass mce info to EDAC for decoding

From: Ingo Molnar
Date: Tue Aug 04 2009 - 10:45:52 EST



* Borislav Petkov <borislav.petkov@xxxxxxx> wrote:

> On Tue, Jul 21, 2009 at 08:51:28AM +0200, Andi Kleen wrote:
> > On Tue, Jul 21, 2009 at 12:41:34PM +0900, Hidetoshi Seto wrote:
> > > H. Peter Anvin wrote:
> > > > If you want modules to change the behavior, you're talking about a
> > > > *dynamic* change -- the call will point to different things at different
> > > > points in time -- so you need another mechanism, i.e. function pointers.
> > >
> > > Just FYI, machine check handler on ia64 has such function pointer.
> > >
> > > [arch/ia64/kernel/mca.c]
> > > > /* Function pointer for extra MCA recovery */
> > > > int (*ia64_mca_ucmc_extension)
> > > >         (void*,struct ia64_sal_os_state*)
> > > >         = NULL;
> >
> > A notifier would be a much more flexible solution. Function
> > pointers don't really work well with multiple users, which might
> > well happen here.
> >
> > However on the other hand I have some doubts it's really a good
> > idea to expose fatal MCEs to modules. MCE is a rather critical
> > code path (a bit similar to an oops), with the machine already
> > somewhat unstable in many cases, and if you allow arbitrary
> > modules to hook into that you risk long-term instability.
> >
> > So if a notifier is done, I would recommend limiting it to
> > corrected MCEs (machine_check_poll), not fatal ones.
>
> However, the idea is to decode _all_ MCEs so we could look into
> moving the decoding bits into the EDAC core or some other more
> appropriate place. Ingo?
>
> We could then reroute the non-fatal ones to EDAC for further decoding.

Yep, obviously the kernel entity with the wider view (which is EDAC
here) should interpret such errors and decide policy or do some good
default actions. The arch-level MCE code is basically a low-level
platform driver for the EDAC code.

And please don't do stupid notifiers. They are opaque and cause
various problems. Integrate the code for good - there's no technical
reason to keep it all separate.

Ingo
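
[Illustrative sketch, not part of the original message.] The three
mechanisms weighed in this thread (a single function pointer as on
ia64, a notifier chain restricted to corrected errors, and a direct
call into integrated decoding code) can be contrasted in a small
userspace C model. Every name here (struct mce_info, mce_decode_hook,
mce_chain_register, edac_decode_mce, and the dispatch helpers) is
hypothetical and chosen for illustration; none of it is the kernel's
actual MCE or EDAC API, and the "decoding" is a placeholder.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the kernel's MCE record; fields are illustrative. */
struct mce_info {
    int bank;                   /* reporting bank number */
    unsigned long long status;  /* raw status value, unused in this sketch */
    int fatal;                  /* nonzero for an uncorrected/fatal error */
};

/* Placeholder decoder: bank + 1 for corrected errors, negated for fatal ones. */
static int edac_decode_mce(const struct mce_info *m)
{
    return m->fatal ? -(m->bank + 1) : (m->bank + 1);
}

/* Option 1: one global function pointer. Only a single user can own it. */
static int (*mce_decode_hook)(const struct mce_info *) = NULL;

static int hook_dispatch(const struct mce_info *m)
{
    return mce_decode_hook ? mce_decode_hook(m) : 0;
}

/* Option 2: a notifier-style chain. Many users, but the callers are opaque. */
struct mce_notifier {
    int (*call)(const struct mce_info *);
    struct mce_notifier *next;
};

static struct mce_notifier *mce_chain = NULL;

static void mce_chain_register(struct mce_notifier *nb)
{
    assert(nb != NULL && nb->call != NULL);
    nb->next = mce_chain;       /* push onto the head of the list */
    mce_chain = nb;
}

/* Runs subscribers for corrected errors only; returns how many were called. */
static int chain_dispatch(const struct mce_info *m)
{
    int called = 0;

    if (m->fatal)
        return 0;               /* fatal errors are not exposed to subscribers */
    for (struct mce_notifier *nb = mce_chain; nb; nb = nb->next) {
        nb->call(m);
        called++;
    }
    return called;
}

/* Option 3: integrated code. The handler calls the decoder directly. */
static int direct_dispatch(const struct mce_info *m)
{
    return edac_decode_mce(m);  /* visible call site, handles all errors */
}
```

In this model the direct call is the only variant whose control flow
can be read off the handler itself, which is roughly the argument made
above for integrating the code rather than hooking it.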