Re: [NAK] Re: [PATCH -v2 9/9] ACPI, APEI, Generic Hardware Error Source POLL/IRQ/NMI notification type support

From: Ingo Molnar
Date: Mon Oct 25 2010 - 09:48:13 EST



* Andi Kleen <andi@xxxxxxxxxxxxxx> wrote:

> On Mon, Oct 25, 2010 at 02:55:31PM +0200, Ingo Molnar wrote:
> >
> > * Andi Kleen <andi@xxxxxxxxxxxxxx> wrote:
> >
> > > On Mon, Oct 25, 2010 at 01:15:30PM +0200, Ingo Molnar wrote:
> > >
> > > > > > > einj.c: it's about the 3rd separate 'error injection' concept that got
> > > > > > > introduced ...
> > > > > >
> > > > > > EINJ is a true platform feature, not just a software feature. We need to
> > > > > > support it to debug various hardware error features.
> > > > >
> > > > > Also having multiple error injecting interfaces is a good thing.
> > > >
> > > > It's never a good thing to have separate, vendor dependent interfaces for what
> > > > to the user is basically the same conceptual thing!
> > >
> > > Perhaps a simple example (simplified, in practice there are more complications)
> > > makes it more clear:
> > >
> > > The memory error handler takes different actions depending on the state of the
> > > page on which the error occurs.
> >
> > What you appear to be arguing for is the ability to inject different types of
> > events.
>
> Different events in different contexts with different drivers with different
> parameters [...]

Correct.

> [...] using different tools.

That's possible, but i'd expect tools/ras/ to be populated with uniformly working
tools. There's little sense in fragmenting the hw-testing field...

> Commonality: about 0% except there's "error" somewhere in the description.

Wrong. Their main purpose is common: they are events attached to existing hardware
topologies, which events can be configured, which events can be received and which
can be injected with attributes for rare-event simulation purposes.
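
A rough sketch of that shared model, for illustration only. Every name below
(Event, EventBus, the "memory/dimm0" path, the "page" attribute) is made up and
does not correspond to any existing kernel, perf or tools/ interface, and Python
is used only to keep it short. The point it tries to show is that configuring,
receiving and injecting can go through one path:

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Set, Tuple


@dataclass
class Event:
    source: str             # hypothetical topology path, e.g. "memory/dimm0"
    kind: str               # hypothetical event type, e.g. "corrected_error"
    attrs: Dict[str, int] = field(default_factory=dict)
    injected: bool = False  # True when simulated rather than reported by hw


class EventBus:
    """One channel for configuring, receiving and injecting events."""

    def __init__(self) -> None:
        self._enabled: Set[Tuple[str, str]] = set()
        self._listeners: List[Callable[[Event], None]] = []

    def enable(self, source: str, kind: str) -> None:
        # "Configure": turn on one event type for one topology element.
        self._enabled.add((source, kind))

    def subscribe(self, listener: Callable[[Event], None]) -> None:
        # "Receive": tools register once and see every enabled event.
        self._listeners.append(listener)

    def report(self, event: Event) -> bool:
        # Deliver an event; return False if that event type is not enabled.
        if (event.source, event.kind) not in self._enabled:
            return False
        for listener in self._listeners:
            listener(event)
        return True

    def inject(self, source: str, kind: str, **attrs: int) -> bool:
        # "Inject": simulate a rare event through the same delivery path.
        return self.report(Event(source, kind, dict(attrs), injected=True))


# Small demo: a tool subscribes once, then sees a simulated event.
bus = EventBus()
seen: List[Event] = []
bus.subscribe(seen.append)
bus.enable("memory/dimm0", "corrected_error")
bus.inject("memory/dimm0", "corrected_error", page=0x1234)
```

In this sketch a tool that consumes events does not need to know whether one was
injected or real, other than via the "injected" flag it can choose to look at.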

The tool people have told us loud and clear that they want to _receive_ events
in a unified and structured way - not via lots of separate ABIs from facilities that
have mismatching capabilities.

We want to be able to inject _other_ events as well, not just hw-error ones -
especially rare ones.

I.e. there's clear, demonstrated, patches-pending demand for uniformity and there's
similar demand for a broader concept.

You are now making the point that somehow the receipt and sending/injecting of 'hw
errors on Intel hardware' should be a separate, fragmented, disoriented, messy piece
of interface design, closely matching some ACPI spec detail, which should be
disassociated from the preferred mechanism of error reporting?

Your argument makes absolutely no sense to me.

The kernel is an abstraction machine: common hw aspects should be generalized to the
extent it makes sense, with reasonable extensions for anything we don't want to (or
cannot) generalize.

There's _tons_ of interesting structure here to be taken advantage of: just look at
what Boris is trying to achieve with his EDAC tooling patches. See what Lin Ming is
trying to do by moving event descriptors to /sys, so that events can come with
elements of our hw and sw topology in a natural way.

There is absolutely no justification whatsoever for the new /dev/erst-dbg ABI ...

Furthermore, you have ignored my other argument for the second time now: why does
this code not do the event _reporting_ via the facilities we use and prefer? As far
as users are concerned, the ability to receive hardware error events in a unified
way is an even more important aspect than the matter of event injection.

Once you do that i think you will see how naturally error injection fits into the
picture as well. It is an aspect of pretty much any event (not just hw-error events)
that we want to be able to 'inject/simulate' them, to test tools.

Your refusal to even consider this possibility and to look at the EDAC/RAS patches
that deal with this is puzzling to me.

Thanks,

Ingo