Re: Hardware Error Kernel Mini-Summit

From: Tony Luck
Date: Tue May 18 2010 - 17:37:17 EST


On Tue, May 18, 2010 at 1:42 PM, Ingo Molnar <mingo@xxxxxxx> wrote:
>> I proposed a (fairly straightforward) extension to which
>> Boris agreed: we can introduce 'persistent events',
>> which have task-less buffers attached to them, which
>> will hold (a configurable amount of) events.
>>
>> Those can then be picked up by a task later on and no
>> event is lost.
>>
>> Would such a feature address your concern?
>
> Tony, should we accelerate the development of this
> persistent events sub-feature?

The persistent event feature sounds like it will solve
the early logging issue.
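
To check that I understand the idea, here is a rough user-space
sketch of a task-less buffer that holds a bounded number of events
until some task comes along to read them. This is not the perf
code: the names (pevent, pevent_buf, pevent_record, pevent_drain)
and the fixed capacity are made up for illustration, and the
"keep the oldest, count the overflow" policy is my assumption, not
something anyone has proposed. Locking and NMI safety are ignored.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical capacity; stands in for the "configurable amount". */
#define PEVENT_CAPACITY 4

/* Hypothetical event record, not a real perf structure. */
struct pevent {
    uint64_t timestamp;
    uint32_t type;
    uint32_t data;
};

/* Ring buffer that is not owned by any task. */
struct pevent_buf {
    struct pevent slots[PEVENT_CAPACITY];
    size_t head;     /* next slot to write */
    size_t tail;     /* next slot to read */
    size_t count;    /* events currently held */
    size_t dropped;  /* events rejected because the buffer was full */
};

/*
 * Record an event before any reader exists.
 * Assumed policy: keep the oldest events and count the overflow,
 * since the earliest errors are usually the interesting ones.
 * Returns 0 on success, -1 if the event was dropped.
 */
static int pevent_record(struct pevent_buf *b, const struct pevent *ev)
{
    if (b->count == PEVENT_CAPACITY) {
        b->dropped++;
        return -1;
    }
    b->slots[b->head] = *ev;
    b->head = (b->head + 1) % PEVENT_CAPACITY;
    b->count++;
    return 0;
}

/*
 * Called by a task that attaches later: copy out up to max pending
 * events in arrival order and return how many were copied.
 */
static size_t pevent_drain(struct pevent_buf *b, struct pevent *out,
                           size_t max)
{
    size_t n = 0;

    while (n < max && b->count > 0) {
        out[n++] = b->slots[b->tail];
        b->tail = (b->tail + 1) % PEVENT_CAPACITY;
        b->count--;
    }
    return n;
}
```

With a zero-initialized pevent_buf, early code would call
pevent_record() for each error, and the daemon would call
pevent_drain() once it starts and check the dropped counter to see
whether the configured size was large enough.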

> Boris posted initial patches of the new perf events based
> EDAC/MCE/RAS design direction to lkml and indicated that
> it works for him. He also indicated that he can do the
> initial work of unifying EDAC and MCE without the
> persistent events feature for now. (this all is obviously
> v2.6.36-ish material)
>
> But if it's important, if you'd like to move ahead with
> the unification swiftly then we can certainly increase its
> priority.

We've missed the deadlines for inclusion in certain
popular distributions ... so it may be OK to take a
relatively leisurely path to getting this done right
rather than rushing.

> 3) Another new perf feature of interest is 'perf inject'
> (this too went upstream today): to inject artificial
> events into the stream of events. This mechanism could be
> used to simulate rare error conditions and to test out
> policy reactions systematically - an important part of
> system error recovery testing.

Simulated errors are handy for testing the very
top level of the s/w stack. But real errors are
better. There's some APEI code in Len's tree
that can inject real errors (on systems with the
right BIOS hooks enabled).

> This gives us a broad platform to add various RAS events
> as well, beyond raw hardware events: we could for example add
> events for various system anomalies such as lockup
> messages, kernel warnings/oopses, IOMMU exceptions - maybe
> even pure software concepts such as fatal segmentation
> fault events, etc. etc.

This looks like sticky ground. I can see the event mechanism
passing data to a user daemon working well for all kinds of
corrected and minor errors. But when you start talking about
lockups and fatal errors, things get a lot trickier. Often the
main concern at this point is error containment: making sure
that the flaky data doesn't become visible (saved to storage,
transmitted to the network, etc.). Getting from a machine check
handler through some context switches (and page
faults etc.) to a user level daemon before the error
gets recorded looks to be really hard.

> That way the RAS daemon could build and utilize a complete
> and coherent set of events it wants to subscribe to - all
> via the same event transport mechanism. It would thus have
> a comprehensive 'system health' view, via a single,
> reliable mechanism, and could act in a wide range of
> scenarios, with a wide range of policy actions, based on a
> very complete picture.

In a cluster/cloud/datacenter that daemon will need to be
networked and hooked to the system management tools
that are controlling the bigger environment. But I agree
that this looks like a worthy end goal.

-Tony