Re: [RFC 0/9] mce recovery for Sandy Bridge server
From: Borislav Petkov
Date: Tue May 24 2011 - 04:14:47 EST
On Tue, May 24, 2011 at 05:40:23AM +0200, Ingo Molnar wrote:
> So we *really* want to promote this code to a higher level of abstraction.
> Everyone would benefit from doing that: Intel hardware error handling features
> would be enabled much more richly and i suspect they would also be *used* in a
> much more meaningful way - driving the hw cycle as well.
Absolutely agreed. The RAS architecture should look like this, IMHO:
I. Event collection: #MC handler and pollers, no queueing or buffering crap.
II. Pluggable and extensible filters which are
* per vendor
* configurable from userspace
* easily extensible
* decide whether action should be taken in the kernel or whether the error is
non-critical and should go to the RAS daemon
III. Error handling callback(s)
* also extensible
* also per vendor
* also configurable from userspace
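To make the three stages a bit more concrete, here is a rough userspace C
sketch of how they could hang together. This is only an illustration under
assumptions: every name in it (ras_event, ras_register_filter,
ras_filter_set_enabled, ras_dispatch, the example_* hooks) is made up for this
mail and is not an existing kernel interface, the fixed-size tables stand in
for whatever registration mechanism would really be used, and the "default to
the daemon when no filter has an opinion" policy is my guess. The only
hardware fact it relies on is that bit 61 of an MCA status register is the UC
(uncorrected) flag.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical vendor tags, not taken from any kernel header. */
enum ras_vendor { RAS_VENDOR_ANY, RAS_VENDOR_INTEL, RAS_VENDOR_AMD };

enum ras_verdict {
	RAS_PASS,	/* filter has no opinion, ask the next one */
	RAS_TO_DAEMON,	/* non-critical: hand off to the RAS daemon */
	RAS_IN_KERNEL,	/* critical: run the in-kernel handler callbacks */
};

/* I. What the #MC handler or a poller collects; no queueing here. */
struct ras_event {
	enum ras_vendor vendor;
	int bank;
	uint64_t status;
	uint64_t addr;
};

typedef enum ras_verdict (*ras_filter_fn)(const struct ras_event *ev);
typedef void (*ras_handler_fn)(const struct ras_event *ev);

struct ras_filter { enum ras_vendor vendor; bool enabled; ras_filter_fn fn; };
struct ras_handler { enum ras_vendor vendor; bool enabled; ras_handler_fn fn; };

#define RAS_MAX_HOOKS 8
static struct ras_filter filters[RAS_MAX_HOOKS];
static struct ras_handler handlers[RAS_MAX_HOOKS];
static size_t nr_filters, nr_handlers;

/* II. Register a per-vendor filter; returns its slot for later toggling. */
size_t ras_register_filter(enum ras_vendor vendor, ras_filter_fn fn)
{
	assert(nr_filters < RAS_MAX_HOOKS);
	filters[nr_filters] = (struct ras_filter){ vendor, true, fn };
	return nr_filters++;
}

/* III. Register a per-vendor error handling callback. */
size_t ras_register_handler(enum ras_vendor vendor, ras_handler_fn fn)
{
	assert(nr_handlers < RAS_MAX_HOOKS);
	handlers[nr_handlers] = (struct ras_handler){ vendor, true, fn };
	return nr_handlers++;
}

/* Stand-in for the "configurable from userspace" knob. */
void ras_filter_set_enabled(size_t slot, bool enabled)
{
	assert(slot < nr_filters);
	filters[slot].enabled = enabled;
}

static bool vendor_matches(enum ras_vendor hook, enum ras_vendor ev)
{
	return hook == RAS_VENDOR_ANY || hook == ev;
}

/* Single entry point: real events and injected test events both land here. */
enum ras_verdict ras_dispatch(const struct ras_event *ev)
{
	enum ras_verdict verdict = RAS_TO_DAEMON; /* assumed default */
	size_t i;

	for (i = 0; i < nr_filters; i++) {
		enum ras_verdict v;

		if (!filters[i].enabled ||
		    !vendor_matches(filters[i].vendor, ev->vendor))
			continue;
		v = filters[i].fn(ev);
		if (v != RAS_PASS) {	/* first filter with an opinion wins */
			verdict = v;
			break;
		}
	}
	if (verdict != RAS_IN_KERNEL)
		return verdict;
	for (i = 0; i < nr_handlers; i++)
		if (handlers[i].enabled &&
		    vendor_matches(handlers[i].vendor, ev->vendor))
			handlers[i].fn(ev);
	return verdict;
}

/* Example hooks: uncorrected errors are critical, everything else passes. */
#define EXAMPLE_STATUS_UC (1ULL << 61)
unsigned int example_handled;

enum ras_verdict example_uc_filter(const struct ras_event *ev)
{
	return (ev->status & EXAMPLE_STATUS_UC) ? RAS_IN_KERNEL : RAS_PASS;
}

void example_count_handler(const struct ras_event *ev)
{
	(void)ev;
	example_handled++;
}
```

With example_uc_filter and example_count_handler registered for
RAS_VENDOR_INTEL, an event with the UC bit set ends up in the handler, while
one without it falls through to the daemon path. Disabling the filter slot via
ras_filter_set_enabled() reroutes the same event without touching the
collection code, which is the kind of separation I mean.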
Advantages:
* reuse perf code - no need for ad-hoc new buffers and lockless thingies when we
have it all already
* easy code and even hw testing with perf inject or ras inject
** this also gives us the different injection methods per vendor in a unified
way instead of interfaces in /sys or debugfs or mcelog or ...
* keep code design sane instead of letting it needlessly fiddle with
other parts of the kernel
* ...
Now I had better go and put my patches where my mouth is :).
--
Regards/Gruss,
Boris.