Re: [PATCH 1/5] HWPOISON: define VM_FAULT_HWPOISON to 0 when feature is disabled

From: Nick Piggin
Date: Mon Jun 15 2009 - 03:04:29 EST


On Fri, Jun 12, 2009 at 09:37:24AM -0700, H. Peter Anvin wrote:
> Ingo Molnar wrote:
> >
> > So i think hwpoison simply does not affect our ability to get log
> > messages out - but it sure allows crappier hardware to be used.
> > Am i wrong about that for some reason?
> >
>
> Crappy hardware isn't the kind of hardware that is likely to have the
> hwpoison features, just like crappy hardware generally doesn't even have
> ECC -- or even basic parity checking (I personally think non-ECC memory
> should be considered a crime against humanity in this day and age.)

What I would find interesting with this hwpoison would be the probability
difference between detecting an uncorrected error, and undetected errors.


> These kinds of features are used when extremely high reliability is
> required, think for example a telco core router. A page error may have
> happened due to stray radiation or through power supply glitches (which
> happen even in the best of systems), but if they are a pattern, a box
> needs to be replaced. *How quickly* a box can be taken out of service
> and replaced can vary greatly, and its urgency depend on patterns;
> furthermore, in the meantime the device has to work the best it can.

I don't know how much improvements that hwpoison will give. Significant
amount of RAM cannot be corrected, so especially on like a core router
or embedded system which does not use a lot of disk/pagecache, then it
is probably more like 2x improvement rather than an order of magnitude
improvement.


> Consider, for example, a control computer on the Hubble Space Telescope
> -- the only way to replace it is by space shuttle, and you can safely
> guarantee that *that* won't happen in a heartbeat. On the new Herschel
> Space Observatory, not even the space shuttle can help: if the computers
> die, *or* if bad data gets fed to its control system, the spacecraft is
> lost. As such, it's of paramount importance for the computers to (a)
> continue to provide service at the level the hardware is capable of
> doing, (b) as accurately as possible continually assess and report that
> level of service, and (c) not allow a failure to pass undetected. A lot
> of failures are simple one-time events (especially in space, a high-rad
> environment), others reflect decaying hardware but can be isolated (e.g.
> a RAM cell which has developed a short circuit, or a CPU core which has
> a damaged ALU), while others yet reflect a general ill health of the
> system that cannot be recovered.

I guess most of these examples have to go far beyond this and use
multiply redundant computation and voting systems and quickly
reboot members that are kicked out. :)

Not that it is a detrement of hwpoison. If they used Linux I'm
sure they would like to panic on uncorrected error too (but would
probably not bother trying to do heuristic recovery).
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/