Re: [PATCH 00/22] HWPOISON: Intro (v5)

From: Andi Kleen
Date: Mon Jun 15 2009 - 12:11:01 EST

On Mon, Jun 15, 2009 at 04:28:04PM +0100, Alan "zSeries" Cox wrote:

> curse a lot
> suspend to disk
> remove dirt from fans, clean/replace RAM
> resume from disk
> The very act of making the ECC error not take out the box creates the

Ok so at least you agree now that handling these errors without
panic is the right thing to do. That's at least some progress.

> environment whereby the underlying hardware error (if there was one) can
> be cured.

These ECC errors are still somewhat rare (or rather if they become
common you should definitely service the system). That is why
losing a single page of memory for them isn't a big issue normally.

Sure, you could spend effort making unpoisoning work,
but it would seem very dubious to me. After all it's just another
4K of memory for each error.

The only reasonably good use case I've heard for unpoisoning is
when you have a lot of huge pages (a huge page with one bad small page
can't be used as a full huge page), but that's also still relatively exotic.


[1] mostly you need a new special form of RCU I think

ak@xxxxxxxxxxxxxxx -- Speaking for myself only.
