Re: [PATCH 00/22] HWPOISON: Intro (v5)

From: Russ Anderson
Date: Tue Jun 16 2009 - 16:55:21 EST


On Tue, Jun 16, 2009 at 01:28:54PM -0700, H. Peter Anvin wrote:
> Russ Anderson wrote:
> > On Mon, Jun 15, 2009 at 03:29:34PM +0200, Andi Kleen wrote:
> >> I think you're wrong about killing processes decreasing
> >> reliability. Traditionally we always tried to keep things running if possible
> >> instead of panicking.
> >
> > Customers love the ia64 feature of killing a user process instead of
> > panicking the system when a user process hits a memory uncorrectable
> > error. Avoiding a system panic is a very good thing.
>
> Sometimes (sometimes it's a very bad thing.)
>
> However, the more fundamental thing is that it is always trivial to
> promote an error to a higher severity; the opposite is not true. As
> such, it becomes an administrator-set policy, which is what it needs to be.

Good point. On ia64 the recovery code is implemented as a loadable
kernel module. Installing the module turns on the feature.

That is handy for customer demos. Install the module, inject a
memory error, have an application read the bad data and get killed.
Repeat a few times. Then uninstall the module, inject a
memory error, have an application read the bad data and watch
the system panic.

Then it is the customer's choice to have it on or off.
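
As a rough sketch of that on/off choice: the decision boils down to
something like the following. Every name here is hypothetical and for
illustration only. This is not the ia64 recovery module or the HWPOISON
code, and it only models the case of a user process consuming the bad
data.

```c
#include <stdbool.h>

/* Hypothetical names; an illustration of the policy, not real kernel code. */
enum ue_action {
    UE_KILL_PROCESS,   /* contain the error to the process that read the bad data */
    UE_PANIC           /* bring the whole system down */
};

struct ue_policy {
    /* Assumption: stands in for "the recovery module is installed". */
    bool recovery_enabled;
};

/*
 * Decide what to do when a user process consumes an uncorrectable
 * memory error. With recovery enabled the process is killed; without
 * it the error is handled at the higher severity and the system panics.
 */
enum ue_action ue_decide(const struct ue_policy *policy)
{
    return policy->recovery_enabled ? UE_KILL_PROCESS : UE_PANIC;
}
```

With recovery_enabled set, ue_decide() returns UE_KILL_PROCESS; with it
cleared, it returns UE_PANIC, which matches the two halves of the demo.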

--
Russ Anderson, OS RAS/Partitioning Project Lead
SGI - Silicon Graphics Inc rja@xxxxxxx