From: Russ Anderson
Date: Tue Apr 12 2011 - 22:24:12 EST

On Tue, Apr 12, 2011 at 01:02:21PM -0700, Luck, Tony wrote:
> > Why not? This way you turn reporting of _ALL_ correctable MCEs
> > completely off and some users would actually like to run them through
> > mcelog on Intel.
> pr_emerg() is rather overkill for a corrected error - on large systems
> corrected errors are going to be a routine occurrence (my personal estimate
> is "one soft error per gigabyte per month" ... which is pretty much the
> same as "one per terabyte per hour" for the people with the really cool
> toys).

Good point.
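
The two rates quoted above really are close. A back-of-the-envelope
sketch of the conversion, assuming a 30-day month and 1024 GB per TB
(the function name and constants are mine, purely for illustration):

```python
HOURS_PER_MONTH = 30 * 24  # assumption: 30-day month
GB_PER_TB = 1024           # assumption: binary terabyte


def errors_per_tb_hour(errors_per_gb_month: float) -> float:
    """Convert a soft-error rate from per-GB-per-month to per-TB-per-hour."""
    return errors_per_gb_month * GB_PER_TB / HOURS_PER_MONTH


rate = errors_per_tb_hour(1.0)
print(f"{rate:.2f} corrected errors per TB per hour")
```

Under those assumptions, one soft error per gigabyte per month works out
to roughly 1.4 per terabyte per hour, so a multi-terabyte machine would
log corrected errors more or less continuously.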

> We are also setting TAINT_MACHINE_CHECK for corrected errors - perhaps
> this made sense when systems were small and machine checks were rare and
> scary. But I think we need to start working with the reality that
> corrected errors are normal events.

I agree. Corrected errors, by definition, have had their data corrected by
hardware. There is no corruption, so there is no reason for kernel taint.
It would be like setting taint when one hard drive in a RAID array goes bad.

It's worth noting that Linux does not set taint when it recovers from
_uncorrected_ memory errors on IA64 (by killing the application
that consumed the bad data and discarding the bad page). Modern hardware
has enough error detection/correction code to avoid undetected data
corruption from memory errors.

Russ Anderson, OS RAS/Partitioning Project Lead
SGI - Silicon Graphics Inc rja@xxxxxxx
