Re: [PATCH 3/3] ghes_edac: add platform check to enable ghes_edac

From: Mauro Carvalho Chehab
Date: Fri Jul 21 2017 - 13:01:51 EST


On Fri, 21 Jul 2017 16:40:20 +0000
"Kani, Toshimitsu" <toshi.kani@xxxxxxx> wrote:

> On Fri, 2017-07-21 at 12:44 -0300, Mauro Carvalho Chehab wrote:
> > On Fri, 21 Jul 2017 15:34:50 +0000
> > "Kani, Toshimitsu" <toshi.kani@xxxxxxx> wrote:
> >
> > > On Fri, 2017-07-21 at 17:13 +0200, Borislav Petkov wrote:
> > > > On Fri, Jul 21, 2017 at 03:08:41PM +0000, Kani, Toshimitsu
> > > > > wrote:
> > > > > > Yes, that is correct.  Corrected errors are reported to the OS
> > > > > > when they exceeded the platform's threshold.
> > > >
> > > > > Are those thresholds user-configurable?
> > >
> > > > I suppose it'd depend on vendors, but I do not think users can do
> > > > it properly unless they have in-depth knowledge of the hardware.
> > >
> > > > If not, what are you telling users who want to see *every*
> > > > > corrected error for measuring DIMM wear and so on...?
> > >
> > > > Corrected errors are normal and expected to occur on healthy
> > > > hardware. They do not need the user's attention until they
> > > > occur repeatedly at the same place.
> >
> > Yes, they're expected to happen. Still, some sys admins have their
> > own measurements about what's "normal" for their scenario, and want
> > to monitor every single corrected error, running their own
> > algorithm to warn if the number of corrected errors is above their
> > "normal" rate.
>
> I suppose these admins had to do it because their platforms reported
> all corrected errors. A platform threshold relieves such
> administrators of that burden.

I see the value of having a threshold in BIOS, provided that it is
well documented and that its value can be adjusted if needed.

One of the things I wanted to implement in ras-daemon was an
algorithm that would apply such a threshold in software.
The problem is that it would require field experience. So
I talked with a few vendors to see if they could help with
it but, at that time, none raised their hands :-)

The thing with a BIOS threshold is that the user has no way to
audit the algorithm. So, when the BIOS starts reporting such errors,
it may already be too late: the system may be on the verge of
losing data (or some data may already have been lost).

That's critical on cluster systems with thousands of machines:
while the impact of taking a cluster node offline for maintenance
is marginal, an uncorrected error on a single machine may
compromise weeks of expensive processing.

That's why some users prefer to monitor every single corrected
error and compare the observed rate against a probability
distribution for which they know that the risk of uncorrected
errors is acceptable.
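
To make that concrete, here is a rough sketch of what such a
user-side check could look like. It is only an illustration under
assumptions of my own: corrected errors per DIMM are modelled as
Poisson events with a site-specific baseline rate, and the names
(CorrectedErrorMonitor, baseline_per_window, alpha) are hypothetical.
It is not taken from ras-daemon or from any BIOS.

```python
import math
from collections import defaultdict, deque


def poisson_tail(k, lam):
    """Return P(X >= k) for X ~ Poisson(lam)."""
    if k <= 0:
        return 1.0
    # Accumulate P(X < k) = sum_{i=0}^{k-1} exp(-lam) * lam**i / i!
    term = math.exp(-lam)
    cdf = term
    for i in range(1, k):
        term *= lam / i
        cdf += term
    return max(0.0, 1.0 - cdf)


class CorrectedErrorMonitor:
    """Hypothetical per-DIMM sliding-window monitor (illustration only)."""

    def __init__(self, window_s, baseline_per_window, alpha):
        self.window_s = window_s             # window length in seconds
        self.baseline = baseline_per_window  # errors per window seen as "normal"
        self.alpha = alpha                   # tail probability that triggers a warning
        self.events = defaultdict(deque)     # DIMM label -> timestamps in window

    def record(self, dimm, timestamp):
        """Record one corrected error; return True if the rate looks abnormal."""
        q = self.events[dimm]
        q.append(timestamp)
        # Drop events that have fallen out of the sliding window.
        while q and q[0] <= timestamp - self.window_s:
            q.popleft()
        # Warn when this many errors would be very unlikely at the baseline rate.
        return poisson_tail(len(q), self.baseline) < self.alpha
```

With a one-hour window, a baseline of one corrected error per hour
and alpha = 0.01, the fifth error on the same DIMM within the hour is
the first one that would raise a warning. The baseline and alpha are
exactly the values that need field experience to choose.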

Thanks,
Mauro