Re: noisy edac

From: Eric W. Biederman
Date: Mon Jan 30 2006 - 22:21:22 EST


Dave Peterson <dsp@xxxxxxxx> writes:

> On Monday 30 January 2006 16:02, Gunther Mayer wrote:
>> Just printk() the exact driver specific low-level error, even if non-fatal.
>>
>> Single non-fatal errors just show your system recovers correctly.
>>
>> Multiple (e.g. noisy) non-fatal errors are either an indication of a
>> serious problem (e.g. after how many corrected ECC errors on the same
>> address in which time interval will you replace your DIMM? How many S-ATA
>> CRC errors will indicate marginally bad cabling?) or they show that the
>> problem needs root-cause analysis. But don't disable the messages, as this
>> will only hide the real problem.
>>
>> Concerning Non-Fatal PCI Express errors, the error cause registers need
>> to be printed in case of error, too (see Intel Chipset Specifications)
>
> I agree that in general, the kernel should not be silent when errors are
> detected. However, for a particular type of error, it may be that the
> user is aware of the error (because (s)he has seen the messages), the user
> has determined the root cause, and it turns out that the error is benign.
> Therefore the user may want to suppress further messages of this type to
> avoid cluttering the logs. If you don't provide that option to the user,
> then it can be viewed as hardcoding a certain aspect of sysadmin policy
> into the kernel.
>
> Depending on the particular type of error, it may be appropriate to just
> offer the user two options: either printk() or be silent. For other types
> of errors, it may make sense to give the user more than two options (for
> instance ignore, printk(), or panic()). I think developers of chipset
> drivers can make this decision individually for each type of error.

One piece missing from this conversation is that we need errors reported
in a uniform format. That is why edac_mc has helper functions.

However, there will always be errors that don't fit any particular model.
Could we add an edac_printk(dev, ...) that is similar to dev_printk but
prints out an EDAC header and the device on which the error was found,
leaving the rest of the string to be specified by the caller?
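
To make the idea concrete, here is a rough userspace sketch of the formatting
such a helper might do. Everything in it is an assumption for illustration:
struct mock_device stands in for the kernel's struct device, edac_format() is
a made-up name, and the "EDAC <device>: " header layout is just one possible
choice, not an existing kernel format. A real edac_printk would hand the
result to printk() instead of filling a buffer.

```c
#include <assert.h>
#include <stdarg.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for struct device; only a name is kept. */
struct mock_device {
	const char *name;
};

/*
 * Write "EDAC <device>: " followed by the caller's formatted text into buf.
 * Returns the length the full message needs (as vsnprintf() does), or a
 * negative value on a formatting error.
 */
static int edac_format(char *buf, size_t size, const struct mock_device *dev,
		       const char *fmt, ...)
{
	va_list args;
	int hdr, body;

	assert(buf && dev && dev->name && fmt);

	/* Uniform header: fixed tag plus the device the error was found on. */
	hdr = snprintf(buf, size, "EDAC %s: ", dev->name);
	if (hdr < 0 || (size_t)hdr >= size)
		return hdr;	/* error, or header alone was truncated */

	/* The rest of the string is entirely up to the caller. */
	va_start(args, fmt);
	body = vsnprintf(buf + hdr, size - (size_t)hdr, fmt, args);
	va_end(args);
	if (body < 0)
		return body;

	return hdr + body;
}
```

With a device named "mc0", calling edac_format(buf, sizeof(buf), &dev,
"CE page 0x%lx, row %d", 0x1234UL, 2) leaves "EDAC mc0: CE page 0x1234, row 2"
in buf.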

For actual control that interface may be too blunt, but at least for people
looking in the logs it allows all of the errors to be detected and harvested.
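
As an illustration of the harvesting side, a log scanner only has to look for
the common header. The sketch below assumes a hypothetical line layout of
"EDAC <device>: <message>", possibly preceded by a log-level or timestamp
prefix; neither that layout nor the name edac_parse_line() comes from existing
code.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define EDAC_TAG "EDAC "	/* assumed header tag, not an existing format */

/*
 * If line contains "EDAC <device>: <message>", copy the device name into
 * dev (at most dev_size - 1 characters) and return a pointer to the start
 * of the message. Return NULL for lines that do not match.
 */
static const char *edac_parse_line(const char *line, char *dev,
				   size_t dev_size)
{
	const char *start, *sep;
	size_t len;

	assert(line && dev && dev_size > 0);

	/* Skip anything (log level, timestamp) in front of the tag. */
	start = strstr(line, EDAC_TAG);
	if (!start)
		return NULL;
	start += strlen(EDAC_TAG);

	/* The device name runs up to the first ": " separator. */
	sep = strstr(start, ": ");
	if (!sep)
		return NULL;

	len = (size_t)(sep - start);
	if (len == 0 || len >= dev_size)
		return NULL;

	memcpy(dev, start, len);
	dev[len] = '\0';
	return sep + 2;
}
```

Lines without the tag are rejected, so a tool can pass over unrelated kernel
messages and collect only the EDAC ones, grouped by device.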

Eric