Re: [PATCH 0/2] Generic hardware error reporting support

From: Linus Torvalds
Date: Fri Nov 19 2010 - 21:17:14 EST


On Fri, Nov 19, 2010 at 6:04 PM, huang ying
<huang.ying.caritas@xxxxxxxxx> wrote:
>
> We thought about 'printk' for hardware errors before, but it has some
> issues too.
>
> 1) It mixes software errors and hardware errors. When Andi Kleen
> maintained the Machine Check code, he found many users report the
> hardware errors as software bug to software vendor instead of as
> hardware error to hardware vendor. Having explicit hardware error
> reporting interface may help these users.

Bah. Many machine checks _were_ software errors. They were things like
the BIOS not clearing some old pending state etc.

The confusion came not from printk, but simply from ambiguous errors.
When is a machine check hardware-related? It's not at all always
obvious.

Sometimes machine checks are from uninitialized hardware state, where
_software_ hasn't initialized it. Is it a hardware bug? No.

> 2) Hardware error reporting may flush other information in printk
> buffer. Considering one pin of your ECC DIMM is broken, tons of 1 bit
> corrected memory error will be reported. Although we can enforce some
> kind of throttling, your printk buffer may be full of the hardware
> error reporting eventually.

Sure. That doesn't change the fact that finding the data in your
/var/log/messages and your regular logging tools is still a lot more
useful than having some random tool that is specialized and that most
IT people won't know about. And that won't be good at doing network
reporting etc etc.

The thing is, hardware errors aren't that special. Sure, hardware
people always think so. But to anybody else, a hardware error is "just
another source of issues".

Anybody who thinks that hardware errors are special and need a
special interface is missing that point totally.

And I really do understand why people inside Intel would miss that
point. To YOU guys the hardware errors you report are magical and
special. But that's always true. To _everybody_, the errors _they_
report are special. Like snowflakes, we're all unique. And we're all
the same.

> 3) We need some kind of user space hardware error daemon, which is
> used to enforce some policy. For example, if the number of corrected
> memory errors reported on one page exceeds the threshold, we can
> offline the page to prevent some fatal error to occur in the future,
> because fatal error may begin with corrected errors in reality. printk
> is good for administrator, and may be not good enough for the hardware
> error daemon.

And by "we", who do you mean exactly? The fact is, "we" covers a lot
of ground, and I don't think your statement is in the least true.

Yes, IT people want to know. When they start seeing hardware errors,
they'll start replacing the machine as soon as they can. Whether that
replacement is then "in five minutes" or "four months from now" is up
to their management, their replacement policy, and based on how
critical that machine is.

IT HAS NOTHING WHAT-SO-EVER TO DO WITH HOW OFTEN THE ERRORS HAPPEN.

And yes, Intel can do guidelines, but when you say there should be
some "enforced policy" by some tool, you're simply just wrong.

Linus