Re: Hardware Error Kernel Mini-Summit
From: Eric W. Biederman
Date: Mon Jun 14 2010 - 16:07:24 EST
Andi Kleen <andi@xxxxxxxxxxxxxx> writes:
>> Just left the above for reference. How would this affect other
>> aspects of EDAC such as the error injection, the sysfs
>> entries that (in most cases) reflect the layout of DIMMs, and
>
> Some of this can probably be retained, but the way EDAC
> e.g. represents layout is quite unsuitable too. It includes
> a lot of internal implementation details that in some cases
> you can't even get anymore on modern designs. Something
> with a proper abstract interface is better. EDAC never had this.
It sounds like you can't be bothered to understand the EDAC code,
or the fact that some users actually like to know when their hardware
is having problems.
> Also the biggest problem is still that EDAC doesn't
> give you any silk screen labels, so unless you
> have motherboard schematics the layout it presents
> is fairly useless -- you still don't know which DIMM
> to exchange. So in theory EDAC looks great, but in practice ...
- In practice it works even without silk screen labels.
- The current EDAC code displays which DIMMs you have plugged
  in, so if you unplug one you can tell whether it was the DIMM
  you were aiming at.
> On a lot of modern systems I checked DMI
> seems reasonably accurate in terms of layout, so I suspect they can
> be handled with this. For others probably
> still need some special driver, but one
> with a proper interface.
DMI is great on the days it works, but there is a lot of variation
between BIOSes. Also, if the information is decent it can be
used to inform the current EDAC code as well as anything else.
You mean an interface that doesn't report the error, so people
won't complain to you about a near-useless kernel error
message?
> Anyways the old EDAC drivers for this are not going
> away, you can still use them. The interesting
> question though is how to properly define the interface
> for new hardware.
>
>> allow the setting of scrub rate? If we're just talking about
>
> I never quite saw the point of that one, but yes
> there's no replacement for this anywhere else.
>
> Normally scrub rate can be simply set in the BIOS,
> is that not good enough? Is there a use case for
> changing it dynamically?
>
> Note that modern hardware typically has demand scrubbing
> anyways, that is when there is an error it automatically
> scrubs.
Setting the scrub rate isn't half so interesting as displaying
it.
Having basic hardware information displayed in sysfs seems to be the
design of the rest of Linux. I don't see abandoning that part of the
EDAC design as wise.
Displaying the fact that ECC is turned on in the hardware is one
of the more interesting bits. That at least allows you to verify
that things are working.
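
For illustration only, here is a minimal userspace sketch of reading that
kind of information back out of sysfs. It assumes the per-controller
attributes described in the kernel's EDAC documentation (mc_name, size_mb,
sdram_scrub_rate, ce_count, ue_count) under a directory such as
/sys/devices/system/edac/mc/mc0; which of them exist depends on the driver
and kernel version. The helper names read_attr and print_mc_info are
hypothetical, not part of any existing tool.

```c
#include <stdio.h>
#include <string.h>

/*
 * Read the first line of <dir>/<attr> into buf, stripping the newline.
 * Returns 0 on success, -1 if the file is missing or unreadable.
 */
static int read_attr(const char *dir, const char *attr, char *buf, size_t len)
{
    char path[512];
    FILE *f;
    int n;

    if (len == 0)
        return -1;
    n = snprintf(path, sizeof(path), "%s/%s", dir, attr);
    if (n < 0 || (size_t)n >= sizeof(path))
        return -1;

    f = fopen(path, "r");
    if (!f)
        return -1;
    if (!fgets(buf, (int)len, f)) {
        fclose(f);
        return -1;
    }
    fclose(f);
    buf[strcspn(buf, "\n")] = '\0';
    return 0;
}

/*
 * Print whichever of the assumed attributes are present for one memory
 * controller directory. Returns the number of attributes found, so a
 * result of 0 suggests no EDAC driver is loaded for that controller.
 */
static int print_mc_info(const char *mc_dir)
{
    /* Attribute names assumed from the EDAC sysfs documentation. */
    static const char *const attrs[] = {
        "mc_name", "size_mb", "sdram_scrub_rate", "ce_count", "ue_count",
    };
    char val[128];
    int found = 0;

    for (size_t i = 0; i < sizeof(attrs) / sizeof(attrs[0]); i++) {
        if (read_attr(mc_dir, attrs[i], val, sizeof(val)) == 0) {
            printf("%s/%s = %s\n", mc_dir, attrs[i], val);
            found++;
        }
    }
    return found;
}
```

Calling print_mc_info("/sys/devices/system/edac/mc/mc0") on a machine with
an EDAC driver loaded should list whichever of those attributes the driver
exposes; missing attributes are silently skipped.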
>> replacing all instances of printk (when logging single bit
>> errors) with perf events, I don't really see that as a problem.
>
> I don't think perf is the right tool for this, the semantics
> are mostly unsuitable (it hasn't been designed as an error reporting
> tool, but as a performance tool and performance events are quite
> different from errors) and it doesn't provide most of the infrastructure
> needed for it anyways.
I will agree with that. The argument that errors which should happen
only rarely need a high-performance handler seems to indicate
there is some deep misunderstanding of the code.
>> But EDAC is much more than that today...
>
> Well it's a hodge podge of quite a lot of odd bits.
> I'm not sure "more" is the right word.
If the basic errors could be posted in some kind of NMI/machine check
safe data structure it would not be hard to get EDAC drivers to
consume them.
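
To make that concrete, a rough sketch of one possible shape for such a
structure: a fixed-size, lock-free ring with a single producer (the
NMI/machine check handler on one CPU) and a single consumer (an EDAC
driver running in normal context). It is written with C11 atomics so it
can be compiled in userspace; kernel code would use the kernel's own
atomics and barriers instead. The record layout and all names here
(hw_err_record, err_ring, err_ring_post, err_ring_consume) are
hypothetical, and a real design would need one ring per CPU or a
multi-producer scheme.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define ERR_RING_SIZE 8u /* power of two, so wrapping is a cheap mask */

/* Hypothetical error record; the fields are illustrative only. */
struct hw_err_record {
    uint64_t addr;      /* physical address reported by the hardware */
    uint32_t syndrome;  /* ECC syndrome, if available */
    uint32_t corrected; /* 1 = corrected error, 0 = uncorrected */
};

struct err_ring {
    struct hw_err_record slot[ERR_RING_SIZE];
    atomic_uint head;    /* next slot to write; stored only by the producer */
    atomic_uint tail;    /* next slot to read; stored only by the consumer */
    atomic_uint dropped; /* records lost because the ring was full */
};

/*
 * Producer side: no locks, no allocation, no sleeping, so it is usable
 * from a context that cannot block. Returns false and counts a drop when
 * the ring is full.
 */
static bool err_ring_post(struct err_ring *r, const struct hw_err_record *rec)
{
    unsigned head = atomic_load_explicit(&r->head, memory_order_relaxed);
    unsigned tail = atomic_load_explicit(&r->tail, memory_order_acquire);

    if (head - tail == ERR_RING_SIZE) {
        atomic_fetch_add_explicit(&r->dropped, 1, memory_order_relaxed);
        return false;
    }
    r->slot[head & (ERR_RING_SIZE - 1)] = *rec;
    /* Publish the record only after its contents are written. */
    atomic_store_explicit(&r->head, head + 1, memory_order_release);
    return true;
}

/* Consumer side: returns false when there is nothing to read. */
static bool err_ring_consume(struct err_ring *r, struct hw_err_record *out)
{
    unsigned tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
    unsigned head = atomic_load_explicit(&r->head, memory_order_acquire);

    if (head == tail)
        return false;
    *out = r->slot[tail & (ERR_RING_SIZE - 1)];
    /* Release the slot back to the producer after copying it out. */
    atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
    return true;
}
```

The producer never waits: when the consumer falls behind, new records are
dropped and counted rather than blocking the handler, and the drop count
itself can be reported later.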
Eric