Re: + edac-new-opteron-athlon64-memory-controller-driver.patch added to -mm tree

From: Eric W. Biederman
Date: Thu Jul 06 2006 - 02:15:45 EST

Next message: Vojtech Pavlik: "Re: Generic interface for accelerometers (AMS, HDAPS, ...)"
Previous message: Parag Warudkar: "[PATCH] intelfbhw.c: intelfbhw_get_p1p2 defined but not used"
In reply to: Andi Kleen: "Re: + edac-new-opteron-athlon64-memory-controller-driver.patch added to -mm tree"
Next in thread: Andi Kleen: "Re: + edac-new-opteron-athlon64-memory-controller-driver.patch added to -mm tree"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

I think if this conversation is going to make headway we
need to step back a minute, and ask what makes sense to do
an where and not get caught up the details of an implementation.

The goal of the EDAC code is to report errors the hardware has
seen to upper layers of software so someone can do something with
them. Also it is hoped we can get a moderately standard interface
so that automated tools can recognize and do something when a
problem is reported. Is this a reasonable set of goals?

Memory errors are by far the most common kind of hardware error
and worth discussing. For an uncorrectable memory error there
are 3 interesting pieces of information.

- Which cpu address did the error happen at.
So we can kill the processes using that memory.
Although simply killing the entire machine appears acceptable.

- What is the chipsets idea of which DIMM the memory error occurred on.
For bus based memory architectures like the opteron this
is a chip select of the DIMM rank.
For serial memory architectures this is some kind of bus address,
but still useful for describing individual chips.

- What is the silk screen label on the motherboard that corresponds
to the chip selects with problems.

If you look at the memory controller, and the associated error
reporting registers (which are sometimes available in the machine check).
There has always been enough information to determine the hardware
address the memory controller knows the DIMM by.

Getting the address of the error is usually possible but not always
and not always very reliably.

Mapping between the hardware address that the memory controller
knows DIMMS by and the actual DIMMS themselves is actually
pretty easy even if you don't have any motherboard information.
It is just a matter of plugging in DIMMS in different positions
and seeing which DIMMS that the hardware currently sees. It's
maybe half a days work on an unknown motherboard.

...

Assuming we can agree that this is sane information we want. The
remaining question is how do we capture it.

For the mapping to the hardware address that the memory controller
knows the DIMM by requires the reading of hardware registers,
some that are not easily accessible to user space so a kernel driver
tends to make sense, just to get the information.

Possibly we could just export that information and let the
user space figure it out from there. But memory is a key system
component and hardware designers are very creative so coming
up with a consistent model would be very hard. So far we
have had to improve our helper functions every couple of chipsets
because the old models broke. Writing a driver split halfway between
the kernel and user space sounds silly.

....

The other pieces to me seem much more fluid. Especially since EDAC
does not yet export much if anything to user space except through
printk's in any stable kernel.

....

As for the suggestion of using DMI as best as I can determine it
suffers rather badly from the never ending creativity of the chipset
developers and does not have a model that can describe what needs
to happen for the current generation of chipset much less the bleeding
edge ones. Which is besides the fact that the only thing that you can
usually trust in DMI tables is the motherboard manufacturer.

I do think getting the motherboard id out of DMI provides a great key
to build a memory controller hardware address to DIMM label lookup
table. With EDAC we have been computing that information in user
space and caching it kernel side so we could generate immediately
useful print statements. Which is handy but probably not necessary.

Eric

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Next message: Vojtech Pavlik: "Re: Generic interface for accelerometers (AMS, HDAPS, ...)"
Previous message: Parag Warudkar: "[PATCH] intelfbhw.c: intelfbhw_get_p1p2 defined but not used"
In reply to: Andi Kleen: "Re: + edac-new-opteron-athlon64-memory-controller-driver.patch added to -mm tree"
Next in thread: Andi Kleen: "Re: + edac-new-opteron-athlon64-memory-controller-driver.patch added to -mm tree"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]