Re: + edac-new-opteron-athlon64-memory-controller-driver.patch added to -mm tree

From: Eric W. Biederman
Date: Thu Jul 06 2006 - 14:43:03 EST


Andi Kleen <ak@xxxxxx> writes:

> On Thu, Jul 06, 2006 at 11:46:00AM -0600, Eric W. Biederman wrote:
>> Andi Kleen <ak@xxxxxx> writes:
>>
>> >> With EDAC on my next boot I get positive confirmation that I either
>> >> pulled the DIMM that the error happened on, or I pulled a different
>> >> DIMM.
>> >
>> > How? You simulate a new error and let EDAC resolve it?
>>
>> No. There is a status report that tells you which pieces of hardware
>> your memory controller sees. It is just a simple list.
>
> Ok but that could be also done easily in user space that reads
> PCI config space. No need for a complicated kernel driver at all.

User/kernel the task the driver has to do is the same, so
the complexity doesn't really change.

On some chipsets the registers are memory mapped, on others the
memory controller is hidden by default.

All of which are hard to deal with from user space.

>> Isn't something that just works, and is not at the mercy of the BIOS
>> developers with too little time worth doing?
>
> I just don't see how it's very useful if you don't know which DIMM
> to replace in the first place.

You do know which DIMM you just don't know what the label on the
motherboard is. That is a lot different from knowing that some DIMM
is bad.

> And to know that in your scheme you need
> your magical database with all motherboards ever shipped, which
> I don't consider realistic.

No you do not need a magic database. Having the mapping will save
you a few minutes as you debug your hardware, but it is not critical.

If you have a map of rank to motherboard label the work flow is:
- Look for the label on the motherboard and pull that DIMM.

If you do not have a map of rank to motherboard label the work flow is:
- Make an educated guess
- Boot up and see if you have pulled your target DIMM.
If so stop.
- If not turn off the computer
- Replace the DIMM
- start again with a different DIMM.

A simple linear search will not take long, and because you don't
have to reproduce the error on the problem DIMM it will go much
faster then if all you knew was that a DIMM was bad.

If you replace a lot of bad memory on a particular mother board
you will either build the map in your head or you will store it
in a file. Once you have the map in a file hopefully it will get
sent to the EDAC maintainer and some one else can be saved the
trouble.

Historically memory has had a pretty bad infant mortality so there
are people who replace a lot of memory on particular motherboards.

Eric
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/