Re: Memory Trauma

Richard B. Johnson (root@analogic.com)
Tue, 19 Nov 1996 11:58:31 -0500 (EST)


On Tue, 19 Nov 1996, Nicholas J. Leon wrote:

>
> So here's my dilemma: why didn't memtest 1.1 notice the bad ram? This
> wasn't the type of problem that showed up intermittently: after booting
> my kernel INIT would ALWAYS fail.... ALWAYS. As would initrd.
>
> So what's the difference between how the real kernel accesses memory
> and memtest? It seems that memtest isn't the checker it should
> be. Many people on this list complain of bad ram, checked "OK" by
> memtest but failing with the kernel.
>

You have discovered the oldest software problem. You can't test RAM using
a program that runs in the RAM being tested! You can try, but you will
probably not find the bad RAM.

Most RAM testing programs work by reading what was written to see if the
results are the same. There are "walking-bit" tests, XOR tests, etc., all
designed to verify that what you wrote, you get back. The big problem is
that you don't know where else the write or the read occurred! You can't
assume that each bit in RAM is unique. It is supposed to be, but you do
NOT KNOW if it is. The slightest timing problem will result in several
places being modified in RAM. Your program can't know this unless it
crashes because the program got destroyed in the process.
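To make the aliasing point concrete, here is a minimal sketch in Python that
simulates it. The AliasedRAM class, its dead address line, and the
readback_test function are all hypothetical names invented for illustration;
this is a toy model, not how memtest or any real tester is written.

```python
class AliasedRAM:
    """Toy RAM model in which one address line is dead (hypothetical fault)."""

    def __init__(self, size, dead_address_bit):
        self.cells = [0] * size
        # The dead line always looks like 0, so two addresses share one cell.
        self.mask = ~(1 << dead_address_bit)

    def write(self, addr, value):
        self.cells[addr & self.mask] = value & 0xFF

    def read(self, addr):
        return self.cells[addr & self.mask]


def readback_test(ram, size):
    """Write walking-bit patterns and read each one straight back."""
    for addr in range(size):
        for bit in range(8):
            pattern = 1 << bit
            ram.write(addr, pattern)
            if ram.read(addr) != pattern:
                return False
    return True


ram = AliasedRAM(size=256, dead_address_bit=4)
print(readback_test(ram, 256))   # True: the faulty RAM passes the test

ram.write(0x00, 0xAA)
ram.write(0x10, 0x55)            # silently lands on cell 0x00 as well
print(hex(ram.read(0x00)))       # the byte at 0x00 is no longer 0xAA
```

Every write reads back correctly, so the test reports success even though
half of the addresses overwrite the other half.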

RAM is organized into BITS (not bytes, words, longwords, pages, etc.).
For RAM to be working properly, every bit must be unique and every bit must
store its state forever. Most processors won't allow you to read or write
to a single bit in memory. The fact that you must read/write in bytes or
words or longwords makes it impossible to truly test a single bit. You
can try by writing a byte, for instance, with various bits set, then
reading the result to see if it "took". However, you are fooling yourself
if you think that you have really tested that memory location.

To test a single byte, you would have to save a pattern of all the bits
existing in all of RAM. Then modify a single bit. Then read all the
other bits in all of RAM to make sure that they haven't changed.
Then you would have to alter a single bit in the rest of the RAM and
do the same thing all over again. You would do this (N^2 - 1) times
with N being the total number of bits in RAM. Then you would
change the next bit of your tested byte and do the same thing all over
again.

Then you go on to the next byte, etc. In a few years you would have tested
all your RAM with a high probability of catching a single-bit failure.
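As a rough sanity check on the "few years" figure, the sketch below counts
only the (N^2 - 1) single-bit checks mentioned above and ignores the extra
passes, so it is a lower bound at best. The 16 MB RAM size and the rate of
100 million checks per second are assumptions picked for illustration, not
numbers from the original discussion.

```python
def exhaustive_test_years(ram_bytes, checks_per_second):
    """Estimate the run time of N^2 - 1 single-bit checks, in years."""
    n_bits = ram_bytes * 8
    total_checks = n_bits * n_bits - 1
    seconds = total_checks / checks_per_second
    return seconds / (365 * 24 * 3600)


# Assumed machine: 16 MB of RAM, 100 million checks per second.
years = exhaustive_test_years(16 * 1024 * 1024, 100_000_000)
print(f"{years:.1f} years")
```

Under these assumptions the result comes out between five and six years,
and it grows with the square of the RAM size.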

It would still not have caught pattern sensitivity problems nor would
it have caught the fact that you didn't actually write to RAM when you
thought you did. These two problems exist. The pattern sensitivity can
come about due to poor power supply regulation and/or poor RAM power
rail bypassing. The fake-write problem comes about because the bus will
store (in its capacity) the last data written for a few tens of
nanoseconds.

Suppose you have NO RAM at location X. You write a byte to the location
then immediately read it back. If you can write and read quickly, the
data from the write will still be on the bus. You read exactly what you
wrote and presume that RAM must be good when, in fact, RAM did not even
exist.

You can attempt to "fix" this problem by putting something else on
the data bus between the write and the read (perhaps with a push and
a pop). The end result may be useful, but not conclusive.
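The fake-write effect and the "put something else on the bus" workaround
can be sketched with a toy model. The FloatingBus class and both check
functions are hypothetical, and the scratch write merely stands in for the
push/pop idea; real bus behavior depends on the hardware.

```python
class FloatingBus:
    """Toy data bus that keeps the last value driven onto it (hypothetical)."""

    def __init__(self, populated):
        self.populated = set(populated)  # addresses backed by real RAM
        self.cells = {}
        self.last_value = 0

    def write(self, addr, value):
        self.last_value = value          # every write charges the bus
        if addr in self.populated:
            self.cells[addr] = value

    def read(self, addr):
        if addr in self.populated:
            self.last_value = self.cells.get(addr, 0)
        # With no RAM present nothing drives the bus; the old charge is read.
        return self.last_value


def naive_check(bus, addr, pattern=0xA5):
    """Write a pattern and read it straight back."""
    bus.write(addr, pattern)
    return bus.read(addr) == pattern


def check_with_disturb(bus, addr, scratch_addr, pattern=0xA5):
    """Same check, but drive different data onto the bus before reading."""
    bus.write(addr, pattern)
    bus.write(scratch_addr, pattern ^ 0xFF)
    return bus.read(addr) == pattern


bus = FloatingBus(populated=range(0, 16))
print(naive_check(bus, 100))             # True: missing RAM looks good
print(check_with_disturb(bus, 100, 0))   # False: the stale value was replaced
print(check_with_disturb(bus, 5, 0))     # True: real RAM still passes
```

In this model the intervening write exposes the missing RAM, but as noted
above, on real hardware the result is useful rather than conclusive.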

Therefore RAM testing programs are useful but they do not thoroughly
test RAM. They may find bad chips if the chip is so bad that it doesn't
function very well at all. However, problems due to timing, design, and
drifting RAM characteristics are unlikely to be found.

Cheers,
Dick Johnson
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Richard B. Johnson
Project Engineer
Analogic Corporation
Voice : (508) 977-3000 ext. 3754
Fax : (508) 532-6097
Modem : (508) 977-6870
Ftp : ftp@boneserver.analogic.com
Email : rjohnson@analogic.com, johnson@analogic.com
Penguin : Linux version 2.1.11 on an i586 machine.
Warning : It's hard to remain at the trailing edge of technology.
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-