Re: Memory Trauma

Ken Jordan (kenjordan@massmedia.com)
Tue, 19 Nov 1996 13:47:59 -0800 (PST)


On Tue, 19 Nov 1996, Nicholas J. Leon wrote:

>
> Folks -
>
> I'm hoping that someone could clear something up for me. It is in
> regard to memory. If you will remember, about 1 week ago I posted a
> comment about my new ASUS and its EDO RAM that wouldn't work unless I
> cut it in half with a mem=8m boot parameter. It was that letter that
> sparked the thread on NMI/ECC.
>
> Well, I got replacement memory and sure enough, all works well.
>
> So here's my dilemma: why didn't memtest 1.1 notice the bad RAM? This
> wasn't the type of problem that showed up intermittently: after booting
> my kernel, INIT would ALWAYS fail.... ALWAYS. As would initrd.
>
> So what's the difference between how the real kernel accesses memory
> and how memtest does? It seems that memtest isn't the checker it should
> be. Many people on this list complain of bad RAM, checked "OK" by
> memtest but failing with the kernel.
>
> I believe we should look into providing another tool for detecting
> these errors. Not a part of the true kernel, but perhaps derived from
> it. At least that way, hopefully, we can get consistent errors from
> the kernel and memtest.
>
> Just my $0.02 ....
>

Memtest did find a problem that I had with a bad SIMM that three different
DOS memory test programs didn't find.

I did have to disable my caches in the BIOS before memtest started working
correctly (with them enabled, it would just _fly_ through the addresses
instead of the normal slow plod, presumably because most accesses were
hitting the cache rather than the SIMMs).

No memory test program can find all possible RAM errors, though (many are
subtle or extremely timing-dependent).

The best burn-in test I've found to see if a Linux machine is reliable is
to make it rebuild the kernel over and over with "make -j5" (or so). Let
this run over a weekend, then check the log and make sure no "signal
11" or "internal error" messages appear (and that the zImage from each
build cmp's identical).
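
For anyone who would rather not babysit the loop by hand, here is a rough
sketch of that burn-in in Python. It is only an illustration under a few
assumptions of mine, not a tested tool: the "make clean" between passes,
the arch/i386/boot/zImage path, the zImage.reference file name and the
function names are all my own choices. It also assumes the build is
byte-for-byte repeatable; if your kernel build stamps a date or build
counter into the image, the comparison step would need to be relaxed.

```python
import filecmp
import re
import shutil
import subprocess
from pathlib import Path

# The two messages worth looking for in the build output.
FAILURE_PATTERN = re.compile(r"signal 11|internal error", re.IGNORECASE)


def suspicious_lines(log_text):
    """Return the log lines that mention 'signal 11' or 'internal error'."""
    return [line for line in log_text.splitlines()
            if FAILURE_PATTERN.search(line)]


def burn_in(source_dir, passes, image="arch/i386/boot/zImage", jobs=5):
    """Rebuild the kernel `passes` times and return a list of problems.

    `image` and the reference file name below are assumptions; adjust
    them for your own tree.
    """
    source_dir = Path(source_dir)
    reference = source_dir / "zImage.reference"  # hypothetical file name
    problems = []
    for n in range(1, passes + 1):
        # Assumption: force a full rebuild on every pass.
        subprocess.run(["make", "clean"], cwd=source_dir,
                       capture_output=True, text=True)
        build = subprocess.run(["make", f"-j{jobs}", "zImage"],
                               cwd=source_dir,
                               capture_output=True, text=True)
        log = build.stdout + build.stderr
        problems += [f"pass {n}: {line}" for line in suspicious_lines(log)]
        if build.returncode != 0:
            problems.append(f"pass {n}: make exited with {build.returncode}")
            continue
        built = source_dir / image
        if not reference.exists():
            # First good build becomes the image all later ones must match.
            shutil.copyfile(built, reference)
        elif not filecmp.cmp(built, reference, shallow=False):
            problems.append(f"pass {n}: zImage differs from the first build")
    return problems
```

Calling something like burn_in("/usr/src/linux", passes=200) and printing
the returned list on Monday morning would be the idea; an empty list means
nothing suspicious turned up.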

Take it easy,
Ken Jordan