Re: NMI errors in 2.0.30??, High Availability-Linux

Doug Ledford (dledford@dialnet.net)
Fri, 25 Apr 1997 21:47:16 -0500


--------
>
> On Thu, 24 Apr 1997, Jon Lewis wrote:
> > Uhhuh. NMI received. Dazed and confused, but trying to continue
> > You probably have a hardware problem with your RAM chips or a
> > power saving mode enabled.
> >
> > I really don't believe the message, as this is a Tomcat IIID (running with
> > 2 CPU's but not an SMP kernel), 4 8x36-60 simms, and the setup passed
> > several hours of memtest86 before going online. The CMOS setup is
> > configured to do ECC and report single bit errors...could this cause
> > problems for linux? I always disable all the power saving stuff...so I'd
> > say there's at least a 99% chance it's turned off. Is it possible some
> > other random kernel bug is at fault?
>
> I am wondering this now as well. I have just upgraded to v2.0.30 on my news
> server here and have started receiving several NMI messages as well. I have
> had ECC turned on in this machine since day 1 (Tyan S1668, w/ 128MB Parity memory).
> and have never seen these messages before on the system.

Seeing this problem on the Tyan motherboards, which are widely known to be
some of the fastest boards you can get, I would seriously look into the
following.

Try de-tuning your cache/RAM in your machine's BIOS and see if the problems go
away. It wouldn't surprise me at all if the Tyan corporation ekes some of that
extra speed out of the RAM and the machine by overtuning RAM speeds on the
knowledge that DOS, Winbloze, etc. run too slowly to cause problems. On the
other hand, even a modest increase in the speed of linux has been known to
cause these problems in the past (and they often occur in the ext2 code, both
during checks and during operation). On normal machines, this would result in
corruption. With parity and ECC RAM, it gets caught with these messages (ECC
corrects it; I don't believe the parity RAM does anything but note the problem,
and we still get the occasional corruption). If you can de-tune your RAM or
cache and the problem goes away, then it's a fairly solid indicator that
2.0.30 is slightly faster than 2.0.29 and it's causing marginal memory setups
to break.

As to the statements about memtest86. Hmmm....my canned response would be to
junk that program. I've had more luck breaking memory with make -j zlilo than
I have with DOS memory testers (with 64MB of RAM anyway, with less I would
bound the -j instead of leaving it an infinite parallel make). I've also had
more luck breaking cache by simply copying a large file (250+MB) from
a stable machine to the test machine on a quiet 10BT network with fast bus
mastering NICs that pump the data quickly (moving all that data through
the cache tends to exercise it really well). I have yet to find a memory
tester that finds broken cache/RAM as well as a finely tuned Unix system (even
SCO has found broken memory where CheckIt 3.0 and MicroDiag 2000 both passed
the stuff). I think one of the inherent problems with dedicated memory
testers like this is the simple fact that they don't simultaneously slam the
RAM from both the CPU and the bus side. On modern PCI chipsets, it is
entirely possible that your bus mastering DMA controllers can be accessing RAM
at the same time as you are (arbitrated by the chipset) and the combined load is
much harder on the RAM, and the cache, than any simple memory tester.
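
For the big-file copy, one way to tell whether the data arrived intact is to
compare checksums taken on both machines. The following is only a sketch of
that idea, not something from the test as I described it: the function names
file_sum and check_copy are made up for illustration, and how you move the
file (ftp, rcp, NFS, whatever) is left to you. It assumes a POSIX cksum and
awk on both ends.

```shell
#!/bin/sh
# Hypothetical helpers (names are illustrative, not from any standard tool).

# file_sum FILE
# Print "CRC SIZE" for FILE using POSIX cksum. Reading from stdin keeps
# the file name out of the output so sums from two machines compare cleanly.
file_sum() {
    cksum < "$1" | awk '{ print $1, $2 }'
}

# check_copy EXPECTED FILE
# EXPECTED is the "CRC SIZE" string produced by file_sum on the stable
# machine. Returns 0 on a match, 1 on a mismatch, 2 if FILE is unreadable.
check_copy() {
    [ -r "$2" ] || { echo "cannot read $2"; return 2; }
    actual=$(file_sum "$2")
    if [ "$actual" = "$1" ]; then
        echo "ok: $2"
    else
        echo "MISMATCH: $2 (expected $1, got $actual)"
        return 1
    fi
}
```

Run file_sum on the stable machine, copy the file over, then run check_copy
with that string on the test machine. Repeat the copy a few times, since a
marginal cache will not necessarily fail on the first pass.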

A good test for this that I used to use is as follows:

grab the latest linux source tree (that's right, you've already got that :)

find a directory in a filesystem with at least 200MB free.
cd there.
gunzip -c linuxxxx.tar.gz > linux.tar    # uncompressed copy, named linux.tar
tar xvf linux.tar
mv linux linux.reference
for i in 1 2 3 4 5 6 7 8 9 10
do
    tar xvf linux.tar
    mv linux linux.$i
done
for i in 1 2 3 4 5 6 7 8 9 10
do
    diff -rN linux.reference linux.$i
done

This particular test finds memory errors on systems with good bus mastering
disk controllers long before any DOS tester ever does. If the diffs don't
come back clean (no output to screen) then you have a problem somewhere,
usually memory, but depending on your hard drive and controller, it could also
be the driver if it isn't known to be reliable. Also, adjust the number of
loops according to the free space you have, since you can fill a disk this way
:)
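
If you want the loop count adjustable and a clear pass/fail at the end, the
same test can be wrapped up roughly as below. This is a sketch under a few
assumptions: the function names make_copies and compare_copies are invented
here, the uncompressed tar file is in the current directory, and you pass in
the name of the directory it unpacks to (linux for a kernel tree).

```shell
#!/bin/sh
# Hypothetical wrappers around the extract-and-diff test.

# make_copies TARFILE TOP COUNT
# Unpack TARFILE once as TOP.reference, then COUNT more times as
# TOP.1 .. TOP.COUNT. Returns 2 if any tar or mv step fails.
make_copies() {
    tar xf "$1" || return 2
    mv "$2" "$2.reference" || return 2
    i=1
    while [ "$i" -le "$3" ]; do
        tar xf "$1" || return 2
        mv "$2" "$2.$i" || return 2
        i=$((i + 1))
    done
}

# compare_copies TOP COUNT
# Diff every copy against the reference. Any diff output means trouble.
# Returns 0 if all copies are identical, 1 otherwise.
compare_copies() {
    failed=0
    i=1
    while [ "$i" -le "$2" ]; do
        if ! diff -rN "$1.reference" "$1.$i"; then
            echo "copy $i differs from $1.reference"
            failed=1
        fi
        i=$((i + 1))
    done
    return $failed
}
```

For the test as described, that would be "make_copies linux.tar linux 10"
followed by "compare_copies linux 10". Lower the count if you are short on
disk space.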

Why does this work so well? Simple: by unzipping the tar file, you've created
one large file that is extremely fast for modern processors to run through and
grab the pertinent data out of, which tar then writes into new files. This
involves lots of memory writes. Then, the process of reading inode tables
from disk, etc. results in more memory writes. The process of writing the
information to disk results in lots of bus master transfers from RAM to disk
(assuming you have a decent SCSI or bus mastering IDE controller, but you have
128MB of RAM so assume this to be the case :). You end up with a large chunk
of memory that gets read repeatedly and copied over, then read and copied out
by your disk controller. The faster your disk controller, the better this
test works. If you leave the file gzipped then it exercises your CPU more and
your RAM less, so it can also be used as a marginal CPU test (it won't pick up
as many defective CPUs as the make -j zlilo will).
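
The gzipped variant just means decompressing on every pass instead of once
up front. A minimal sketch, assuming a gunzip that supports -c and a tar that
reads the archive from stdin with "f -" (the function name unpack_gz_copy is
made up for illustration):

```shell
#!/bin/sh
# Hypothetical helper for the CPU-heavier variant of the test.

# unpack_gz_copy TARBALL_GZ TOP N
# Decompress TARBALL_GZ and unpack it in one pipeline, then rename the
# resulting TOP directory to TOP.N so the next pass can unpack again.
unpack_gz_copy() {
    gunzip -c "$1" | tar xf - || return 2
    mv "$2" "$2.$3"
}
```

Call it in a loop in place of the plain "tar xvf" step, then diff the copies
against the reference exactly as before.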

Alan: Is there a reason that I don't know of that you always recommend
memtest86? I can think of two reasons. First, it's a hell of a lot easier to
say grab memtest86 than it is to write an essay like I did. Second, memtest86
(if it manages to find something) is definitive as to the cause, whereas tests
like I just wrote about don't truly isolate the problem unless you have
already eliminated the other possible causes (CPU and disk primarily). Are
there any other reasons I don't know about that you recommend memtest86 as the
stock reply for possible memory problems? Then again, what about the point I
brought up about memtest86 not exercising memory in conjunction with heavy
memory hits by bus mastering controllers?

-- 
*****************************************************************************
* Doug Ledford                      *   Unix, Novell, Dos, Windows 3.x,     *
* dledford@dialnet.net    873-DIAL  *     WfW, Windows 95 & NT Technician   *
*   PPP access $14.95/month         *****************************************
*   Springfield, MO and surrounding * Usenet news, e-mail and shell account.*
*   communities.  Sign-up online at * Web page creation and hosting, other  *
*   873-9000 V.34                   * services available, call for info.    *
*****************************************************************************