Re: NMI errors in 2.0.30??, High Availability-Linux

Gabriel Paubert (paubert@iram.es)
Fri, 25 Apr 1997 12:49:22 +0200 (METDST)


On Thu, 24 Apr 1997, Jon Lewis wrote:

> Uhhuh. NMI received. Dazed and confused, but trying to continue
> You probably have a hardware problem with your RAM chips or a
> power saving mode enabled.
>
> I really don't believe the message, as this is a Tomcat IIID (running with
> 2 CPU's but not an SMP kernel), 4 8x36-60 simms, and the setup passed
> several hours of memtest86 before going online. The CMOS setup is
> configured to do ECC and report single bit errors...could this cause
> problems for linux? I always disable all the power saving stuff...so I'd
> say there's at least a 99% chance it's turned off. Is it possible some
> other random kernel bug is at fault?

> The system this one replaced used to get occasional page table or swap
> corruption, and people suggested it could be bad RAM. It's been running
> memtest86 for over a week and done 143 passes with 0 errors.

I just downloaded the source code of memtest86 and IMHO it has two design
flaws:

- it tries to slow down the refresh rate to catch bad RAMs. Great idea,
but the timer is only used to generate a refresh signal on the ISA bus and
is a relic from the very first PC. PCI chipsets now generate their own
refresh on the DRAM side of the bridge, and it seems you cannot do
anything about it (from Intel's 430VX chipset doc, section 4.4, file
29055301.pdf).

- when trying to detect problems with RAM refresh, you should stop all
memory accesses for much longer than the RAM refresh period. This is because
any access to a DRAM implicitly includes a refresh cycle, and depending on
how the physical address is multiplexed onto the DRAM columns/rows, a simple
sequential access can end up refreshing the whole memory very quickly.
This pause should be implemented as a very simple idle loop
running off the cache (1: dec %eax; jnz 1b). I have not seen any such
pause in memtest86.
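
To make the second point concrete, here is a minimal sketch of the
write / pause / verify structure I have in mind. It is not code from
memtest86; the names idle_spin and refresh_hold_test are made up for
illustration, and the empty inline asm is a GCC/Clang idiom used only to
keep the counter in a register so the compiler does not delete the loop.
As a user-space program it cannot really stop other memory traffic or
bypass the cache, so it only shows the shape of the test: a real version
would have to run standalone, with interrupts off and the buffer flushed
out of the cache before the pause.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: count down in a register without touching the
 * buffer under test. C equivalent of "1: dec %eax; jnz 1b". The empty asm
 * statement (GCC/Clang extension) keeps the loop from being optimized out. */
static void idle_spin(unsigned long n)
{
    while (n != 0) {
        n--;
        __asm__ volatile("" : "+r"(n));
    }
}

/* Hypothetical test: fill the buffer with alternating pattern/~pattern,
 * leave it alone for spin_count iterations, then count the words that no
 * longer hold the value written. Returns the number of bad words. */
static size_t refresh_hold_test(volatile uint32_t *buf, size_t words,
                                uint32_t pattern, unsigned long spin_count)
{
    size_t i, bad = 0;

    assert(buf != NULL && words > 0);

    /* Write phase. */
    for (i = 0; i < words; i++)
        buf[i] = (i & 1) ? ~pattern : pattern;

    /* Hold phase: no access to buf at all during the pause. */
    idle_spin(spin_count);

    /* Verify phase. */
    for (i = 0; i < words; i++) {
        uint32_t expect = (i & 1) ? ~pattern : pattern;
        if (buf[i] != expect)
            bad++;
    }
    return bad;
}
```

The spin count has to be chosen so that the pause lasts much longer than
the refresh period of the RAM, which means calibrating it against the CPU
clock; the sketch leaves that to the caller.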

So my conclusion is that if your memory has refresh problems, memtest86
will likely not catch them.

Now, has anybody written code to handle ECC memory, detecting and
reporting errors?

I have a board (running Linux/PPC) with ECC memory and the chipset's
documentation, and I am ready to participate in writing code for this
kind of thing. A merged PPC/x86 code base has the advantage of being
endian independent and easy to port to any other machine type. IMO the
High Availability Linux project requires this, as any system without
ECC memory cannot qualify as highly available.

Gabriel.