Re: [PATCH] NMI trap revised (was Re: NMI errors in 2.0.30??)

Riccardo Facchetti (fizban@mbox.vol.it)
Thu, 8 May 1997 13:29:58 +0200 (MET DST)


On Thu, 8 May 1997, Rogier Wolff wrote:

> > + /*
> > + * May be sort out what memory chip is failing ?
> > + * Heh ... with parity memory we can be a good memory
> > + * test program too :)
> > + * It should be something like:
> > + *
> > + * (1) disable NMI interrupts writing 1 in bit 7 of
> > + * port 0x70
> > + * (2) reset the NMI memory parity error flag (bit 7)
> > + * toggling bit 2 of 0x61 port to 1 and then to 0
> > + * (3) while all flat memory is tested:
> > + * (4) write 4Kb page in memory
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> > + * (5) test if any NMI is pending: if yes, the
> > + * last page written is bogus, printk its
> > + * address.
> > + * (6) ++ page
>
> I maintain the "sig11" page which explains a lot about bad memory. It
> allows me to get lots of mails from people with different memory
> problems. In more than 90% of the cases the problems only show when
> you do a kernel compile. My memory test program (which is based on the
> knowledge of a leading expert on semiconductor memory testing.) detects
> only a very small percentage of memory trouble.

Yes, of course. I'm not saying that the NMI method is the ultimate
solution. Of course it is like all the other methods. But parity memory
gives you a hardware mechanism to check for the error condition, so you
can skip the whole "write a pattern, read a pattern, compare the two
patterns" cycle. You just have to write to all the memory locations and
poll port 0x61 for the error condition. Of course this works _only_ with
parity memory.
And just one more thing. I think this is faster than the
write-read-compare loop. Put another way, for an equal amount of time
spent running the job, this method can do more passes than the memtest
one, because the hardware does all the checks for you. Doing more tests in
the same amount of time, you are more likely to find the error: it is just
a matter of likelihood. It is just the same as compiling the kernel, where
the memory test is simply "use this pointer for normal compile operation
<- bad pointer ? -> sig11": no overhead for test patterns, just a lot of
pointer usage across all the available memory, and the hardware (in a
different way than NMI) does the checks for you.
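
The write-and-poll loop described above could be sketched roughly as
follows. This is only a user-space simulation, not the patch itself: the
sim_* helpers, SIM_PAGE_SIZE and parity_scan() are hypothetical names made
up for illustration, and the "hardware" is a pair of variables standing in
for ports 0x61 and 0x70. A real version would have to do port I/O from
privileged code. Also note the assumption baked into the simulation that a
write alone raises the flag; on real hardware parity is normally checked
when memory is read, so a real loop would probably need to read each page
back (without comparing) before polling.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SIM_PAGE_SIZE 4096u      /* 4Kb page, as in the patch comment   */
#define PORT_NMI_MASK 0x70       /* bit 7 = 1 masks NMI                 */
#define PORT_SYS_CTRL 0x61       /* bit 7 = parity flag, bit 2 = reset  */

/* Simulated hardware state (hypothetical, stands in for real ports). */
static uint8_t sim_port61;
static uint8_t sim_port70;
static size_t  sim_bad_page = (size_t)-1;   /* page that "fails" parity */

static uint8_t sim_inb(uint16_t port)
{
    return port == PORT_SYS_CTRL ? sim_port61 : sim_port70;
}

static void sim_outb(uint8_t value, uint16_t port)
{
    if (port == PORT_SYS_CTRL) {
        if (value & 0x04)                    /* bit 2 set: clear the flag */
            sim_port61 &= (uint8_t)~0x80u;
        /* bit 7 is read-only; only the low nibble is writable here */
        sim_port61 = (uint8_t)((sim_port61 & 0x80u) | (value & 0x0fu));
    } else if (port == PORT_NMI_MASK) {
        sim_port70 = value;
    }
}

/* Write one page; the simulation latches the flag for the bad page. */
static void sim_write_page(uint8_t *mem, size_t page)
{
    memset(mem + page * SIM_PAGE_SIZE, 0xAA, SIM_PAGE_SIZE);
    if (page == sim_bad_page && !(sim_port61 & 0x04))
        sim_port61 |= 0x80;
}

/* Step (2): toggle bit 2 of port 0x61 to 1 and then back to 0. */
static void reset_parity_flag(void)
{
    uint8_t ctl = (uint8_t)(sim_inb(PORT_SYS_CTRL) & 0x0fu);

    sim_outb((uint8_t)(ctl | 0x04u), PORT_SYS_CTRL);
    sim_outb((uint8_t)(ctl & ~0x04u), PORT_SYS_CTRL);
}

/*
 * Scan npages pages of mem. Stores up to max_bad failing page numbers
 * in bad[] and returns the total number of failing pages seen.
 */
static size_t parity_scan(uint8_t *mem, size_t npages,
                          size_t *bad, size_t max_bad)
{
    size_t found = 0;

    assert(mem != NULL && bad != NULL);

    sim_outb(0x80, PORT_NMI_MASK);           /* (1) mask NMI            */
    reset_parity_flag();                     /* (2) clear stale flag    */

    for (size_t page = 0; page < npages; page++) {   /* (3), (6)        */
        sim_write_page(mem, page);           /* (4) write 4Kb page      */
        if (sim_inb(PORT_SYS_CTRL) & 0x80) { /* (5) error pending?      */
            if (found < max_bad)
                bad[found] = page;
            found++;
            reset_parity_flag();             /* re-arm for next page    */
        }
    }

    sim_outb(0x00, PORT_NMI_MASK);           /* unmask NMI again        */
    return found;
}
```

The point of the sketch is that the inner loop contains no compare at all:
one write pass and one port read per page. Calling parity_scan() on a
buffer with sim_bad_page set to some page number reports that page in
bad[].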

Just my thought.

Ciao,
Riccardo.