Re: [PATCH-2] NMI trap revised (was Re: NMI errors in 2.0.30??) (fwd)

Gabriel Paubert (paubert@iram.es)
Tue, 13 May 1997 22:13:30 +0200 (METDST)


You may also receive an earlier version of this message, which I sent on
Friday, but it seems we were in a black hole here, receiving mail but unable
to send anything. At first we did not worry, since e-mail delays of up to
4 days are commonplace here, and one message actually took 2 months to
reach me last year :-(

On Fri, 9 May 1997, Riccardo Facchetti wrote:

> I have implemented the memory check.
> Someone suggested that I should read instead of write. Read two times ...
> ah ... heh ... I remember ... processor caches ... hmmm in the next patch
> I will correct this thing :)
> nghe ... I'm just curious if it works or not (heh ... I have no way to
> test it). I suspect the memory test should be done in a cli()/sti() pair,
> because we do not want to be disturbed by NMIs not generated intentionally by
> us.

During the NMI routine all interrupts are masked anyway, and even NMIs are
masked until the CPU executes an IRET instruction. But the problem is
printk...

There is probably a much simpler way of performing the test without
allocating any memory. Remember that you simply need to read the page, so
the following should work (I've checked that it generates the correct code),
with no kmalloc and no memcpy:

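volatile int *scan_ptr;
/* Read every word of the page; volatile keeps the loads from being optimized away. */
for (scan_ptr = (int *)page_ptr;
     scan_ptr < (int *)(page_ptr + PAGE_SIZE); scan_ptr++)
	(void)*scan_ptr;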

Alternatively, you could use inline assembly; this is the only case I know
of in which the seemingly nonsensical "rep lodsl" instruction could be
useful :)

asm("cld; rep; lodsl": : "S" (page_ptr), "c" (1024): "ax", "cx", "si");

You can then even use the resulting value in %esi as the pointer to the next
page! Actually, using rep lodsl might be important on 386s: it minimizes the
number of memory accesses needed to fetch instructions, and those fetches
could themselves raise an NMI while scanning an otherwise perfectly valid
page.
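
A minimal user-space sketch of the page scan described above, written so
that it compiles with a current GCC or Clang. Assumptions: the names
scan_page and SCAN_PAGE_SIZE are made up for illustration, pages are 4 KiB,
and the two registers are declared as read-write operands ("+S", "+c")
instead of being listed as clobbers, because recent GCC versions reject a
clobber that overlaps an operand. On non-x86 targets it falls back to a
plain volatile loop. This is not the code of the actual patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SCAN_PAGE_SIZE 4096u /* assumption: 4 KiB pages, as on i386 */

/*
 * Read every 32-bit word of one page and discard the values.
 * Returns the address just past the page, i.e. the start of the next one.
 */
static const void *scan_page(const void *page)
{
    /* lodsl does not need alignment, but a page address is always aligned. */
    assert(((uintptr_t)page & 3u) == 0);

#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    size_t count = SCAN_PAGE_SIZE / 4; /* number of 32-bit loads */

    /* The source pointer and the counter are both modified by rep lodsl. */
    __asm__ __volatile__("cld; rep; lodsl"
                         : "+S"(page), "+c"(count)
                         :
                         : "eax", "memory", "cc");
    return page; /* the source register now points past the page */
#else
    /* Portable fallback: volatile forces every load to be performed. */
    const volatile uint32_t *p = page;
    const volatile uint32_t *end = p + SCAN_PAGE_SIZE / 4;

    while (p < end)
        (void)*p++;
    return (const void *)end;
#endif
}
```

Because the function returns the advanced pointer, consecutive pages can be
walked with p = scan_page(p).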

BTW: if, after reading the whole memory twice, you have not detected a single
error, what do you do? I see two possible explanations:
- the code was performing some very short read-modify-write sequence,
possibly in a single instruction. The bad memory location has been
overwritten. A good reason to panic, especially if the two low order bits
of CS saved on the stack are zero (kernel mode NMI)...

- the BIOS memory timings are a bit too aggressive, perhaps causing errors
only when there is a lot of bus activity: this heats up the bus drivers
and memory chips and slows them down (at least CMOS chips; bipolar logic
actually runs faster at high temperature). In this case the memory may be
good, but overly tight timings have caused metastability in a latch or
flip-flop. This is likely to be very temperature dependent. Try moving the
computer to a slightly warmer place and loading your machine (kernel
compile...); a few degrees can make a large difference... Then try changing
your BIOS settings before anything else.
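
The kernel-mode test mentioned in the first explanation can be sketched as
follows. The function name is a made-up illustration, and virtual-8086 mode
(where the saved CS is a real-mode segment, so its low bits say nothing
about privilege) is deliberately ignored here.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * The two low-order bits of the CS selector saved on the stack hold the
 * privilege level of the interrupted code: 0 means ring 0 (kernel mode).
 */
static bool nmi_from_kernel_mode(uint16_t saved_cs)
{
    return (saved_cs & 3u) == 0;
}
```

For example, a selector of 0x10 has privilege bits 0 and counts as kernel
mode, while 0x23 has privilege bits 3 and does not.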

Gabriel