Flaky memory; would like help diagnosing

Colin Plumb (colin@nyx.net)
Fri, 19 Apr 96 13:14:14 MDT


I'm suffering the usual flaky-motherboard problems, occasional
signal 11s when compiling the kernel, etc. I've written a few diagnostics
to help me figure things out, and a few kernel patches to make things more
stable, and I'm looking for more ideas.

The problem appears to *always* take the form of bit 1 becoming
mysteriously set at an address of the form 0x?????004, i.e. offset 4
on a 4kB page. In the kernel, this shows up in free_one_pmd
(bad directory entry 0x00000001, which is treated as 0 and is thus
harmless) and "block on free list not free" because the pointer *to*
the block gets corrupted and incremented, resulting in the magic number
not matching. This causes problems forking and execing and generally
makes the system unusable, so to let myself shut down cleanly I
put in a kernel kludge (all hail having source!) to detect and fix the
problem.

I have since written a user program which allocates most of memory and
walks through it writing test patterns, then reading them back and checking
for errors. If I make the amount of memory large enough to induce swapping,
the rate at which I encounter errors goes up considerably, even though
the rate at which test vectors are performed goes down (due to the
time spent swapping).
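
For concreteness, the core of such a tester might look roughly like the
sketch below. This is not the actual program; the names (fill_pattern,
check_pattern, struct mem_error, page_offset) and the pattern function
are made up for illustration. It fills a buffer with an index-dependent
pattern, reads it back, and records each mismatch so that the flipped
bits (expected XOR found) and the offset within the 4 kB page can be
inspected.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One recorded mismatch: which word, what was written, what came back. */
struct mem_error {
    size_t   index;
    uint32_t expected;
    uint32_t found;
};

/* Hypothetical pattern generator: any function of (index, seed) will do. */
static uint32_t test_pattern(size_t i, uint32_t seed)
{
    return ((uint32_t)i * 2654435761u) ^ seed;
}

/* Write the test pattern into every word of the buffer. */
void fill_pattern(volatile uint32_t *buf, size_t words, uint32_t seed)
{
    assert(buf != NULL);
    for (size_t i = 0; i < words; i++)
        buf[i] = test_pattern(i, seed);
}

/* Read the buffer back; return the number of mismatching words and
 * store details of the first max_errs of them in errs[]. */
size_t check_pattern(const volatile uint32_t *buf, size_t words,
                     uint32_t seed, struct mem_error *errs,
                     size_t max_errs)
{
    size_t n = 0;

    assert(buf != NULL);
    for (size_t i = 0; i < words; i++) {
        uint32_t want = test_pattern(i, seed);
        uint32_t got = buf[i];

        if (got != want) {
            if (n < max_errs)
                errs[n] = (struct mem_error){ i, want, got };
            n++;
        }
    }
    return n;
}

/* Offset of a (virtual) address within its 4 kB page. */
unsigned page_offset(const volatile void *p)
{
    return (unsigned)((uintptr_t)p & 0xFFFu);
}
```

For the failures described above, page_offset(&buf[err.index]) would come
out as 0x004 and err.expected ^ err.found would have a single bit set.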

I'm not sure if it's the case that the swapping system (Adaptec 1542 SCSI,
at the "normal" (5 MB/s, isn't it?) DMA speed) has problems, or that
the page-table-shuffling done by swapping is making the bad page jump
around and show up more.

If my test program, after finding an error, fixes it and then repeatedly
re-reads the 4 MB region around the error, the problem does *not* recur.
(Once, though, I hit a second instance at a different offset in the same
4 MB region.) I tried various region sizes, to see whether the error was
introduced from memory to secondary cache, from secondary cache to
primary cache (on the 486/66DX2 I'm running), or from primary cache to
registers. I've gone up to 32K repetitions, although I usually run with
a 1000x limit, using a "repe scasl" loop.

If I have it *not* fix the error, but instead cyclically read 32 MB (my
system memory size) of data to force everything to be swapped out, and
then check again, the error is usually gone. I didn't write the page, so
it should still be clean in swap space, and the swap-in should just
re-read the same old data, only to a different physical address.

I'd like to do two things to nail this down a little better.
One is to figure out the physical address of the error. Does
anyone know how to find the physical address corresponding to a
user-space virtual address? If it's always in a few physical
pages, I can just lock those out and be done with the problem.

The second is to check the swapping. I'd like to checksum a page
before swapping it out and validate the checksum when reading it
back in, but that's a bit tricky. There can be multiple processes
waiting for a block to come back in, some wanting to write, and I have
to check the checksum before the writes go through. (I'd rather get in
before the reads, even, so I can do ECC.) Does anyone know where
I can stick in such code? The code itself is not hard to write: I'd
add a 32-bit checksum field to the page structure, holding two 16-bit
CRCs, one in each half. One would be x^16+1, i.e. the XOR of all the
16-bit words. The other would be something irreducible. After checking
them, I could do ECC by error trapping: cycling the irreducible
CRC backwards until it matches the XOR pattern.
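
To make the scheme concrete, here is a rough user-space sketch of the
two checksums and the single-bit correction. It is not kernel code, and
the names (struct page_sum, page_checksum, page_verify_and_fix) are
invented for illustration. The polynomial x^16+x^12+x^3+x+1 is an
assumption on my part: I believe it is primitive, so x^16 has order
65535 and every word position in a 2048-word page gives a distinct
syndrome, but that should be verified before relying on it. The CRC uses
a zero initial value so that it is linear, i.e. the XOR of the stored
and recomputed CRCs depends only on the error pattern.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_WORDS 2048u   /* a 4 kB page viewed as 16-bit words */
#define POLY_LOW   0x100Bu /* x^16+x^12+x^3+x+1 minus the x^16 term;
                              assumed primitive (see above) */

struct page_sum {
    uint16_t xor_sum;      /* CRC mod x^16+1: XOR of all words */
    uint16_t crc;          /* CRC mod the polynomial above */
};

/* Multiply v by x^16 modulo the polynomial (one CRC word step). */
static uint16_t mul_x16(uint16_t v)
{
    for (int k = 0; k < 16; k++)
        v = (v & 0x8000u) ? (uint16_t)((v << 1) ^ POLY_LOW)
                          : (uint16_t)(v << 1);
    return v;
}

/* Divide v by x modulo the polynomial (one bit backwards). */
static uint16_t div_x(uint16_t v)
{
    if (v & 1u)
        return (uint16_t)(((v ^ POLY_LOW) >> 1) | 0x8000u);
    return (uint16_t)(v >> 1);
}

struct page_sum page_checksum(const uint16_t *page)
{
    struct page_sum s = { 0, 0 };

    assert(page != NULL);
    for (size_t i = 0; i < PAGE_WORDS; i++) {
        s.xor_sum ^= page[i];
        s.crc = mul_x16((uint16_t)(s.crc ^ page[i]));
    }
    return s;
}

/* Returns 0 if the page matches, 1 if a single-bit error was found
 * and corrected in place, -1 if the damage is not correctable. */
int page_verify_and_fix(uint16_t *page, struct page_sum stored)
{
    struct page_sum now = page_checksum(page);
    uint16_t dx = now.xor_sum ^ stored.xor_sum; /* bit within word */
    uint16_t dc = now.crc ^ stored.crc;         /* encodes position */

    if (dx == 0 && dc == 0)
        return 0;
    if (dx == 0 || (dx & (dx - 1)) != 0)
        return -1;              /* not a single flipped bit */

    /* Cycle the CRC syndrome backwards one word at a time until it
     * matches the XOR syndrome; the step count gives the word. */
    for (size_t j = 1; j <= PAGE_WORDS; j++) {
        for (int k = 0; k < 16; k++)
            dc = div_x(dc);
        if (dc == dx) {
            page[PAGE_WORDS - j] ^= dx;
            return 1;
        }
    }
    return -1;
}
```

A flipped bit in the stored checksum itself, or two flipped bits in the
page, falls out as "uncorrectable" rather than being mis-corrected in the
common cases, though a 16-bit CRC obviously can't promise that for every
multi-bit pattern.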

Any suggestions on this subject would definitely be appreciated.

-- 
	-Colin