Re: [PROPOSAL] Coping with random bit errors

Colin Plumb (colin@nyx.net)
Sat, 11 Oct 1997 04:16:01 -0600 (MDT)


This doesn't solve anything, since by the time the software has found the
problem, it's probably already caused errors. But it *does* let you
figure out where it's happening and look for patterns. A good error
detection code could tell you where the error is so you could point
the finger at a bad SIMM, general flakiness, or whatever. That's worth
keeping track of.

The idea is basically to get a bit more precision out of the kernel-rebuild
memory test, which seems to catch more errors than anything else.

I was thinking of adding a word to the page structure to hold a checksum
of that page, maintained for every read-only page in memory. (And a
checksum of pages swapped to disk would detect bad SCSI cables and so on.)

When the page is about to get un-read-protected, freed, or is otherwise
losing its read-only status, you do the checksum again and complain
if anything has changed.
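
A minimal user-space sketch of that record-then-verify cycle, in C. The
names (struct page_info, page_protect, page_unprotect) are made up for
illustration and are not the kernel's; the rotate-and-XOR sum is just a
placeholder for whatever code is actually chosen.

```c
#include <stddef.h>
#include <stdint.h>

#define PAGE_BYTES 4096u
#define PAGE_WORDS (PAGE_BYTES / sizeof(uint32_t))

/* Hypothetical stand-in for the per-page structure. */
struct page_info {
    uint32_t checksum;   /* the extra word proposed above */
    int      has_sum;    /* nonzero while checksum is valid */
};

/* Placeholder checksum: rotate left by one, then XOR in the next word.
 * Any single flipped bit changes exactly one bit of the result. */
uint32_t page_checksum(const uint32_t *page)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < PAGE_WORDS; i++) {
        sum = (sum << 1) | (sum >> 31);
        sum ^= page[i];
    }
    return sum;
}

/* Call when the page becomes read-only. */
void page_protect(struct page_info *pi, const uint32_t *page)
{
    pi->checksum = page_checksum(page);
    pi->has_sum = 1;
}

/* Call just before the page loses read-only status.
 * Returns 0 if the contents still match, -1 if they changed. */
int page_unprotect(struct page_info *pi, const uint32_t *page)
{
    int bad = pi->has_sum && page_checksum(page) != pi->checksum;
    pi->has_sum = 0;
    return bad ? -1 : 0;
}
```

A -1 from page_unprotect is the point at which you would log the page's
physical address, so that repeated hits can be matched to a SIMM.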

Locating single-bit errors is equivalent to correcting them, since
the repair obviously consists of just flipping the bit.
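
One simple way to get a locating code, sketched in C under the assumption
that exactly one bit has flipped: store the XOR of (index + 1) over all
set bits. A single flip at bit i changes that value by exactly i + 1, so
the difference between old and new values names the bit. This is only an
illustration with made-up function names; multi-bit errors can be
misreported as a single-bit error at the wrong position.

```c
#include <stddef.h>
#include <stdint.h>

#define PAGE_BYTES 4096u
#define PAGE_BITS  (PAGE_BYTES * 8u)    /* 32768 bits per page */

/* XOR together (index + 1) for every set bit in the page.
 * The values 1..32768 fit in 16 bits, so the result fits in one word. */
uint32_t page_syndrome(const uint8_t *page)
{
    uint32_t syn = 0;
    for (uint32_t bit = 0; bit < PAGE_BITS; bit++) {
        if (page[bit / 8] & (1u << (bit % 8)))
            syn ^= bit + 1;
    }
    return syn;
}

/* Compare a stored syndrome with the page's current one.
 * Returns -1 if they match, -2 if the difference cannot be a
 * single-bit error, otherwise the index of the flipped bit
 * (correct only if exactly one bit changed). */
long locate_single_bit_error(uint32_t stored, const uint8_t *page)
{
    uint32_t diff = stored ^ page_syndrome(page);
    if (diff == 0)
        return -1;
    if (diff > PAGE_BITS)
        return -2;
    return (long)diff - 1;
}

/* The repair: flip the located bit back. */
void flip_bit(uint8_t *page, long bit)
{
    page[bit / 8] ^= (uint8_t)(1u << (bit % 8));
}
```

The bit index also gives the byte offset within the page, which is what
you need to work out which chip the error came from.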

In a 4K page, there are 32K possible single-bit errors. There are about
2^29 possible double-bit errors, so that's analyzable with a particularly
good error control code, but I confess that I don't know how to do it
myself. You could also use a single-error-correcting 16-bit Reed-Solomon
code to locate any single 16-bit word in error, or there are other schemes.
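
The counts above can be checked with a few lines of C. This is only a
counting argument (how many syndrome bits are needed to give each case a
distinct value); it says nothing about whether a practical code exists or
how to decode it. The function names are made up for the example.

```c
#include <stdint.h>

#define PAGE_BITS 32768u   /* 4096 bytes * 8 */

/* Distinct pairs of flipped bits in an n-bit block: n choose 2. */
uint64_t double_bit_patterns(uint64_t n)
{
    return n * (n - 1) / 2;
}

/* Smallest r such that 2^r >= count, i.e. the fewest syndrome bits
 * that could give each of `count` cases a distinct value. */
unsigned min_syndrome_bits(uint64_t count)
{
    unsigned r = 0;
    while (r < 64 && ((uint64_t)1 << r) < count)
        r++;
    return r;
}
```

double_bit_patterns(PAGE_BITS) is 536,854,528, just under 2^29. Counting
"no error", every single-bit error and every double-bit error together
needs at least 30 syndrome bits, which still fits in a 32-bit word.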

To do this "perfectly", checksumming at the first possible moment and
verifying at the last, for maximum coverage, would definitely slow a
machine down (thrash the cache terribly!), but it would be a real boon
to folks with erratic memory problems.

And it could probably be adapted to a kind of background mode where the
idle task walks through pages, and if they're read-only, computes a checksum.
If they haven't been changed since the last time a checksum was computed,
but the checksum differs, we have memory corruption. In any case, write
the new checksum to the page structure.
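
A rough sketch of one pass of that background scan, in C. Everything
here is hypothetical: the struct page_info layout, the "dirtied" flag
(assumed to be set by whatever legitimately writes the page) and the
idle_scan name are invented for the example, and the checksum is a
placeholder.

```c
#include <stddef.h>
#include <stdint.h>

#define PAGE_WORDS 1024u   /* 4096-byte page as 32-bit words */

/* Hypothetical per-page bookkeeping. */
struct page_info {
    const uint32_t *data;
    int      read_only;
    int      dirtied;    /* legitimately written since last checksum */
    int      has_sum;    /* checksum field is valid */
    uint32_t checksum;
};

/* Placeholder checksum: rotate left by one, then XOR in the next word. */
uint32_t page_checksum(const uint32_t *page)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < PAGE_WORDS; i++) {
        sum = (sum << 1) | (sum >> 31);
        sum ^= page[i];
    }
    return sum;
}

/* One pass over all pages, as the idle task might do it.
 * Returns the number of pages whose checksum changed although the
 * page was not marked as written, i.e. suspected memory corruption. */
size_t idle_scan(struct page_info *pages, size_t npages)
{
    size_t corrupt = 0;
    for (size_t i = 0; i < npages; i++) {
        struct page_info *p = &pages[i];
        if (!p->read_only) {
            p->has_sum = 0;      /* writable: any old checksum is stale */
            continue;
        }
        uint32_t sum = page_checksum(p->data);
        if (p->has_sum && !p->dirtied && sum != p->checksum)
            corrupt++;
        /* In any case, record the new checksum. */
        p->checksum = sum;
        p->has_sum = 1;
        p->dirtied = 0;
    }
    return corrupt;
}
```

Because the new checksum is always stored, each corruption is reported
once rather than on every later pass.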

This gives less coverage, but has virtually no impact on system performance
(except for power consumption).

-- 
	-Colin