The idea is basically to get a bit more precision out of the kernel
rebuild memory test, which seems to catch more errors than anything else.
I was thinking of adding a word to the page structure holding a checksum,
computed for every read-only page in memory. (Checksumming pages
swapped to disk would also detect bad SCSI cables and so on.)
When the page is about to be un-read-protected, freed, or is otherwise
losing its read-only status, you compute the checksum again and complain
if anything has changed.
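A minimal userspace sketch of that transition logic, assuming a
hypothetical per-page record (struct page_info) and a toy hash; a real
kernel would hang the checksum word off its existing page structure:

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

#define PAGE_SIZE 4096

/* Hypothetical per-page record; stands in for a word added to
 * the kernel's page structure. */
struct page_info {
    uint32_t checksum;
    int checksum_valid;
};

/* Toy multiplicative checksum over a page; any fast hash would do. */
static uint32_t page_checksum(const uint8_t *page)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < PAGE_SIZE; i++)
        sum = sum * 31 + page[i];
    return sum;
}

/* Called when the page becomes read-only: record its checksum. */
void page_set_readonly(struct page_info *pi, const uint8_t *page)
{
    pi->checksum = page_checksum(page);
    pi->checksum_valid = 1;
}

/* Called when the page is about to lose read-only status:
 * recompute and complain on a mismatch. Returns -1 on corruption. */
int page_clear_readonly(struct page_info *pi, const uint8_t *page)
{
    int bad = pi->checksum_valid && page_checksum(page) != pi->checksum;
    if (bad)
        fprintf(stderr, "memory corruption detected in page\n");
    pi->checksum_valid = 0;
    return bad ? -1 : 0;
}
```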
Locating single-bit errors is equivalent to correcting them, since
the repair obviously consists of just flipping the bit.
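One simple way to get that locating power (a sketch, not necessarily
what I'd ship): make the checksum the XOR of the indices of all set
bits in the page. A single flipped bit changes the checksum by exactly
that bit's index, so old XOR new names the bit to flip back. Indices
are 1-based here so that a flip of bit 0 still changes the sum; two or
more flips would XOR together and confuse it, so this handles the
single-bit case only:

```c
#include <stdint.h>

#define PAGE_BITS 32768  /* bits in a 4K page */

/* XOR of the (1-based) indices of all set bits in the page.
 * 1-based so a flip of bit 0 still perturbs the checksum. */
static uint32_t bit_index_checksum(const uint8_t *page)
{
    uint32_t sum = 0;
    for (uint32_t i = 0; i < PAGE_BITS; i++)
        if (page[i / 8] & (1u << (i % 8)))
            sum ^= i + 1;
    return sum;
}

/* Given checksums before and after a single-bit flip, return the
 * 0-based index of the flipped bit. */
uint32_t locate_flipped_bit(uint32_t before, uint32_t after)
{
    return (before ^ after) - 1;
}
```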
In a 4K page, there are 32K possible single-bit errors. There are about
2^29 possible double-bit errors, so those are addressable with a
particularly good error control code, but I confess that I don't know
how to construct one myself. You could also use a single-error-correcting
Reed-Solomon code over 16-bit symbols to identify any one 16-bit word
in error, or there are other schemes.
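Checking the arithmetic behind those counts: a 4K page holds 32768
bits, so there are 32768 single-bit patterns and C(32768, 2) =
536,854,528 double-bit patterns, just under 2^29 = 536,870,912:

```c
#include <stdint.h>

#define PAGE_BITS (4096 * 8)  /* bits in a 4K page */

/* Number of distinct single-bit error patterns: one per bit. */
uint64_t single_bit_errors(void)
{
    return PAGE_BITS;
}

/* Number of distinct double-bit error patterns: n choose 2. */
uint64_t double_bit_errors(void)
{
    return (uint64_t)PAGE_BITS * (PAGE_BITS - 1) / 2;
}
```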
To do this "perfectly", checksumming at the first possible moment and
verifying at the last, for maximum coverage, would definitely slow a
machine down (thrash the cache terribly!), but it would be a real boon
to folks with erratic memory problems.
And it could probably be adapted to a kind of background mode where the
idle task walks through pages, and if they're read-only, computes a checksum.
If they haven't been changed since the last time a checksum was computed,
but the checksum differs, we have memory corruption. In any case, write
the new checksum to the page structure.
This gives less coverage, but has virtually no impact on system performance
(except for power consumption).
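The background mode above can be simulated in userspace; the
struct names and the read_only flag here are hypothetical stand-ins
for whatever page state the kernel already tracks:

```c
#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE 4096

/* Hypothetical per-page state for the background scanner. */
struct scan_page {
    uint8_t data[PAGE_SIZE];
    int read_only;       /* not written since the last scan */
    uint32_t checksum;
    int checksum_valid;
};

/* Toy checksum; any fast hash would do. */
static uint32_t scan_checksum(const uint8_t *p)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < PAGE_SIZE; i++)
        sum = sum * 31 + p[i];
    return sum;
}

/* One pass of the idle-task walk. Returns the number of pages whose
 * contents changed with no intervening write -- i.e. corruption.
 * In any case the new checksum is written back to the page record. */
int idle_scan(struct scan_page *pages, int n)
{
    int corrupt = 0;
    for (int i = 0; i < n; i++) {
        if (!pages[i].read_only)
            continue;
        uint32_t sum = scan_checksum(pages[i].data);
        if (pages[i].checksum_valid && sum != pages[i].checksum)
            corrupt++;               /* unchanged page, new checksum */
        pages[i].checksum = sum;
        pages[i].checksum_valid = 1;
    }
    return corrupt;
}
```

The first pass just records checksums; corruption only shows up on a
later pass, which is why this mode trades coverage for near-zero cost.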
-- -Colin