Re: [PROPOSAL] Coping with random bit errors

becka@rz.uni-duesseldorf.de
Sat, 11 Oct 1997 11:49:01 +0100 (MET)


> > > level of protection against random bit errors. I shudder to think what
> > > other bit errors have crept into my source tree which don't prevent
> > > compiling :-(
> > > Anyway, I'd like to get some reaction from those who know more about
> > > the page cache implementation as to what they think of this idea?
Well - I wouldn't call it "protection", but it is indeed a nice idea to
_detect_ faulty memory. I have seen lots of RAM modules pass a so-called
(HW) RAM tester and then fail in an actual system.

As most of these problems are not 100% reproducible, but depend on timing,
temperature, ... it would be nice to have a kernel thread(?) doing this. I
wouldn't run it normally, but it would be a good thing to switch on when I
suspect an error in my RAM.

> > The main problem with it, from a technical standpoint, is that unlike
> > ECC all you know is that a page was corrupted, so you have to throw it
> > out. If it was dirty, or in use, what do you do?
If it is dirty, you can't detect the error at all, as the page might just
as well have been changed by a legitimate write operation. No checksum
compare will work there.
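
To make that concrete, here is a rough userland sketch (my own
illustration, nothing from the actual patch - all names are made up) of
why the checksum is only meaningful while a page stays clean:

/* Rough sketch -- hypothetical names, not the proposed patch. The
   checksum is only stored and compared while the page is clean; a
   dirty page may have been changed legitimately, so a mismatch
   there would mean nothing. */
#include <stdio.h>

#define PAGE_SIZE 4096

struct page {
    unsigned char data[PAGE_SIZE];
    unsigned long checksum;
    int dirty;
};

/* trivial checksum -- a real patch would pick something stronger */
static unsigned long csum(const unsigned char *p, int len)
{
    unsigned long sum = 0;
    while (len--)
        sum = sum * 31 + *p++;
    return sum;
}

/* called after read-in or after write-back completes */
static void page_make_clean(struct page *pg)
{
    pg->checksum = csum(pg->data, PAGE_SIZE);
    pg->dirty = 0;
}

static void page_modify(struct page *pg, int off, unsigned char val)
{
    pg->data[off] = val;
    pg->dirty = 1;              /* checksum is stale from now on */
}

/* 1 = a clean page no longer matches its checksum, i.e. a bit
   flipped behind our back; 0 = ok or undecidable (dirty) */
static int page_check(const struct page *pg)
{
    if (pg->dirty)
        return 0;   /* can't tell corruption from a legal write */
    return csum(pg->data, PAGE_SIZE) != pg->checksum;
}

int main(void)
{
    struct page pg = { {0}, 0, 0 };

    page_make_clean(&pg);        /* pretend we just read it in */
    pg.data[123] ^= 0x04;        /* simulate a random bit error */
    printf("clean page: %s\n",
           page_check(&pg) ? "bit error detected" : "looks ok");

    page_modify(&pg, 123, 0);    /* a legal write dirties the page */
    printf("dirty page: %s\n",
           page_check(&pg) ? "bit error detected" : "undecidable");
    return 0;
}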

I wouldn't view this as some kind of ECC, but rather as something like
parity RAM: you get notified that you have bad RAM, and at what address,
so you can investigate further.

If someone writes such a thing, I'd recommend adding another feature to
improve the detection rate if desired: give the module a parameter that
changes the DRAM refresh rate. This trick is used in the best SW RAM test
I know of to check for "weak" bits. These are commonly caused by a DRAM
cell discharging too fast, so they show up more often when the refresh is
slowed down.
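
For illustration only (again my sketch, not part of the proposal): on
classic ISA-style PCs the refresh request is generated by channel 1 of
the 8253/8254 timer, so slowing the refresh from userland could look
roughly like this. It assumes an i386 Linux box, needs root for ioperm(),
and divisor 18 (~15us between refresh requests) is the usual BIOS default:

/* Rough sketch -- my illustration, not the proposed module.
   PIT channel 1 (port 0x41) drives DRAM refresh on ISA PCs; a
   larger divisor stretches the refresh interval so weak bits
   get a chance to leak away. */
#include <stdio.h>
#include <stdlib.h>
#include <sys/io.h>

int main(int argc, char **argv)
{
    /* BIOS default is 18 (~15us between refresh requests) */
    unsigned div = (argc > 1) ? (unsigned)atoi(argv[1]) : 18;

    if (div < 1 || div > 255) {
        fprintf(stderr, "divisor must be 1..255\n");
        return 1;
    }
    if (ioperm(0x40, 4, 1)) {   /* PIT ports 0x40-0x43 */
        perror("ioperm");
        return 1;
    }
    outb(0x54, 0x43);           /* channel 1, lobyte only, mode 2 */
    outb(div, 0x41);            /* new refresh divisor */
    printf("DRAM refresh divisor set to %u\n", div);
    return 0;
}

Be careful with big divisors, though - stretch the refresh too far and
even a perfectly good machine will start dropping bits.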

CU,Andy

-- 
Andreas Beck              |  Email :  <becka@sunserver1.rz.uni-duesseldorf.de>