Re: marching through all physical memory in software

From: Henrique de Moraes Holschuh
Date: Sat Jan 31 2009 - 08:43:43 EST


On Sat, 31 Jan 2009, Tim Small wrote:
> Eric W. Biederman wrote:
> > At the point we are talking about software scrubbing it makes sense to assume
> > a least common denominator memory controller, one that does not do automatic
> > write-back of the corrected value, as all of the recent memory controllers
> > do scrubbing in hardware.
> >
>
> I was just trying to clarify the distinction between the two processes
> which have similar names, but aren't (IMO) actually that similar:
>
> "Software Scrubbing"
>
> Triggering a read, and subsequent rewrite of a particular RAM location
> which has suffered a correctable ECC error(s) i.e. hardware detects an
> error, then the OS takes care of the rewrite to "scrub" the error in the
> case that the hardware doesn't handle this automatically.
>
> This should be a very-occasional error-path process, and performance is
> probably not critical..
>
>
> "Background Scrubbing"
>
> This is a poor name, IMO (scrub implies some kind of write to me),
> which applies to a process whereby you ensure that the ECC check-bits
> are verified periodically for the whole of physical RAM, so that single
> bit errors in a given ECC block don't accumulate and turn into
> uncorrectable errors. It may also lead to improved data collection for
> some failure modes. Again, many memory controllers implement this
> feature in hardware, so we shouldn't do it twice where this is supported.

It is implied in background scrubbing that if a background scrub
page read causes a correctable ECC error to be flagged, the normal
"fix through scrub" behaviour of the memory controller will be
triggered (or, possibly, the software scrubbing described above).

And if an uncorrectable error is detected during the scrub, we have to
do something about it as well. And that won't be that easy: locate
whatever process is using that page, and do something smart to it...
or take some emergency evasive action if it is one of the kernel's data
structures, etc.

So, as you said, "background scrubbing" and "software scrubbing" really are
very different things, and one has to expect that background scrubbing will
eventually trigger software scrubbing, major system emergency handling
(uncorrectable errors in kernel memory) or minor system emergency
handling (uncorrectable errors in process memory).
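
To make the three outcomes concrete, here is a minimal sketch in C of
the decision a background scrubber would have to make after reading a
page. All names (ecc_status, scrub_action, scrub_decide) are made up
for illustration and are not existing kernel interfaces; how the ECC
status is actually obtained from the memory controller is left out.

```c
#include <stdbool.h>

/* Hypothetical: what the hardware reported for one scrub read. */
enum ecc_status { ECC_CLEAN, ECC_CORRECTABLE, ECC_UNCORRECTABLE };

/* Hypothetical: what the scrubber has to do next. */
enum scrub_action {
    ACT_NONE,             /* nothing was flagged */
    ACT_SOFTWARE_SCRUB,   /* rewrite the location to clear the error */
    ACT_MINOR_EMERGENCY,  /* uncorrectable error in process memory */
    ACT_MAJOR_EMERGENCY   /* uncorrectable error in kernel memory */
};

/* Map the result of one scrub read to the handling described above. */
enum scrub_action scrub_decide(enum ecc_status status, bool kernel_page)
{
    switch (status) {
    case ECC_CORRECTABLE:
        return ACT_SOFTWARE_SCRUB;
    case ECC_UNCORRECTABLE:
        return kernel_page ? ACT_MAJOR_EMERGENCY : ACT_MINOR_EMERGENCY;
    case ECC_CLEAN:
    default:
        return ACT_NONE;
    }
}
```

On hardware that writes back the corrected value by itself, the
ACT_SOFTWARE_SCRUB case would presumably reduce to logging only.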

> There is (AFAIK) no need to do any writes here, and in fact doing so is

One might want the possibility of doing unconditional writes, because
it helps with memory bitrot on crappy hardware where the refresh
cycles aren't enough to avoid bitrot. But you definitely won't want
it most of the time.
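
As a rough illustration of the difference between a read-only pass and
one with unconditional writes, here is a user-space sketch in C. The
function name scrub_region and its rewrite flag are hypothetical. It
ignores two things a real implementation would have to deal with: the
read-then-write is not atomic, so a concurrent store to the same word
could be lost, and the accesses may be satisfied from the CPU cache
instead of reaching RAM.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Walk a region one 64-bit word at a time.  The read is what gives the
 * memory controller a chance to check ECC; with rewrite set, the same
 * value is stored back unconditionally.  volatile keeps the compiler
 * from optimizing the accesses away.  Returns the number of words
 * visited.
 */
size_t scrub_region(volatile uint64_t *base, size_t nwords, bool rewrite)
{
    size_t i;

    for (i = 0; i < nwords; i++) {
        uint64_t v = base[i];   /* read-only scrub stops here */

        if (rewrite)
            base[i] = v;        /* unconditional write-back */
    }
    return i;
}
```

Making the write-back optional keeps the common case read-only while
leaving the rewrite available for the hardware that needs it.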

You can also implement software-based ECC using a background scrubber
and setting aside pages to store the ECC information. Now, THAT is
probably not worth bothering with due to the performance impact, but
who knows...
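
For the curious, a toy version of the idea could look like the C sketch
below. It is only one possible scheme and rests on several assumptions
of mine: a 4096-byte page, one small check record per page kept in the
set-aside area, and check records that are themselves never corrupted.
The record holds the XOR of the (1-based) positions of all set bits,
which is the usual Hamming-style syndrome, plus an overall parity bit,
so one flipped bit per page can be corrected and two can be detected.
The names page_check, compute_check and verify_and_correct are made up.

```c
#include <stddef.h>
#include <stdint.h>

#define PAGE_BYTES 4096u

/* Check data for one protected page, stored in a set-aside page. */
struct page_check {
    uint32_t syndrome;  /* XOR of (bit index + 1) over all set bits */
    uint8_t  parity;    /* number of set bits, modulo 2 */
};

/* Compute the check record for the current contents of a page. */
struct page_check compute_check(const uint8_t *page)
{
    struct page_check c = {0, 0};
    uint32_t i;

    for (i = 0; i < PAGE_BYTES * 8; i++) {
        if (page[i / 8] & (1u << (i % 8))) {
            c.syndrome ^= i + 1;
            c.parity ^= 1;
        }
    }
    return c;
}

/*
 * Compare a page against its stored check record.
 * Returns 0 if clean, 1 if a single-bit error was corrected in place,
 * and -1 if the damage cannot be corrected by this scheme.
 */
int verify_and_correct(uint8_t *page, const struct page_check *stored)
{
    struct page_check now = compute_check(page);
    uint32_t diff = now.syndrome ^ stored->syndrome;
    uint8_t pdiff = now.parity ^ stored->parity;

    if (diff == 0 && pdiff == 0)
        return 0;

    /* One flipped bit: parity changed and diff is its position + 1. */
    if (pdiff == 1 && diff >= 1 && diff <= PAGE_BYTES * 8) {
        uint32_t bit = diff - 1;

        page[bit / 8] ^= (uint8_t)(1u << (bit % 8));
        return 1;
    }
    return -1;
}
```

Every write to a protected page would also have to update its check
record, and every scrub pass has to recompute it bit by bit, which
gives an idea of where the performance impact comes from.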

--
"One disk to rule them all, One disk to find them. One disk to bring
them all and in the darkness grind them. In the Land of Redmond
where the shadows lie." -- The Silicon Valley Tarot
Henrique Holschuh