ECC and DMA to/from disk controllers
From: Bruce Allen
Date: Mon Sep 10 2007 - 09:26:39 EST
Dear LKML,
Apologies in advance for potential mis-use of LKML, but I don't know where
else to ask.
An ongoing study on datasets of several Petabytes have shown that there
can be 'silent data corruption' at rates much larger than one might
naively expect from the expected error rates in RAID arrays and the
expected probability of single bit uncorrected errors in hard disks.
The origin of this data corruption is still unknown. See for example
http://cern.ch/Peter.Kelemen/talk/2007/kelemen-2007-C5-Silent_Corruptions.pdf
In thinking about this, I began to wonder about the following. Suppose
that a (possibly RAID) disk controller correctly reads data from disk and
has correct data in the controller memory and buffers. However when that
data is DMA'd into system memory some errors occur (cosmic rays,
electrical noise, etc). Am I correct that these errors would NOT be
detected, even on a 'reliable' server with ECC memory? In other words the
ECC bits would be calculated in server memory based on incorrect data from
the disk.
The alternative is that disk controllers (or at least ones that are meant
to be reliable) DMA both the data AND the ECC byte into system memory.
So that if an error occurs in this transfer, then it would most likely be
picked up and corrected by the ECC mechanism. But I don't think that
'this is how it works'. Could someone knowledgable please confirm or
contradict?
Cheers,
Bruce
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/