All,
On the mdraid list, there was a recent thread about using raid
functionality to detect / repair silent corruption.
The issues brought up were that a lot of silent data corruption occurs
when cables, controllers, power supplies, ram, cache, etc. goes bad.
It made me think about another option for detecting silent corruption
I have not seen discussed, but maybe I missed it.
Aiui, the ATA spec allows for the reading of a long sector as well as
the normal 512 byte sector. When you get a long sector you also get
the CRC (or whatever checksum data there is on the disk that allows
the drive itself to detect media errors).
I don't have any idea how easy or hard it would be to do, but I would
like to see the entire block subsystem enhanced to optionally allow
long sector reads to be used in a "paranoid" fashion.
Effectively it would be:
1) Read long sector from drive: verify CRC in kernel. This tests
most everything on the i/o path.
2) maintain CRC type information in block subsystem. Verify no
corruption just before handing off to userspace. This would
potentially identify CPU/cache/RAM failures.