Re: limits on raid

From: Martin K. Petersen
Date: Thu Jun 21 2007 - 14:31:30 EST


>>>>> "Mattias" == Mattias Wadenstein <maswan@xxxxxxxxxx> writes:

Mattias> In theory, that's how storage should work. In practice,
Mattias> silent data corruption does happen. If not from the disks
Mattias> themselves, somewhere along the path of cables, controllers,
Mattias> drivers, buses, etc. If you add in fcal, you'll get even more
Mattias> sources of failure, but usually you can avoid SANs (if you
Mattias> care about your data).

Oracle cares a lot about people's data 8). And we've seen many cases
of silent data corruption. Often the problem goes unnoticed for
months. And by the time you find out about it you may have gone
through your backup cycle so the data is simply lost.

The Oracle database in combination with certain high-end arrays
supports a technology called HARD (Hardware Assisted Resilient Data)
which allows the array front end to verify the integrity of an I/O
before committing it to disk. The downside to HARD is that it's
proprietary and only really high-end customers use it (many
enterprises actually mandate HARD).

A couple of years ago some changes started to trickle into the SCSI
Block Commands spec. And as some of you know I've been working on
implementing support for this Data Integrity Field in Linux.

What DIF allows you to do is to attach some integrity metadata to an
I/O. We can attach this metadata all the way up in the userland
application context where the risk of corruption is relatively small.
The metadata passes all the way through the I/O stack, gets verified
by the HBA firmware, through the fabric, gets verified by the array
front end and finally again by the disk drive before the change is
committed to platter. Any discrepancy will cause the I/O to be
failed. And thanks to the intermediate checks you also get fault
isolation.

The DIF integrity metadata contains a CRC of the data block as well as
a reference tag that (for Type 1) needs to match the target sector on
disk. This way the common problem of misdirected writes can be
alleviated.

Initially, DIF is going to be offered in the FC/SAS space. But I
encourage everybody to lean on their SATA drive manufacturer of choice
and encourage them to provide a similar functionality on consumer or
at the very least nearline drives.


Note there's a difference between FS checksums and DIF. Filesystem
checksums (plug: http://oss.oracle.com/projects/btrfs/) allows the
filesystem to detect that it read something bad. And as discussed
earlier we can potentially retry the read from another mirror or
reconstruct in the case of RAID5/6.

DIF, however, is a proactive technology. It prevents bad stuff from
being written to disk in the first place. You'll know right away when
corruption happens, not 4 months later when you try to read the data
back.

So DIF and filesystem checksumming go hand in hand in preventing data
corruption...

--
Martin K. Petersen Oracle Linux Engineering

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/