Re: the " 'official' point of view" expressed by kernelnewbies.org regarding reiser4 inclusion
From: David Masover
Date: Tue Aug 01 2006 - 13:38:31 EST
Alan Cox wrote:
On Tue, 2006-08-01 at 11:44 -0500, David Masover wrote:
Wait, what? Disks, at least, would be protected by RAID. Are you
telling me RAID won't detect such an error?
RAID deals with the case where a device fails. RAID 1 with 2 disks can
in theory detect an internal inconsistency but cannot fix it.
Still, if it does that, that should be enough. The scary part wasn't
that there's an internal inconsistency, but that you wouldn't know.
And it can fix it if you can figure out which disk went. Or give it 3
disks and it should be entirely automatic -- admin gets paged, admin
hotswaps in a new disk, done.
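To make the 2-disk versus 3-disk distinction concrete, here is a minimal sketch of comparing the same block as read back from each mirror. It is only an illustration of the voting idea: check_mirrors is a hypothetical helper, and this is not a description of how any actual RAID implementation decides which copy to trust.

```python
from collections import Counter


def check_mirrors(copies):
    """Compare one block as read from each mirror (hypothetical helper).

    Returns (status, data):
      "ok"       - every copy agrees
      "repaired" - a strict majority agrees; data is the majority value
      "mismatch" - copies disagree and there is no majority to trust
    """
    if not copies:
        raise ValueError("need at least one copy")
    value, votes = Counter(copies).most_common(1)[0]
    if votes == len(copies):
        return "ok", value
    if votes > len(copies) / 2:
        return "repaired", value
    return "mismatch", None


# Two mirrors can notice a disagreement but cannot tell which copy is good.
print(check_mirrors([b"good", b"bad!"]))
# With three mirrors, the two matching copies outvote the odd one out.
print(check_mirrors([b"good", b"good", b"bad!"]))
```

With two copies the only safe outcome on disagreement is to report it, which is the "detect but not fix" case; the third copy is what turns detection into repair.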
we're OK with that, so long as our filesystems are robust enough. If
it's an _undetected_ error, doesn't that cause way more problems
(impossible problems) than FS corruption? Ok, your FS is fine -- but
now your bank database shows $1k less on random accounts -- is that ok?
Not really no. Your bank is probably using a machine (hopefully using a
machine) with ECC memory, ECC cache and the like. The UDMA and SATA
storage subsystems use CRC checksums between the controller and the
device. SCSI uses various similar systems - some older ones just use a
parity bit so have only a 50/50 chance of noticing a bit error.
Similarly the media itself is recorded with a lot of FEC (forward error
correction) so will spot most changes.
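A small sketch of why a lone parity bit is so much weaker than a CRC: parity changes whenever an odd number of bits flip, and is blind to any even number of flips. The helpers parity_bit and flip_bits are made up for this illustration, and zlib's CRC-32 stands in for "a CRC"; it is not claimed to be the polynomial that any particular bus uses.

```python
import zlib


def parity_bit(data: bytes) -> int:
    """Return 1 if the number of 1-bits in data is odd, else 0."""
    return sum(bin(byte).count("1") for byte in data) % 2


def flip_bits(data: bytes, positions):
    """Return a copy of data with the bits at the given positions inverted."""
    buf = bytearray(data)
    for pos in positions:
        buf[pos // 8] ^= 1 << (pos % 8)
    return bytes(buf)


original = b"reiser4"
one_flip = flip_bits(original, [3])
two_flips = flip_bits(original, [3, 10])

# A single flipped bit changes the parity, so it is noticed.
print(parity_bit(original) != parity_bit(one_flip))
# Two flipped bits cancel out in the parity and go unnoticed.
print(parity_bit(original) == parity_bit(two_flips))
# A CRC over the same data still differs after the two-bit error.
print(zlib.crc32(original) != zlib.crc32(two_flips))
```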
Unfortunately when you throw this lot together with astronomical amounts
of data you get burned now and then, especially as most systems are not
using ECC ram, do not have ECC on the CPU registers and may not even
have ECC on the caches in the disks.
It seems like this is the place to fix it, not the software. If the
software can fix it easily, great. But I'd much rather rely on the
hardware looking after itself, because when hardware goes bad, all bets are off.
Specifically, it seems like you do mention lots of hardware solutions,
that just aren't always used. It seems like storage itself is getting
cheap enough that it's time to step back a year or two in Moore's Law to
get the reliability.
The sort of changes this needs hit the block layer and every fs.
Seems it would need to hit every application also...
Depending how far you propagate it. Some people working with huge
data sets already write and check user level CRC values for this reason
(in fact bitkeeper does it for one example). It should be relatively
cheap to get much of that benefit without doing application to
application just as TCP gets most of its benefit without going app to app.
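For reference, a user-level CRC of the kind described can be as small as the sketch below: append a CRC-32 to each record on write and verify it on read. The record layout and the names pack_record and unpack_record are assumptions made up for this example; this is not bitkeeper's actual format.

```python
import struct
import zlib


def pack_record(payload: bytes) -> bytes:
    """Append a little-endian CRC-32 of the payload (hypothetical layout)."""
    return payload + struct.pack("<I", zlib.crc32(payload))


def unpack_record(record: bytes) -> bytes:
    """Verify the trailing CRC-32 and return the payload, or raise ValueError."""
    if len(record) < 4:
        raise ValueError("record too short to hold a CRC")
    payload = record[:-4]
    (stored,) = struct.unpack("<I", record[-4:])
    if zlib.crc32(payload) != stored:
        raise ValueError("CRC mismatch: data was corrupted somewhere below us")
    return payload


record = pack_record(b"balance=1000")
print(unpack_record(record))

# Simulate a silent one-bit corruption between write and read.
damaged = bytes([record[0] ^ 0x01]) + record[1:]
try:
    unpack_record(damaged)
except ValueError as err:
    print("detected:", err)
```

Note that this only detects the corruption. Recovering from it still needs a second copy from somewhere, which is the redundancy half of the argument.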
And yet, if you can do that, I'd suspect you can, should, must do it at
a lower level than the FS. Again, FS robustness is good, but if the
disk itself is going, what good is having your directory (mostly) intact
if the files themselves have random corruptions?
If you can't trust the disk, you need more than just an FS which can
mostly survive hardware failure. You also need the FS itself (or maybe
the block layer) to support bad block relocation and all that good
stuff, or you need your apps designed to do that job by themselves.
It just doesn't make sense to me to do this at the FS level. You
mention TCP -- ok, but if TCP is doing its job, I shouldn't also need to
implement checksums and other robustness at the protocol layer (http,
ftp, ssh), should I? Because in this analogy, it looks like TCP is the
"block layer" and a protocol is the "fs".
As I understand it, TCP only lets the protocol/application know when
something's seriously FUBARed and it has to drop the connection.
Similarly, the FS (and the apps) shouldn't have to know about hardware
problems until it really can't do anything about it anymore, at which
point the right thing to do is for the FS and apps to go "oh shit" and
drop what they're doing, and the admin replaces hardware and restores
from backup. Or brings a backup server online, or...
I guess my main point was that _undetected_ problems are serious, but if
you can detect them, and you have at least a bit of redundancy, you
should be good. For instance, if your RAID reports errors that it can't
fix, you bring that server down and let the backup server run.