Re: [patch] ext2/3: document conditions when reliable operation is possible

From: Pavel Machek
Date: Mon Aug 24 2009 - 16:52:27 EST


Hi!

>> Yep, and at that point you lost data. You had "silent data corruption"
>> from fs point of view, and that's bad.
>>
>> It will probably be very bad on XFS, probably okay on Ext3, and
>> certainly okay on Ext2: you do a filesystem check, and you should be
>> able to repair any damage. So yes, physical journaling is good, but
>> fsck is better.
>
> I don't see why you think that. In general, fsck (for any fs) only
> checks metadata. If you have silent data corruption that corrupts things
> that are fixable by fsck, you most likely have silent corruption hitting
> things users care about like their data blocks inside of files. Fsck
> will not fix (or notice) any of that, that is where things like full
> data checksums can help.

Ok, but in case of data corruption, at least your filesystem does not
degrade further.
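
To make the quoted point about data checksums concrete, here is a
minimal sketch in Python. It is not how ext2/ext3 or any other in-tree
filesystem works; store_block(), load_block(), the in-memory "disk"
dict and the 4096-byte block size are all made up for illustration. It
only shows that a per-block checksum kept apart from the data can
notice corruption inside a file's data block, which a metadata-only
fsck never looks at.

```python
import zlib

BLOCK_SIZE = 4096  # assumed block size, for illustration only

def store_block(disk, checksums, blockno, data):
    """Write a data block and record its CRC32 separately."""
    disk[blockno] = bytes(data)
    checksums[blockno] = zlib.crc32(data)

def load_block(disk, checksums, blockno):
    """Return the block, or raise IOError if it no longer matches its CRC."""
    data = disk[blockno]
    if zlib.crc32(data) != checksums[blockno]:
        raise IOError("checksum mismatch in data block %d" % blockno)
    return data

disk, checksums = {}, {}
store_block(disk, checksums, 7, b"A" * BLOCK_SIZE)

# Simulate silent corruption: one byte changes behind the filesystem's back.
corrupted = bytearray(disk[7])
corrupted[100] ^= 0xFF
disk[7] = bytes(corrupted)

try:
    load_block(disk, checksums, 7)
    detected = False
except IOError:
    detected = True
```

The checksum only detects the damage; getting the data back still
needs a second copy (mirror, backup) somewhere.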

>> If those filesystem assumptions were not documented, I'd call it
>> filesystem bug. So better document them ;-).
>>
> I think that we need to help people understand the full spectrum of data
> concerns, starting with reasonable best practices that will help most
> people suffer *less* (not no) data loss. And make very sure that they
> are not falsely assured that by following any specific script they
> can skip backups, remote backups, etc :-)
>
> Nothing in our code in any part of the kernel deals well with every
> disaster or odd event.

I can reproduce data loss with ext3 on a flash card in about 40
seconds. I'd not call that an "odd event". It would be nice to handle
that, but that is hard. So ... can we at least get that documented,
please?


>> Actually, ext2 should be able to survive that, no? Error writing ->
>> remount ro -> fsck on next boot -> drive relocates the sectors.
>>
>
> I think that the example and the response are both off base. If your
> head ever touches the platter, you won't be reading from a huge part of
> your drive ever again (usually, you have 2 heads per platter, 3-4
> platters, impact would kill one head and a corresponding percentage of
> your data).

Ok, that's obviously game over.

>>> It's for this reason that I've never been completely sure how useful
>>> Pavel's proposed treatise about file system expectations really is
>>> --- because all storage subsystems *usually* provide these guarantees,
>>> but it is the very rare storage system that *always* provides these
>>> guarantees.
>>
>> Well... there's a very big difference between hard drives and flash
>> memory. Hard drives usually work, and flash memory never does.
>
> It is hard for anyone to see the real data without looking in detail at
> large numbers of parts. Back at EMC, we looked at failures for lots of
> parts so we got a clear grasp on trends. I do agree that flash/SSD
> parts are still very young so we will have interesting and unexpected
> failure modes to learn to deal with....

_Maybe_ SSDs, being HDD replacements, are better. I don't know.

_All_ flash cards (MMC, USB, SD) had the problems. You don't need to
get a clear grasp on trends. Those cards just don't meet ext3
expectations, and if you pull them, you get data loss.

>>> We could just as easily have several kilobytes of explanation in
>>> Documentation/* explaining how we assume that DRAM always returns the
>>> same value that was stored in it previously --- and yet most PC class
>>> hardware still does not use ECC memory, and cosmic rays are a reality.
>>> That means that most Linux systems run on systems that are vulnerable
>>> to this kind of failure --- and the world hasn't ended.

>> There's a difference. In case of cosmic rays, hardware is clearly
>> buggy. I have one machine with bad DRAM (about 1 error in 2 days),
>> and I still use it. I will not complain if ext3 trashes that.
>>
>> In case of degraded raid-5, even with perfect hardware, and with
>> ext3 on top of that, you'll get silent data corruption. Nice, eh?
>>
>> Clearly, Linux is buggy there. It could be argued it is raid-5's
>> fault, or maybe it is ext3's fault, but... linux is still buggy.
>
> Nothing is perfect. It is still a trade off between storage utilization
> (how much storage we give users for say 5 2TB drives), performance and
> costs (throw away any disks over 2 years old?).

"Nothing is perfect"?! That's a design decision/problem in raid5/ext3. I
believe that should at least be documented. (And I understand why ZFS is
an interesting thing.)

>> Ext3 is unsuitable for flash cards and RAID arrays, plain and
>> simple. It is not documented anywhere :-(. [ext2 should work better --
>> at least you'll not get silent data corruption.]
>
> ext3 is used on lots of raid arrays without any issue.

And I still use my zaurus with crappy DRAM.

I would not trust a raid5 array with my data, for multiple
reasons. The fact that degraded raid5 breaks ext3 assumptions should
really be documented.
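
A toy sketch of the degraded raid5 problem, in Python. This is not the
md driver; it assumes one stripe of a 4-disk array (three data blocks
plus XOR parity), that the data write reaches the disk before the
parity write, and that nothing else protects the stripe. The block
contents and variable names are invented. The point is that after an
interrupted write to d0, the reconstructed d2 comes back as garbage
even though nobody wrote to d2.

```python
def xor(*blocks):
    """Bytewise XOR of equal-length blocks."""
    out = bytearray(len(blocks[0]))
    for b in blocks:
        for i, byte in enumerate(b):
            out[i] ^= byte
    return bytes(out)

# One stripe of a 4-disk array: three data blocks plus parity.
d0, d1, d2 = b"journal!", b"inode tb", b"old data"
parity = xor(d0, d1, d2)

# The disk holding d2 dies: the array is degraded, and d2 now exists
# only implicitly, as d0 ^ d1 ^ parity.
assert xor(d0, d1, parity) == d2

# Now d0 is rewritten. Both d0 and parity must change, but power fails
# after the data write and before the parity write (assumed ordering).
new_d0 = b"JOURNAL?"
d0_on_disk = new_d0      # new data made it to disk
parity_on_disk = parity  # parity is stale

# After reboot, d2 is reconstructed from what is on the surviving disks.
d2_after_crash = xor(d0_on_disk, d1, parity_on_disk)
d2_damaged = d2_after_crash != d2
```

The filesystem only asked for d0 to be written, yet d2 changed. A
journal replay rewrites d0, so d0 itself is fine; it is the untouched
neighbour in the same stripe that is lost.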

>> I hold ext2/ext3 to higher standards than other filesystems in
>> the tree. I'd not use XFS/VFAT etc.
>>
>> I would not want people to migrate towards XFS/VFAT, and yes I believe
>> XFSs/VFATs/... requirements should be documented, too. (But I know too
>> little about those filesystems).
>>
>> If you can suggest better wording, please help me. But... those
>> requirements are non-trivial, commonly not met and the result is data
>> loss. It has to be documented somehow. Make it as innocent-looking as
>> you can...

>
> I think that you really need to step back and look harder at real
> failures - not just your personal experience - but a larger set of real
> world failures. Many papers have been published recently about that (the
> google paper, the Bianca paper from FAST, Netapp, etc).

The papers show failures in the "once a year" range. I have a "twice a
minute" failure scenario with flash disks.

Not sure how often "degraded raid5 breaks ext3 atomicity" would bite,
but I bet it would be on a "once a day" scale.

We should document those.
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html