On Thursday 27 August 2009 01:54:30 david@xxxxxxx wrote:
On Thu, 27 Aug 2009, Rob Landley wrote:
Today we have cheap plentiful USB keys that act like hard drives, except
that their write block size isn't remotely the same as hard drives', but
they pretend it is, and then the block wear levelling algorithms fuzz
things further. (Gee, a drive controller lying about drive geometry; the
SCSI crowd should feel right at home.)
actually, you don't know if your USB key works that way or not.
Um, yes, I think I do.
Pavel has some that do; that doesn't mean that all flash drives do
Pretty much all the ones that present a USB disk interface to the outside
world, and thus have to do hardware wear levelling. Here's Valerie Aurora on
the topic:
http://valhenson.livejournal.com/25228.html
Let's start with hardware wear-leveling. Basically, nearly all practical
implementations of it suck. You'd imagine that it would spread out writes
over all the blocks in the drive, only rewriting any particular block after
every other block has been written. But I've heard from experts several
times that hardware wear-leveling can be as dumb as a ring buffer of 12
blocks; each time you write a block, it pulls something out of the queue
and sticks the old block in. If you only write one block over and over,
this means that writes will be spread out over a staggering 12 blocks! My
direct experience working with corrupted flash with built-in wear-leveling
is that corruption was centered around frequently written blocks (with
interesting patterns resulting from the interleaving of blocks from
different erase blocks). As a file systems person, I know what it takes to
do high-quality wear-leveling: it's called a log-structured file system and
they are non-trivial pieces of software. Your average consumer SSD is not
going to have sufficient hardware to implement even a half-assed
log-structured file system, so clearly it's going to be a lot stupider than
that.
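To put a number on how dumb that can be, here's a toy user-space model of the
"ring buffer of 12 blocks" scheme she describes. Everything in it is made up
for illustration (the block counts, the mapping scheme); it shows the shape of
the problem, not anybody's actual firmware:

    /* Toy model of a "dumb" wear-leveler: a 12-entry ring of spare
     * physical blocks, rotated on every write.  Hypothetical
     * illustration only; real firmware is undocumented. */
    #include <stdio.h>

    #define SPARE_RING      12
    #define LOGICAL_BLOCKS  8

    static int map[LOGICAL_BLOCKS];        /* logical -> physical block */
    static int ring[SPARE_RING];           /* queue of spare physical blocks */
    static int ring_head;
    static int erase_count[LOGICAL_BLOCKS + SPARE_RING];  /* wear per physical block */

    static void write_block(int logical)
    {
        int fresh = ring[ring_head];       /* pull a spare out of the queue */
        int old = map[logical];

        erase_count[fresh]++;              /* writing the new copy costs an erase */
        map[logical] = fresh;
        ring[ring_head] = old;             /* stick the old block into the queue */
        ring_head = (ring_head + 1) % SPARE_RING;
    }

    int main(void)
    {
        int i;

        for (i = 0; i < LOGICAL_BLOCKS; i++)
            map[i] = i;
        for (i = 0; i < SPARE_RING; i++)
            ring[i] = LOGICAL_BLOCKS + i;

        for (i = 0; i < 1000000; i++)      /* hammer one logical block */
            write_block(0);

        for (i = 0; i < LOGICAL_BLOCKS + SPARE_RING; i++)
            if (erase_count[i])
                printf("physical block %2d: %d erases\n", i, erase_count[i]);
        return 0;
    }

Hammer one logical block a million times and all of those erases land on about
a dozen physical blocks; the rest of the drive never helps out.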
Back to you:
when you do a write to a flash drive, you have to do the following steps:
1. allocate an empty eraseblock to put the data on
2. read the old eraseblock
3. merge the incoming write to the eraseblock
4. write the updated data to the flash
5. update the flash translation layer to point reads at the new location
instead of the old location.
now if the flash drive does things in this order, you will not lose any
previously written data.
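As a sketch of those five steps (against an in-memory pretend flash; the
names, sizes, and layout here are all invented for the example, not any real
controller's interface):

    /* Illustrative model of the five steps above, against an in-memory
     * pretend flash.  Names, sizes, and layout are invented for the
     * sketch; real controllers don't publish their firmware. */
    #include <stdio.h>
    #include <string.h>

    #define EB_SIZE      4096              /* pretend eraseblock size */
    #define LOGICAL_EBS  4
    #define PHYSICAL_EBS 8

    static unsigned char flash[PHYSICAL_EBS][EB_SIZE];  /* the "chip" */
    static int map[LOGICAL_EBS];                        /* the translation layer */
    static int in_use[PHYSICAL_EBS];

    static int alloc_erased(void)          /* step 1: find an empty eraseblock */
    {
        int pb;
        for (pb = 0; pb < PHYSICAL_EBS; pb++)
            if (!in_use[pb]) {
                in_use[pb] = 1;
                memset(flash[pb], 0xff, EB_SIZE);       /* erased NAND reads 0xff */
                return pb;
            }
        return -1;
    }

    static void ftl_write(int logical, int offset, const void *data, int len)
    {
        unsigned char buf[EB_SIZE];
        int old_pb = map[logical];
        int new_pb = alloc_erased();                    /* 1. allocate             */

        memcpy(buf, flash[old_pb], EB_SIZE);            /* 2. read old eraseblock  */
        memcpy(buf + offset, data, len);                /* 3. merge incoming write */
        memcpy(flash[new_pb], buf, EB_SIZE);            /* 4. write updated data   */
        map[logical] = new_pb;                          /* 5. only now update map  */
        in_use[old_pb] = 0;                             /* old copy is recyclable  */

        /* Doing 5 before 4 would leave a window where the map points at
         * a still-erased block; a crash there loses the whole eraseblock. */
    }

    int main(void)
    {
        int i;

        for (i = 0; i < LOGICAL_EBS; i++)
            map[i] = alloc_erased();

        ftl_write(0, 100, "hello", 5);
        ftl_write(0, 200, "world", 5);
        printf("logical eraseblock 0 now lives in physical %d\n", map[0]);
        printf("contents: %.5s %.5s\n",
               (char *)&flash[map[0]][100], (char *)&flash[map[0]][200]);
        return 0;
    }

The ordering is the whole point: the map only flips to the new eraseblock
after the data is actually programmed, so a crash anywhere in between still
leaves the old copy reachable.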
That's what something like jffs2 will do, sure. (And note that mounting those
suckers is slow while it reads the whole disk to figure out what order to put
the chunks in.)
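(The slow mount boils down to something like this: every node on the medium
gets read and compared by version before the filesystem knows which copy of
each chunk is current. A toy illustration only, not jffs2's actual data
structures or scan code:

    /* Toy version of the mount-time scan: read every node, keep the
     * highest version of each chunk.  Illustration only, not jffs2's
     * actual scan code. */
    #include <stdio.h>

    struct node {
        unsigned int ino;      /* which file the chunk belongs to */
        unsigned int offset;   /* where in the file it goes       */
        unsigned int version;  /* higher version wins             */
    };

    /* Pretend this is what a full-device scan turned up, in flash order. */
    static const struct node scanned[] = {
        { 1, 0,    7 },
        { 2, 4096, 3 },
        { 1, 0,   12 },        /* a newer copy of the same chunk */
    };

    int main(void)
    {
        unsigned int latest = 0;
        unsigned int i;

        /* Every node has to be looked at just to find the current copy
         * of inode 1, offset 0. */
        for (i = 0; i < sizeof(scanned) / sizeof(scanned[0]); i++)
            if (scanned[i].ino == 1 && scanned[i].offset == 0 &&
                scanned[i].version > latest)
                latest = scanned[i].version;

        printf("current copy of inode 1 @ 0 is version %u\n", latest);
        return 0;
    }

Scale that up to every node on a multi-gigabyte device and you see where the
mount time goes.)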
However, your average consumer-level device A) isn't very smart, and B) is
judged almost entirely by its price/capacity ratio, and thus usually won't even
hide capacity for bad block remapping. You expect them to set aside significant
hidden capacity for safer updates when customers aren't demanding it yet?
if the flash drive does step 5 before it does step 4, then you have a
window where a crash can lose data (and no, btrfs won't survive having a
large chunk of data just disappear any better)
it's possible that some super-cheap flash drives
I've never seen one that presented a USB disk interface that _didn't_ do this.
(Not that this observation means much.) Neither the Windows nor the Macintosh
world is calling for this yet. Even the Linux guys barely know about it. And
these are the same kinds of manufacturers that NOPed out the flush commands to
make their benchmarks look better...
but if the device doesn't have a flash translation layer, then repeated
writes to any one sector will kill the drive fairly quickly. (For example,
every change to the disk requires an update to the FAT, so the sectors the
FAT, journal, root directory, or superblock live in would get killed off in
short order.)
Yup. It's got enough of one to get past the warranty, but beyond that they're
intended for archiving and sneakernet, not for running compiles on.
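Back-of-the-envelope on the FAT-sector point, with made-up numbers (both the
endurance figure and the update rate below are assumptions picked for
illustration, not specs for any particular part):

    /* Rough wear arithmetic for a FAT sector with no translation layer.
     * Both numbers are assumptions picked for illustration. */
    #include <stdio.h>

    int main(void)
    {
        const long erase_cycles    = 10000;  /* assumed endurance of one block */
        const long updates_per_day = 2000;   /* assumed FAT updates per day    */

        printf("block lasts roughly %ld days (%ld cycles / %ld updates per day)\n",
               erase_cycles / updates_per_day, erase_cycles, updates_per_day);
        return 0;
    }

With every metadata change hitting the same physical block, "fairly quickly"
means days, not years.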
That said, ext3's assumption that filesystem block size always >= disk
update block size _is_ a fundamental part of this problem, and one that
isn't shared by things like jffs2, and which things like btrfs might be
able to address if they try, by adding awareness of the real media update
granularity to their node layout algorithms. (Heck, ext2 has a stripe
size parameter already. Does setting that appropriately for your RAID
make this suck less? I haven't heard anybody comment on that one yet...)
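For what it's worth, "awareness of the real media update granularity" mostly
boils down to the same alignment trick as the RAID stripe parameters: size and
place allocations so one logical update stays inside one eraseblock. A minimal
sketch of the rounding involved (the 128k eraseblock size is an assumed
example, and none of this is lifted from any real filesystem):

    /* Minimal sketch of eraseblock-aware placement: round allocations to
     * the medium's update granularity so one logical write doesn't
     * straddle two eraseblocks.  The 128k figure is an assumed example,
     * not something probed from real hardware. */
    #include <stdio.h>

    #define ERASEBLOCK (128 * 1024UL)

    static unsigned long align_down(unsigned long off)
    {
        return off & ~(ERASEBLOCK - 1);
    }

    static unsigned long align_up(unsigned long off)
    {
        return align_down(off + ERASEBLOCK - 1);
    }

    int main(void)
    {
        unsigned long start = 100000, len = 50000;

        printf("unaligned write [%lu, %lu) spans eraseblocks [%lu, %lu)\n",
               start, start + len, align_down(start), align_up(start + len));

        /* Bumping the start up to an eraseblock boundary keeps the whole
         * update inside a single eraseblock. */
        start = align_up(start);
        printf("aligned write   [%lu, %lu) spans eraseblocks [%lu, %lu)\n",
               start, start + len, align_down(start), align_up(start + len));
        return 0;
    }
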
I thought that that assumption was in the VFS layer, not in any particular
filesystem.
The VFS layer cares about how to talk to the backing store? I thought that
was the filesystem driver's job...
I wonder how jffs2 gets around it, then? (Or for that matter, squashfs...)