On Thursday 27 August 2009 06:43:49 Ric Wheeler wrote:
> On 08/26/2009 11:53 PM, Rob Landley wrote:
> > On Tuesday 25 August 2009 18:40:50 Ric Wheeler wrote:
> > > Repeat experiment until you get up to something like google scale or the
> > > other papers on failures in national labs in the US and then we can have
> > > an informed discussion.
> >
> > On google scale anvil lightning can fry your machine out of a clear sky.
> >
> > However, there are still a few non-enterprise users out there, and
> > knowing that specific usage patterns don't behave like they expect might
> > be useful to them.
>
> You are missing the broader point of both papers.

No, I'm dismissing the papers (some of which I read when they first came out
and got slashdotted) as irrelevant to the topic at hand.

Pavel has two failure modes which he can trivially reproduce. The USB stick
one is reproducible on a laptop by jostling said stick. I myself used to have
a literal USB keychain, and the weight of keys dangling from it pulled it out
of the USB socket fairly easily if I wasn't careful. At the time nobody had
told me a journaling filesystem was not a reasonable safeguard here.

Presumably the degraded RAID one can be reproduced under an emulator, with no
hardware directly involved at all, so talking about hardware failure rates
ignores the fact that he's actually discussing a _software_ problem. It may
happen in _response_ to hardware failures, but the damage he's attempting to
document happens entirely in software.

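To make the "entirely in software" point concrete, here is a minimal sketch of the degraded-array case with no hardware involved. It assumes a hypothetical three-disk RAID5 stripe (two data blocks plus XOR parity) and an interrupted update in which the new data block reaches the disk but the matching parity write does not. The block contents, the names, and the ordering of the two writes are illustrative assumptions, not a description of what md or any particular controller does.

```python
from functools import reduce

def xor_blocks(*blocks: bytes) -> bytes:
    """Bytewise XOR of equal-length blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# One stripe: two data blocks plus their parity (contents are made up).
d0 = b"AAAA"
d1 = b"BBBB"
parity = xor_blocks(d0, d1)

# The disk holding d1 has failed, so d1 now exists only as d0 XOR parity.
assert xor_blocks(d0, parity) == d1

# The filesystem rewrites d0 only. The array has to update d0 and parity,
# which are two separate device writes. Simulate losing power between them.
d0_on_disk = b"CCCC"       # the new data made it to disk
parity_on_disk = parity    # the parity update did not (assumed ordering)

# After reboot, reading d1 means reconstructing it from what is on disk.
reconstructed_d1 = xor_blocks(d0_on_disk, parity_on_disk)
d1_survived = reconstructed_d1 == d1
print("d1 intact after crash:", d1_survived)
```

The block that ends up damaged is d1, which the filesystem never wrote and therefore never journaled, so replaying the journal for d0 has nothing to say about it.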
These failure modes can cause data loss which journaling can't help, but which
journaling might (or might not) conceivably hide so you don't immediately
notice it. They share a common underlying assumption that the storage
device's update granularity is less than or equal to the filesystem's block
size, which is not actually true of all modern storage devices. The fact he's
only _found_ two instances where this assumption bites doesn't mean there
aren't more waiting to be found, especially as more new storage media types
are introduced.

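The granularity assumption can be sketched the same way. The toy device below is hypothetical: it assumes a flash-like medium that can only rewrite a whole erase block (read, modify, erase, program) and that loses the in-flight copy if power drops after the erase. The sizes are scaled down to a few bytes, and `ToyFlash` and `lose_power_after_erase` are names invented for this illustration; real devices vary.

```python
FS_BLOCK = 4       # filesystem block size in bytes (scaled down)
ERASE_BLOCK = 16   # device update granularity: four filesystem blocks

class ToyFlash:
    """Hypothetical device that can only rewrite a whole erase block."""

    def __init__(self, size: int) -> None:
        self.cells = bytearray(b"." * size)

    def write(self, offset: int, data: bytes,
              lose_power_after_erase: bool = False) -> None:
        # Assumes the write does not cross an erase-block boundary.
        start = offset - offset % ERASE_BLOCK
        saved = bytearray(self.cells[start:start + ERASE_BLOCK])        # read
        saved[offset - start:offset - start + len(data)] = data         # modify
        self.cells[start:start + ERASE_BLOCK] = b"\xff" * ERASE_BLOCK   # erase
        if lose_power_after_erase:
            return  # the copy held in device RAM is gone
        self.cells[start:start + ERASE_BLOCK] = saved                   # program

dev = ToyFlash(32)
dev.write(0, b"old0")
dev.write(FS_BLOCK, b"old1")   # neighbouring block, committed long ago

# The filesystem journals and rewrites block 0 only; power is lost mid-update.
dev.write(0, b"new0", lose_power_after_erase=True)

# Journal replay: the journal knows about block 0, so it rewrites block 0.
dev.write(0, b"new0")

neighbour = bytes(dev.cells[FS_BLOCK:2 * FS_BLOCK])
print("block 0:", bytes(dev.cells[0:FS_BLOCK]), "block 1:", neighbour)
```

Replay restores block 0 as intended, but block 1 shared the erase block and comes back as erased cells. Nothing in the journal records that block 1 was ever at risk.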
Pavel's response was to attempt to document this. Not that journaling is
_bad_, but that it doesn't protect against this class of problem.

Your response is to talk about google clusters, cloud storage, and cite
academic papers on statistical hardware failure rates. As I understand the
discussion, that's not actually the issue Pavel's talking about, merely one
potential trigger for it.