Re: limits on raid

From: David Greaves
Date: Fri Jun 22 2007 - 12:56:09 EST


Bill Davidsen wrote:
> David Greaves wrote:
>> david@xxxxxxx wrote:
>>> On Fri, 22 Jun 2007, David Greaves wrote:
>> If you end up 'fiddling' in md because someone specified --assume-clean on a raid5 [in this case just to save a few minutes *testing time* on a system with a heavily choked bus!] then that adds *even more* complexity and exception cases into all the stuff you described.
>
> A "few minutes?" Are you reading the times people are seeing with multi-TB arrays? Let's see, 5TB at a rebuild rate of 20MB/s... three days.
Yes. But we are talking initial creation here.
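(For reference, that arithmetic: 5 TB is about 5,000,000 MB, and at 20 MB/s that is 250,000 seconds - a shade under 3 days - so the three-day figure is right.)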

> And as soon as you believe that the array is actually "usable" you cut that rebuild rate, perhaps in half, and get dog-slow performance from the array. It's usable in the sense that reads and writes work, but for useful work it's pretty painful. You either fail to understand the magnitude of the problem or wish to trivialize it for some reason.
I do understand the problem and I'm not trying to trivialise it :)

I _suggested_ that it's worth thinking about things rather than jumping in to say "oh, we can code up a clever algorithm that keeps track of what stripes have valid parity and which don't and we can optimise the read/copy/write for valid stripes and use the raid6 type read-all/write-all for invalid stripes and then we can write a bit extra on the check code to set the bitmaps......"

Phew - and that lets us run the array at semi-degraded performance (raid6-like) for 3 days rather than either waiting before we put it into production or running it very slowly.
Now we run this system for 3 years and we've saved 3 days - hmmm, IS IT WORTH IT?

What happens in those 3 years when we have a disk fail? The solution doesn't apply then - it's 3 days to rebuild - like it or not.

> By delaying parity computation until the first write to a stripe, only the growth of a filesystem is slowed, and all data are protected without waiting for the lengthy check. The rebuild speed can be set very low, because on-demand rebuild will do most of the work.
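
Just to be concrete, I imagine the scheme we're both describing amounts to something like this toy sketch - invented names (parity_valid[], handle_write(), background_resync()), userspace C, nothing like the real md code:

/* Toy model of the "parity-valid bitmap" scheme - purely illustrative,
 * not the md implementation.  One bit per stripe: set = parity known good. */
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

#define NSTRIPES (1u << 12)                 /* tiny example figure only */

static uint8_t parity_valid[NSTRIPES / 8];  /* the extra state to maintain */

static int stripe_parity_valid(uint32_t s)
{
    return parity_valid[s >> 3] & (1u << (s & 7));
}

static void mark_parity_valid(uint32_t s)
{
    parity_valid[s >> 3] |= (uint8_t)(1u << (s & 7));
}

/* Write path: choose the strategy per stripe. */
static void handle_write(uint32_t stripe)
{
    if (stripe_parity_valid(stripe)) {
        /* normal raid5 read-modify-write: read old data + old parity,
         * xor in the new data, write both back */
        printf("stripe %u: read-modify-write\n", (unsigned)stripe);
    } else {
        /* parity never initialised: read/construct the whole stripe,
         * compute parity from scratch ("read-all/write-all") and
         * remember that the stripe is now good */
        printf("stripe %u: reconstruct-write, set bitmap bit\n", (unsigned)stripe);
        mark_parity_valid(stripe);
    }
}

/* Background resync: only visits stripes whose bit is still clear, so it
 * can be throttled right down - first writes do most of the work. */
static void background_resync(void)
{
    for (uint32_t s = 0; s < NSTRIPES; s++) {
        if (stripe_parity_valid(s))
            continue;                   /* a foreground write already fixed it */
        /* read all data blocks, compute parity, write it out */
        mark_parity_valid(s);
        usleep(1000);                   /* crude rate limiting */
    }
}

int main(void)
{
    handle_write(42);                   /* first write: reconstruct-write */
    handle_write(42);                   /* second write: read-modify-write */
    background_resync();                /* quietly fills in the rest */
    return 0;
}

Even in toy form you can see the extra bookkeeping: a persistent bitmap, two write paths, and a resync that has to stay coherent with both.
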
I am not saying you are wrong.
I merely ask whether the benefit outweighs the added complexity.

If the benefit applied 24x7 then sure - e.g. using hardware assist for the raid calculations - that would be very useful indeed.

I'm very much for the fs layer reading the lower block structure so I don't have to fiddle with arcane tuning parameters - yes, *please* help make xfs self-tuning!
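
By way of illustration, the sort of thing I mean is below - a toy only, assuming the chunk_size and raid_disks attributes under /sys/block/md0/md/ (I believe both are there); the real win would be mkfs doing this itself:

/* Toy: pull the md geometry out of sysfs and derive the stripe unit /
 * stripe width a mkfs could use.  "md0" is just an example device. */
#include <stdio.h>

static long read_long(const char *path)
{
    long v = -1;
    FILE *f = fopen(path, "r");
    if (f) {
        if (fscanf(f, "%ld", &v) != 1)
            v = -1;
        fclose(f);
    }
    return v;
}

int main(void)
{
    long chunk = read_long("/sys/block/md0/md/chunk_size");   /* bytes */
    long disks = read_long("/sys/block/md0/md/raid_disks");

    if (chunk > 0 && disks > 1)
        printf("stripe unit %ld KiB, stripe width %ld KiB\n",
               chunk / 1024,
               (chunk / 1024) * (disks - 1));   /* raid5: n-1 data disks */
    else
        printf("no md geometry found\n");
    return 0;
}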

Keeping life as straightforward as possible lower down makes the upward interface more manageable, and that goal more realistic...

> Those two paragraphs are mutually exclusive. The fs can be simple because it rests on a simple device, even if the "simple device" is provided by LVM or md. And LVM and md can stay simple because they rest on simple devices, even if they are provided by PATA, SATA, nbd, etc. Independent layers make each layer more robust. If you want to compromise the layer separation, some approach like ZFS with full integration would seem to be promising. Note that layers allow specialized features at each point, trading integration for flexibility.

That's a simplistic summary.
You *can* loosely couple the layers. But you can also enrich the interface and tightly couple them - XFS is capable (I guess) of understanding md more fully than, say, ext2 does.
XFS would still work on a less 'talkative' block device where performance wasn't as important (USB flash maybe, dunno).


> My feeling is that full integration and independent layers each have benefits; as you connect the layers to expose operational details, you need to handle changes in those details, which would seem to make the layers more complex.
Agreed.

> What I'm looking for here is better performance in one particular layer, the md RAID5 layer. I like to avoid unnecessary complexity, but I feel that the current performance suggests room for improvement.

I agree there is room for improvement.
I suggest it may be more fruitful to write a tool called "raid5prepare" that writes zeroes (or ones, as appropriate) to all component devices, after which --assume-clean can be used without concern. It could look to see whether the devices are scsi or whatever and take advantage of the hyperfast block writes they can do.
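
Something like this as a starting point (a sketch only - no such tool exists; a real raid5prepare would detect scsi etc. and use the device-level fast-zeroing commands instead of this dumb write loop, which on its own is no quicker than letting md do the initial sync):

/* raid5prepare - sketch.  Zero every component device so that an array
 * created from them already has consistent (all-zero) parity, making
 * --assume-clean genuinely safe. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

#define CHUNK (1024 * 1024)

static char zeros[CHUNK];               /* static => already all zero */

static void zero_device(const char *path)
{
    int fd = open(path, O_WRONLY);
    if (fd < 0) {
        perror(path);
        return;
    }
    /* write(2) fails (or returns 0) once we run off the end of the device */
    while (write(fd, zeros, CHUNK) > 0)
        ;
    fsync(fd);
    close(fd);
}

int main(int argc, char **argv)
{
    for (int i = 1; i < argc; i++) {
        printf("zeroing %s ...\n", argv[i]);
        zero_device(argv[i]);
    }
    return 0;
}

After that, an mdadm --create with --assume-clean on the zeroed devices shouldn't leave any stripes with bad parity, since all-zero data gives all-zero parity.
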

David