Re: [RFC][PATCH] MD RAID Acceleration: Move stripe operations outside the spin lock

From: Dan Williams
Date: Fri May 19 2006 - 14:07:59 EST


This certainly looks like it is heading in the right direction - thanks.

I have a couple of questions, which will probably lead to more.

You obviously need some state-machine functionality to oversee the
progression like xor - drain - xor (for RMW) or clear - copy - xor
(for RCW).
You have encoded the different states on a per block basis (storing it
in sh->dev[x].flags) rather than on a per-strip basis (and so encoding
it in sh->state).
What was the reason for this choice?

The only reason I can think of is to allow more parallelism :
different blocks within a strip can be in different states. I cannot
see any value in this as the 'xor' operation will work across all
(active) blocks in the strip and so you will have to synchronise all
the blocks on that operation.

I feel the code would be simpler if the state was in the strip rather
than the block.
I agree, I started out down this path but then I had trouble
determining which blocks were part of the operation without the
overhead of taking the sh->lock. So, the original decision was based
on the design goal of minimizing the locking required in the
workqueue. But your questions sparked a realization about how to
rework this, although I think it requires the wait_for_block_op queue,
see below...

The wait_for_block_op queue and it's usage seems odd to me.
handle_stripe should never have to wait for anything.
when a block_op is started, the sh->count should be incremented, and
the decremented when the block-ops have finished. Only when will
handle_stripe get to look at the stripe_head again. So waiting
shouldn't be needed.
The case I was concerned about was when two threads are at the top of
handle_stripe, one gets in and kicks off a block op while the other
spins. handle_write_operations makes the assumption that it can
blindly advance the state without worrying about whether the workqueue
has had a chance to dequeue the request. Without the wait, the
workqueue races with the 2nd thread coming out of the spin lock. I.e.
the thread in handle stripe may advance the state before the request
is dequeued in the workqueue. The other purpose is that it allows the
workqueue to read the state bits without taking a lock. At run time
it appears that it does not take the wait path very often, it takes it
once at the beginning of a set of operations but after the first strip
is recycled on to the handle list there are enough strips queued up
such that it does not need to wait.

As I wrote "The other purpose is that it allows the workqueue to read
the state bits without taking a lock" I realized that I can rewrite
the code to have the state information solely in sh->state.

Your GFP_KERNEL kmalloc is a definite no-no which can lead to
deadlocks (it might block while trying to write data out thought the
same raid array). At least it should be GFP_NOIO.
Ahh, I did not consider the case of I/O out the same md device. A
hypothetical question though, would we keep recursing (threads
sleeping waiting for I/O) to a point where kmalloc would eventually
return 0 (i.e. to a point where we can take the allocation failure
path), or would we deadlock at some point?

However a better solution would be to create and use a mempool - they
are designed for exactly this sort of usage.
However I'm not sure if even that is needed. Can a stripe_head have
move than one active block_ops task happening? If not, the
'stripe_work' should be embedded in the 'stripe_head'.
The 'more than one active blocks_ops' case I am considering is a read
(bio fill) that is issued in between the bio drain and xor operation
for a write. However, another realization, with the wait_for_block_op
queue it should be ok to have the stripe_work embedded in the
stripe_head since the next operation will not be issued until the
first is dequeued.

There will probably be more questions once these are answered, but as
the code is definitely a good start.

Thanks for being patient with my attempts.

(*) Since reading the DDF documentation again, I've realised that
using the word 'stripe' both for a chunk-wide stripe and a block-wide
stripe is very confusing. So I've started using 'strip' for a
block-wide stripe. Thus a 'stripe_head' should really be a

I hope this doesn't end up being even more confusing :-)

I actually think this will make things less confusing.

A question, is it inefficient to discuss this with the 2.6.17-rc state
of the code, should I move over to the latest -mm?

Thanks again for your help,

To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at
Please read the FAQ at