Re: [dm-devel] [RFC] block: fix blk_queue_split() resource exhaustion

From: Lars Ellenberg
Date: Fri Jul 08 2016 - 04:02:39 EST


On Fri, Jul 08, 2016 at 08:07:52AM +1000, NeilBrown wrote:
> Before I introduced the recursion limiting, requests were handled as an
> in-order tree walk. The code I wrote tried to preserve that but didn't
> for several reasons. I think we need to restore the original in-order
> walk because it makes the most sense.
> So after processing a particular bio, we should then process all the
> 'child' bios - bios send to underlying devices. Then the 'sibling'
> bios, that were split off, and then any remaining parents and ancestors.
>
> You patch created the right structures for doing this, my proposal took
> it a step closer, but now after more careful analysis I don't think it
> is quite right.
> With my previous proposal (and you latest patch - thanks!) requests for
> "this" level are stacked, but they should be queued.
> If a make_request_fn only ever submits one request for this level and
> zero or more lower levels, then the difference between a queue and a
> stack is irrelevant. If it submited more that one, a stack would cause
> them to be handled in the reverse order.

We have a device stack.
q_this_level->make_request_fn() cannot possibly submit anything
on "this_level", or it would create a device loop, I think.

So we start with the initial, "top most" call to generic_make_request().
That is one single bio. All queues are empty.

This bio is then passed on to its destination queue make_request_fn().

Which may chose to split it (via blk_queue_split, or like dm does, or
else). If it, like blk_queue_split() does, splits it into
"piece-I-can-handle-now" and "remainder", both still targeted at the
top most (current) queue, I think the "remainder" should just be pushed
back, which will make it look as if upper layers did
generic_make_request("piece-I-can-handle-now");
generic_make_request("remainder");
Which I do, by using bio_list_add_head(remainder, bio); (*head*).

I don't see any other way for a make_request_fn(bio(l=x)) to generate
"sibling" bios to the same level (l=x) as its own argument.

This same q(l=x)->make_request_fn(bio(l=x)) may now call
generic_make_request() for zero or more "child" bios (l=x+1),
which are queued in order: bio_list_add(recursion, bio); (*tail*).
Then, once l=x returns, the queue generated by it is spliced
in front of the "remainder" (*head*).
All bios are processed in the order they have been queued,
by peeling off of the head.

After all "child" bios of level l>=x+1 have been processed,
the next bio to be processed will be the "pushed back" remainder.

All "Natural order".

> To make the patch "perfect", and maybe even more elegant we could treat
> ->remainder and ->recursion more alike.
> i.e.:
> - generic make request has a private "stack" of requests.
> - before calling ->make_request_fn(), both ->remainder and ->recursion
> are initialised
> - after ->make_request_fn(), ->remainder are spliced in to top of
> 'stack', then ->recursion is spliced onto that.
> - If stack is not empty, the top request is popped and we loop to top.
>
> This reliably follows in-order execution, and handles siblings correctly
> (in submitted order) if/when a request splits off multiple siblings.

The only splitting that creates siblings on the current level
is blk_queue_split(), which splits the current bio into
"front piece" and "remainder", already processed in this order.

Anything else creating "siblings" is not creating siblings for the
current layer, but for the next deeper layer, which are queue on
"recursion" and also processed in the order they have been generated.

> I think that as long a requests are submitted in the order they are
> created at each level there is no reason to expect performance
> regressions.
> All we are doing is changing the ordering between requests generated at
> different levels, and I think we are restoring a more natural order.

I believe both patches combined are doing exactly this already.
I could rename .remainder to .todo or .incoming, though.

.incoming = [ bio(l=0) ]
.recursion = []

split

.incoming = [ bio(l=0,now_1), bio(l=0,remainder_1) ]
.recursion = []

process head of .incoming

.incoming = [ bio(l=0,remainder_1) ]
.recursion = [ bio(l=1,a), bio(l=1,b), bio(l=1,c), ... ]

merge_head

.incoming = [ bio(l=1,a), bio(l=1,b), bio(l=1,c), ...,
bio(l=0,remainder_1) ]
.recursion = []

process head of .incoming, potentially split first

.incoming = [ bio(l=1,a,now), bio(l=1,a,remainder), bio(l=1,b), bio(l=1,c), ...,
bio(l=0,remainder_1) ]
...
.incoming = [ bio(l=1,a,remainder), bio(l=1,b), bio(l=1,c), ...,
bio(l=0,remainder_1) ]
.recursion = [ bio(l=2,aa), bio(l=2,ab), ... ]

merge_head

.incoming = [ bio(l=2,aa), bio(l=2,ab), ...,
bio(l=1,a,remainder), bio(l=1,b), bio(l=1,c), ...,
bio(l=0,remainder_1) ]
.recursion = []

...

process away ... until back at l=0

.incoming = [ bio(l=0,remainder_1) ]
.recursion = []

potentially split further
.incoming = [ bio(l=0,now_2), bio(l=0,remainder_2) ]
.recursion = []

rinse, repeat.

Thanks,

Lars Ellenberg