Re: [PATCH] blk: improve order of bio handling in generic_make_request()

From: Jack Wang
Date: Fri Mar 03 2017 - 04:30:01 EST




On 03.03.2017 06:14, NeilBrown wrote:
>
> [ Hi Jens,
> you might have seen assorted email threads recently about
> deadlocks, particularly in dm-snap or md/raid1/10. Also about
> the excess of rescuer threads.
> I think a big part of the problem is my ancient improvement
> to generic_make_request to queue bios and handle them in
> a strict FIFO order. As described below, that can cause
> problems which individual block devices cannot fix themselves
> without punting to various threads.
> This patch does not fix everything, but provides a basis that
> drivers can build on to create deadlock-free solutions without
> excess threads.
> If you accept this, I will look into improving at least md
> and bio_alloc_bioset() to be less dependent on rescuer threads.
> Thanks,
> NeilBrown
> ]
>
>
> To avoid recursion on the kernel stack when stacked block devices
> are in use, generic_make_request() will, when called recursively,
> queue new requests for later handling. They will be handled when the
> make_request_fn for the current bio completes.
>
> If any bios are submitted by a make_request_fn, these will ultimately
> be handled sequentially. If the handling of one of those generates
> further requests, they will be added to the end of the queue.
>
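For readers who don't have the surrounding code in their head: the
pre-patch draining loop behaves roughly like the simplified sketch below
(generic_make_request_checks() and the blk_queue_enter() failure path
are omitted):

blk_qc_t generic_make_request(struct bio *bio)
{
	struct bio_list bio_list_on_stack;
	blk_qc_t ret = BLK_QC_T_NONE;

	if (current->bio_list) {
		/* a make_request_fn is already active on this task:
		 * queue the bio and let the outer loop handle it */
		bio_list_add(current->bio_list, bio);
		return BLK_QC_T_NONE;
	}

	bio_list_init(&bio_list_on_stack);
	current->bio_list = &bio_list_on_stack;
	do {
		struct request_queue *q = bdev_get_queue(bio->bi_bdev);

		/* may add further bios to current->bio_list */
		ret = q->make_request_fn(q, bio);
		/* strict FIFO: the oldest queued bio is handled next */
		bio = bio_list_pop(current->bio_list);
	} while (bio);
	current->bio_list = NULL;	/* deactivate */

	return ret;
}
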
> This strict first-in-first-out behaviour can lead to deadlocks in
> various ways, normally because a request might need to wait for a
> previous request to the same device to complete. This can happen when
> they share a mempool, and can happen due to interdependencies
> particular to the device. Both md and dm have examples where this happens.
>
> These deadlocks can be eradicated by more selective ordering of bios.
> Specifically by handling them in depth-first order. That is: when the
> handling of one bio generates one or more further bios, they are
> handled immediately after the parent, before any siblings of the
> parent. That way, when generic_make_request() calls make_request_fn
> for some particular device, we can be certain that all previously
> submitted requests for that device have been completely handled and are
> not waiting for anything in the queue of requests maintained in
> generic_make_request().
>
> An easy way to achieve this would be to use a last-in-first-out stack
> instead of a queue. However this will change the order of consecutive
> bios submitted by a make_request_fn, which could have unexpected consequences.
> Instead we take a slightly more complex approach.
> A fresh queue is created for each call to a make_request_fn. After it completes,
> any bios for a different device are placed on the front of the main queue, followed
> by any bios for the same device, followed by all bios that were already on
> the queue before the make_request_fn was called.
> This provides the depth-first approach without reordering bios on the same level.
>
> This, by itself, is not enough to remove the deadlocks. It just makes
> it possible for drivers to take the extra step required themselves.
>
> To avoid deadlocks, drivers must never risk waiting for a request
> after submitting one to generic_make_request. This includes never
> allocating from a mempool twice in a single call to a make_request_fn.
>
> A common pattern in drivers is to call bio_split() in a loop, handling
> the first part and then looping around to possibly split the next part.
> Instead, a driver that finds it needs to split a bio should queue
> (with generic_make_request) the second part, handle the first part,
> and then return. The new code in generic_make_request will ensure the
> bios submitted to underlying devices are processed first, then the second bio
> that was split off. If it splits again, the same process happens. In
> each case one bio will be completely handled before the next one is attempted.
>
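As an illustration of that pattern (this is only a sketch, not code from
the patch; "max_sectors" and "split_bs" stand in for whatever limit and
per-device bio_set a real driver has):

static blk_qc_t my_make_request(struct request_queue *q, struct bio *bio)
{
	if (bio_sectors(bio) > max_sectors) {
		/* split off the part we can handle now */
		struct bio *split = bio_split(bio, max_sectors,
					      GFP_NOIO, split_bs);

		bio_chain(split, bio);
		/*
		 * Queue the remainder instead of looping over it.
		 * generic_make_request() will only come back to it once
		 * "split", and anything submitted while handling it,
		 * has been dealt with completely.
		 */
		generic_make_request(bio);
		bio = split;
	}

	/* ... handle "bio", which now fits, and return ... */
	return BLK_QC_T_NONE;
}
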
> With this in place, it should be possible to disable the
> punt_bios_to_rescuer() rescuer thread for many block devices, and
> eventually it may be possible to remove it completely.
>
> Tested-by: Jinpu Wang <jinpu.wang@xxxxxxxxxxxxxxxx>
> Inspired-by: Lars Ellenberg <lars.ellenberg@xxxxxxxxxx>
> Signed-off-by: NeilBrown <neilb@xxxxxxxx>
> ---
> block/blk-core.c | 22 ++++++++++++++++++++++
> 1 file changed, 22 insertions(+)
>
> diff --git a/block/blk-core.c b/block/blk-core.c
> index b9e857f4afe8..ef55f210dd7c 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -2018,10 +2018,32 @@ blk_qc_t generic_make_request(struct bio *bio)
>  		struct request_queue *q = bdev_get_queue(bio->bi_bdev);
> 
>  		if (likely(blk_queue_enter(q, false) == 0)) {
> +			struct bio_list hold;
> +			struct bio_list lower, same;
> +
> +			/* Create a fresh bio_list for all subordinate requests */
> +			bio_list_init(&hold);
> +			bio_list_merge(&hold, &bio_list_on_stack);
> +			bio_list_init(&bio_list_on_stack);
>  			ret = q->make_request_fn(q, bio);
> 
>  			blk_queue_exit(q);
> 
> +			/* sort new bios into those for a lower level
> +			 * and those for the same level
> +			 */
> +			bio_list_init(&lower);
> +			bio_list_init(&same);
> +			while ((bio = bio_list_pop(&bio_list_on_stack)) != NULL)
> +				if (q == bdev_get_queue(bio->bi_bdev))
> +					bio_list_add(&same, bio);
> +				else
> +					bio_list_add(&lower, bio);
> +			/* now assemble so we handle the lowest level first */
> +			bio_list_merge(&bio_list_on_stack, &lower);
> +			bio_list_merge(&bio_list_on_stack, &same);
> +			bio_list_merge(&bio_list_on_stack, &hold);
> +
>  			bio = bio_list_pop(current->bio_list);
>  		} else {
>  			struct bio *bio_next = bio_list_pop(current->bio_list);
>

Thanks, Neil, for pushing the fix.

We can optimize generic_make_request() a little bit:
- assign the bio_list struct 'hold' directly instead of doing an init and
  merge (see the note below on why this is equivalent)
- hoist the duplicated bio_list_pop() call out of the if/else branches

I think it would be better to squash this into your fix.
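
The direct assignment is safe because struct bio_list is just a pair of
pointers:

struct bio_list {
	struct bio *head;
	struct bio *tail;
};

so "hold = bio_list_on_stack;" copies head and tail in one go, which is
exactly what the bio_list_init() + bio_list_merge() pair did when "hold"
started out empty.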
---
block/blk-core.c | 9 ++-------
1 file changed, 2 insertions(+), 7 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 3bc7202..b29b7e5 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -2147,8 +2147,7 @@ blk_qc_t generic_make_request(struct bio *bio)
 			struct bio_list lower, same, hold;
 
 			/* Create a fresh bio_list for all subordinate requests */
-			bio_list_init(&hold);
-			bio_list_merge(&hold, &bio_list_on_stack);
+			hold = bio_list_on_stack;
 			bio_list_init(&bio_list_on_stack);
 
 			ret = q->make_request_fn(q, bio);
@@ -2168,14 +2167,10 @@ blk_qc_t generic_make_request(struct bio *bio)
 			bio_list_merge(&bio_list_on_stack, &lower);
 			bio_list_merge(&bio_list_on_stack, &same);
 			bio_list_merge(&bio_list_on_stack, &hold);
-
-			bio = bio_list_pop(current->bio_list);
 		} else {
-			struct bio *bio_next = bio_list_pop(current->bio_list);
-
 			bio_io_error(bio);
-			bio = bio_next;
 		}
+		bio = bio_list_pop(current->bio_list);
 	} while (bio);
 	current->bio_list = NULL; /* deactivate */

--

Regards,
Jack