[PATCH] blk: improve order of bio handling in generic_make_request()

From: NeilBrown
Date: Fri Mar 03 2017 - 00:15:01 EST



[ Hi Jens,
you might have seen assorted email threads recently about
deadlocks, particular in dm-snap or md/raid1/10. Also about
the excess of rescuer threads.
I think a big part of the problem is my ancient improvement
to generic_make_request to queue bios and handle them in
a strict FIFO order. As described below, that can cause
problems which individual block devices cannot fix themselves
without punting to various threads.
This patch does not fix everything, but provides a basis that
drives can build on to create dead-lock free solutions without
excess threads.
If you accept this, I will look into improving at least md
and bio_alloc_set() to be less dependant on rescuer threads.
Thanks,
NeilBrown
]


To avoid recursion on the kernel stack when stacked block devices
are in use, generic_make_request() will, when called recursively,
queue new requests for later handling. They will be handled when the
make_request_fn for the current bio completes.

If any bios are submitted by a make_request_fn, these will ultimately
handled seqeuntially. If the handling of one of those generates
further requests, they will be added to the end of the queue.

This strict first-in-first-out behaviour can lead to deadlocks in
various ways, normally because a request might need to wait for a
previous request to the same device to complete. This can happen when
they share a mempool, and can happen due to interdependencies
particular to the device. Both md and dm have examples where this happens.

These deadlocks can be erradicated by more selective ordering of bios.
Specifically by handling them in depth-first order. That is: when the
handling of one bio generates one or more further bios, they are
handled immediately after the parent, before any siblings of the
parent. That way, when generic_make_request() calls make_request_fn
for some particular device, it we can be certain that all previously
submited request for that device have been completely handled and are
not waiting for anything in the queue of requests maintained in
generic_make_request().

An easy way to achieve this would be to use a last-in-first-out stack
instead of a queue. However this will change the order of consecutive
bios submitted by a make_request_fn, which could have unexpected consequences.
Instead we take a slightly more complex approach.
A fresh queue is created for each call to a make_request_fn. After it completes,
any bios for a different device are placed on the front of the main queue, followed
by any bios for the same device, followed by all bios that were already on
the queue before the make_request_fn was called.
This provides the depth-first approach without reordering bios on the same level.

This, by itself, it not enough to remove the deadlocks. It just makes
it possible for drivers to take the extra step required themselves.

To avoid deadlocks, drivers must never risk waiting for a request
after submitting one to generic_make_request. This includes never
allocing from a mempool twice in the one call to a make_request_fn.

A common pattern in drivers is to call bio_split() in a loop, handling
the first part and then looping around to possibly split the next part.
Instead, a driver that finds it needs to split a bio should queue
(with generic_make_request) the second part, handle the first part,
and then return. The new code in generic_make_request will ensure the
requests to underlying bios are processed first, then the second bio
that was split off. If it splits again, the same process happens. In
each case one bio will be completely handled before the next one is attempted.

With this is place, it should be possible to disable the
punt_bios_to_recover() recovery thread for many block devices, and
eventually it may be possible to remove it completely.

Tested-by: Jinpu Wang <jinpu.wang@xxxxxxxxxxxxxxxx>
Inspired-by: Lars Ellenberg <lars.ellenberg@xxxxxxxxxx>
Signed-off-by: NeilBrown <neilb@xxxxxxxx>
---
block/blk-core.c | 22 ++++++++++++++++++++++
1 file changed, 22 insertions(+)

diff --git a/block/blk-core.c b/block/blk-core.c
index b9e857f4afe8..ef55f210dd7c 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -2018,10 +2018,32 @@ blk_qc_t generic_make_request(struct bio *bio)
struct request_queue *q = bdev_get_queue(bio->bi_bdev);

if (likely(blk_queue_enter(q, false) == 0)) {
+ struct bio_list hold;
+ struct bio_list lower, same;
+
+ /* Create a fresh bio_list for all subordinate requests */
+ bio_list_init(&hold);
+ bio_list_merge(&hold, &bio_list_on_stack);
+ bio_list_init(&bio_list_on_stack);
ret = q->make_request_fn(q, bio);

blk_queue_exit(q);

+ /* sort new bios into those for a lower level
+ * and those for the same level
+ */
+ bio_list_init(&lower);
+ bio_list_init(&same);
+ while ((bio = bio_list_pop(&bio_list_on_stack)) != NULL)
+ if (q == bdev_get_queue(bio->bi_bdev))
+ bio_list_add(&same, bio);
+ else
+ bio_list_add(&lower, bio);
+ /* now assemble so we handle the lowest level first */
+ bio_list_merge(&bio_list_on_stack, &lower);
+ bio_list_merge(&bio_list_on_stack, &same);
+ bio_list_merge(&bio_list_on_stack, &hold);
+
bio = bio_list_pop(current->bio_list);
} else {
struct bio *bio_next = bio_list_pop(current->bio_list);
--
2.11.0

Attachment: signature.asc
Description: PGP signature