Re: [PATCH v2 1/1] block: fix blk_queue_split() resource exhaustion

From: NeilBrown
Date: Fri Jan 06 2017 - 18:01:39 EST


On Sat, Jan 07 2017, Mike Snitzer wrote:

> On Fri, Jan 06 2017 at 12:34pm -0500,
> Mikulas Patocka <mpatocka@xxxxxxxxxx> wrote:
>
>>
>>
>> On Fri, 6 Jan 2017, Mikulas Patocka wrote:
>>
>> >
>> >
>> > On Wed, 4 Jan 2017, Mike Snitzer wrote:
>> >
>> > > On Wed, Jan 04 2017 at 12:12am -0500,
>> > > NeilBrown <neilb@xxxxxxxx> wrote:
>> > >
>> > > > > Suggested-by: NeilBrown <neilb@xxxxxxxx>
>> > > > > Signed-off-by: Jack Wang <jinpu.wang@xxxxxxxxxxxxxxxx>
>> > > > > ---
>> > > > > block/blk-core.c | 20 ++++++++++++++++++++
>> > > > > 1 file changed, 20 insertions(+)
>> > > > >
>> > > > > diff --git a/block/blk-core.c b/block/blk-core.c
>> > > > > index 9e3ac56..47ef373 100644
>> > > > > --- a/block/blk-core.c
>> > > > > +++ b/block/blk-core.c
>> > > > > @@ -2138,10 +2138,30 @@ blk_qc_t generic_make_request(struct bio *bio)
>> > > > > struct request_queue *q = bdev_get_queue(bio->bi_bdev);
>> > > > >
>> > > > > if (likely(blk_queue_enter(q, __GFP_DIRECT_RECLAIM) == 0)) {
>> > > > > + struct bio_list lower, same, hold;
>> > > > > +
>> > > > > + /* Create a fresh bio_list for all subordinate requests */
>> > > > > + bio_list_init(&hold);
>> > > > > + bio_list_merge(&hold, &bio_list_on_stack);
>> > > > > + bio_list_init(&bio_list_on_stack);
>> > > > >
>> > > > > ret = q->make_request_fn(q, bio);
>> > > > >
>> > > > > blk_queue_exit(q);
>> > > > > + /* sort new bios into those for a lower level
>> > > > > + * and those for the same level
>> > > > > + */
>> > > > > + bio_list_init(&lower);
>> > > > > + bio_list_init(&same);
>> > > > > + while ((bio = bio_list_pop(&bio_list_on_stack)) != NULL)
>> > > > > + if (q == bdev_get_queue(bio->bi_bdev))
>> > > > > + bio_list_add(&same, bio);
>> > > > > + else
>> > > > > + bio_list_add(&lower, bio);
>> > > > > + /* now assemble so we handle the lowest level first */
>> > > > > + bio_list_merge(&bio_list_on_stack, &lower);
>> > > > > + bio_list_merge(&bio_list_on_stack, &same);
>> > > > > + bio_list_merge(&bio_list_on_stack, &hold);
>> > > > >
>> > > > > bio = bio_list_pop(current->bio_list);
>> > > > > } else {
>> > > > > --
>> > > > > 2.7.4
>> > >
>> > > Mikulas, would you be willing to try the below patch with the
>> > > dm-snapshot deadlock scenario and report back on whether it fixes that?
>> > >
>> > > Patch below looks to be the same as here:
>> > > https://marc.info/?l=linux-raid&m=148232453107685&q=p3
>> > >
>> > > Neil and/or others if that isn't the patch that should be tested please
>> > > provide a pointer to the latest.
>> > >
>> > > Thanks,
>> > > Mike
>> >
>> > The bad news is that this doesn't fix the snapshot deadlock.
>> >
>> > I created a test program for the snapshot deadlock bug (it was originally
>> > created years ago to test for a different bug, so it contains some cruft).
>> > You also need to insert "if (ci->sector_count) msleep(100);" to the end of
>> > __split_and_process_non_flush to make the kernel sleep when splitting the
>> > bio.
>> >
>> > And with the above patch, the snapshot deadlock bug still happens.
>
> That is really unfortunate. Would be useful to dig in and understand
> why. Because ordering of the IO in generic_make_request() really should
> take care of it.
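The depth-first ordering the patch is trying to impose can be modeled with a
small stand-alone sketch (Python here, purely illustrative). The tuple
"bios", the CHILDREN split table, and the queue_of()/make_request_fn()
helpers are all inventions of this model, not kernel API; they stand in for
bio_list_on_stack, make_request_fn and bdev_get_queue() in the patch above:

```python
# Stand-alone model of the bio_list reordering in the quoted patch.
# A "bio" is a (queue, name) tuple; CHILDREN is a hypothetical split
# table standing in for make_request_fn.
from collections import deque

CHILDREN = {
    ("dm", "A"):  [("dm", "A2"), ("sd", "A1")],  # dm splits A; A1 goes to a lower device
    ("dm", "A2"): [("sd", "A2a")],
}

def queue_of(bio):
    return bio[0]

def make_request_fn(bio, out):
    # The driver may emit child bios onto the current submission list.
    out.extend(CHILDREN.get(bio, []))

def generic_make_request(bio):
    """Submit 'bio' and everything it spawns, lowest level first."""
    on_stack = deque()
    order = []                        # submission order, for inspection
    while bio is not None:
        q = queue_of(bio)
        hold = on_stack               # bios queued before this call
        on_stack = deque()            # fresh list for subordinate bios
        make_request_fn(bio, on_stack)
        order.append(bio)
        # sort new bios into those for a lower level and those for the
        # same level, then reassemble so the lowest level goes first
        lower = deque(b for b in on_stack if queue_of(b) != q)
        same = deque(b for b in on_stack if queue_of(b) == q)
        on_stack = lower + same + hold
        bio = on_stack.popleft() if on_stack else None
    return order
```

Running generic_make_request(("dm", "A")) submits A, then the lower-level
A1 before its same-level sibling A2, so bios already sent down to "sd" are
handled (and their resources freed) before the dm level is re-entered.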

I *think* you might be able to resolve this by changing
__split_and_process_bio() to only ever perform a single split, with no
looping.
i.e. if the bio is too big to handle directly, split off the front into
a new bio and bio_chain() it to the original. The original then has
bio_advance() called to step over the front, and is passed to
generic_make_request() so it is queued.
The code then proceeds to __clone_and_map_data_bio() on the front that
was split off.
When that completes it *doesn't* loop around, but returns into
generic_make_request(), which does the looping and makes sure to handle
the lowest-level bio next.

something vaguely like this:

diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 3086da5664f3..06ee0960e415 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -1216,6 +1216,14 @@ static int __split_and_process_non_flush(struct clone_info *ci)

len = min_t(sector_t, max_io_len(ci->sector, ti), ci->sector_count);

+ if (len < ci->sector_count) {
+ struct bio *split = bio_split(bio, len, GFP_NOIO, fs_bio_set);
+ bio_chain(split, bio);
+ generic_make_request(bio);
+ bio = split;
+ ci->sector_count = len;
+ }
+
r = __clone_and_map_data_bio(ci, ti, ci->sector, &len);
if (r < 0)
return r;

though I haven't tested it, and the change (if it works) should probably
be integrated more fully into the surrounding code.

You probably don't want to use "fs_bio_set" either - a target-local
pool would be best.

NeilBrown
