Re: [RFC] block: fix blk_queue_split() resource exhaustion

From: Ming Lei
Date: Sat Jul 02 2016 - 06:03:34 EST


On Tue, Jun 28, 2016 at 4:45 PM, Lars Ellenberg
<lars.ellenberg@xxxxxxxxxx> wrote:
> On Sat, Jun 25, 2016 at 05:30:29PM +0800, Ming Lei wrote:
>> On Fri, Jun 24, 2016 at 10:27 PM, Lars Ellenberg
>> <lars.ellenberg@xxxxxxxxxx> wrote:
>> > On Fri, Jun 24, 2016 at 07:36:57PM +0800, Ming Lei wrote:
>> >> >
>> >> > This is not a theoretical problem.
>> >> > At least in DRBD, with an unfortunately high IO concurrency wrt. the
>> >> > "max-buffers" setting, without this patch we have a reproducible deadlock.
>> >>
>> >> Is there any log about the deadlock? And is there any lockdep warning
>> >> if it is enabled?
>> >
>> > In DRBD, to avoid potentially very long internal queues as we wait for
>> > our replication peer device and local backend, we limit the number of
>> > in-flight bios we accept, and block in our ->make_request_fn() if that
>> > number exceeds a configured watermark ("max-buffers").
>> >
>> > Works fine, as long as we could assume that once our make_request_fn()
>> > returns, any bios we "recursively" submitted against the local backend
>> > would be dispatched. Which used to be the case.
>> >
>> > Commit 54efd50 ("block: make generic_make_request handle arbitrarily sized bios")
>> > introduced blk_queue_split(), which will split any bio that is
>> > violating the queue limits into a smaller piece to be processed
>> > right away, and queue the "rest" on current->bio_list to be
>> > processed by the iteration in generic_make_request() once the
>> > current ->make_request_fn() returns.
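
For reference, the splitting boils down to roughly this pattern (a simplified
sketch from my reading of block/blk-merge.c; discard/write-same handling and
segment accounting are omitted):

    void blk_queue_split(struct request_queue *q, struct bio **bio,
                         struct bio_set *bs)
    {
            struct bio *split;
            unsigned nsegs;

            /* find a split point that fits q's limits (simplified) */
            split = blk_bio_segment_split(q, *bio, bs, &nsegs);
            if (split) {
                    bio_chain(split, *bio);
                    /*
                     * Submit the remainder.  Since we are inside a
                     * ->make_request_fn(), generic_make_request() only
                     * adds it to current->bio_list here; it is not
                     * processed until the current call unwinds.
                     */
                    generic_make_request(*bio);
                    /* the caller goes on with the piece that fits */
                    *bio = split;
            }
    }
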
>> >
>> > Before that, any user was supposed to go through bio_add_page(),
>> > which would call our merge bvec function, and that should already
>> > be sufficient to not violate queue limits.
>> >
>> > Previously, the next bio in line to be processed would typically
>> > be the cloned one we passed down to our backend device, which in
>> > case of further stacking devices (e.g. logical volume, raid1 or
>> > similar) may again push further bios on that list.
>> > All of which would simply be processed in turn.
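
The reason those queued bios have to wait for the current ->make_request_fn()
to return is the on-stack bio_list iteration in generic_make_request();
stripped of the checks, queue-enter and error paths, it is roughly:

    blk_qc_t generic_make_request(struct bio *bio)
    {
            struct bio_list bio_list_on_stack;
            blk_qc_t ret = BLK_QC_T_NONE;

            if (current->bio_list) {
                    /*
                     * We are already inside some ->make_request_fn():
                     * just queue the bio; the outermost caller's loop
                     * below will pick it up later.
                     */
                    bio_list_add(current->bio_list, bio);
                    return ret;
            }

            /* outermost call: turn recursion into iteration */
            bio_list_init(&bio_list_on_stack);
            current->bio_list = &bio_list_on_stack;
            do {
                    struct request_queue *q = bdev_get_queue(bio->bi_bdev);

                    ret = q->make_request_fn(q, bio);
                    bio = bio_list_pop(current->bio_list);
            } while (bio);
            current->bio_list = NULL;
            return ret;
    }
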
>>
>> Could you clarify if the issue is triggered in drbd without dm/md involved?
>> Or is the issue always triggered with dm/md over drbd?
>>
>> As Mike mentioned, there is one known issue with dm-snapshot.
>
>
> The issue can always be triggered, even if only DRBD is involved.
>
>> If your ->make_request_fn() is called several times, each time
>> you should dispatch at least one bio returned from blk_queue_split(),
>> so I don't understand why not even one bio gets dispatched.
>
> I'll try to "visualize" the mechanics of "my" deadlock here.
>
> Just to clarify the effect, assume we had a driver that
> for $reasons would limit the number of in-flight IO to
> one single bio.
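
Such a driver's ->make_request_fn() could look like the sketch below
(hypothetical code, not drbd; single_in_flight/sif_* and backing_bdev are
made-up names, and error handling is omitted):

    #include <linux/bio.h>
    #include <linux/blkdev.h>
    #include <linux/wait.h>

    static DECLARE_WAIT_QUEUE_HEAD(sif_wait);
    static atomic_t sif_in_flight = ATOMIC_INIT(0);
    static struct block_device *backing_bdev;  /* assumed, set up elsewhere */

    static void sif_endio(struct bio *clone)
    {
            struct bio *orig = clone->bi_private;

            bio_put(clone);
            bio_endio(orig);
            atomic_dec(&sif_in_flight);
            wake_up(&sif_wait);
    }

    static blk_qc_t sif_make_request(struct request_queue *q, struct bio *bio)
    {
            struct bio *clone;

            blk_queue_split(q, &bio, q->bio_split);

            /* block until we may have one (and only one) bio in flight */
            wait_event(sif_wait, atomic_add_unless(&sif_in_flight, 1, 1));

            clone = bio_clone(bio, GFP_NOIO);
            clone->bi_bdev = backing_bdev;     /* remap to the backend */
            clone->bi_private = bio;
            clone->bi_end_io = sif_endio;

            /*
             * Inside a ->make_request_fn() this only queues the clone on
             * current->bio_list; it is dispatched after we return.
             */
            generic_make_request(clone);
            return BLK_QC_T_NONE;
    }
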
>
> === before blk_queue_split()
>
> Previously, if something would want to read some range of blocks,
> it would allocate a bio, call bio_add_page() a number of times,
> and once the bio was "full", call generic_make_request(),
> and fill the next bio, then submit that one.
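
I.e. the classic pattern, roughly (a sketch; variables like bdev, sector and
pages[] are assumed; bio_add_page() used to consult the merge_bvec_fn, so the
bio could never grow beyond what the stack could handle):

    struct bio *bio = bio_alloc(GFP_NOIO, nr_pages);
    int i;

    bio->bi_bdev = bdev;
    bio->bi_iter.bi_sector = sector;
    for (i = 0; i < nr_pages; i++) {
            /* stops adding once the queue limits would be violated */
            if (!bio_add_page(bio, pages[i], PAGE_SIZE, 0))
                    break;
    }
    generic_make_request(bio);
    /* ... then allocate and fill the next bio for the rest ... */
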
>
> Stacking: "single_in_flight" (q1) -> "sda" (q2)
>
> generic_make_request(bio)              current->bio_list    in-flight
> B_orig_1                               NULL                 0
>   q1->make_request_fn(B_orig_1)        empty
>                                                             1
>     recursive call, queues:            B_1_remapped
>   iterative call:
>     q2->make_request_fn(B_1_remapped)  empty
>       -> actually dispatched to hardware
> return of generic_make_request(B_orig_1).
> B_orig_2
>   q1->make_request_fn(B_orig_2)
>                                                             1
>     blocks, waits for in-flight to drop
>     ...
>     completion of B_orig_1                                  0
>
>     recursive call, queues:            B_2_remapped
>   iterative call:
>     q2->make_request_fn(B_2_remapped)  empty
>       -> actually dispatched to hardware
> return of generic_make_request(B_orig_2).
>
>
> === with blk_queue_split()
>
> Now, upper layers build one large bio.
>
> generic_make_request(bio)              current->bio_list    in-flight
> B_orig                                 NULL                 0
>   q1->make_request_fn(B_orig)          empty
>     blk_queue_split(B_orig) splits into
>     B_orig_s1 and                      B_orig_r1
>                                                             1
>     recursive call, queues:            B_orig_r1, B_s1_remapped
>   iterative call:
>     q1->make_request_fn(B_orig_r1)     B_s1_remapped
>       blocks, waits for in-flight to drop
>       ...
>       which never happens,
>       because B_s1_remapped has not even been submitted to q2 yet,
>       let alone dispatched to hardware.
>
>
> Obviously we don't limit ourselves to just one request, but with larger
> incoming bios, with the number of times we split them, with the number
> of stacking layers below us, or even layers below us that *also*
> call blk_queue_split (or equivalent open coded clone and split)
> themselves to split even further, things get worse.

Thanks for such a detailed explanation!

Now I believe the issue really comes from blk_queue_split(), and I will comment
on your patch later.

thanks,