Re: [RFC PATCH] blk-mq: fixup RESTART when queue becomes idle

From: Jens Axboe
Date: Fri Jan 19 2018 - 10:49:11 EST

In reply to: Ming Lei: "Re: [RFC PATCH] blk-mq: fixup RESTART when queue becomes idle"
Next in thread: Ming Lei: "Re: [RFC PATCH] blk-mq: fixup RESTART when queue becomes idle"

On 1/19/18 8:40 AM, Ming Lei wrote:
>>>> Where does the dm STS_RESOURCE error usually come from - what exact
>>>> resource are we running out of?
>>>
>>> It is from blk_get_request(underlying queue), see
>>> multipath_clone_and_map().
>>
>> That's what I thought. So for a low queue depth underlying queue, it's
>> quite possible that this situation can happen. Two potential solutions
>> I see:
>>
>> 1) As described earlier in this thread, having a mechanism for being
>> notified when the scarce resource becomes available. It would not
>> be hard to tap into the existing sbitmap wait queue for that.
>>
>> 2) Have dm set BLK_MQ_F_BLOCKING and just sleep on the resource
>> allocation. I haven't read the dm code to know if this is a
>> possibility or not.
>>
>> I'd probably prefer #1. It's a classic case of trying to get the
>> request, and if it fails, add ourselves to the sbitmap tag wait
>> queue head, retry, and bail if that also fails. Connecting the
>> scarce resource and the consumer is the only way to really fix
>> this, without bogus arbitrary delays.
>
> Right, as I have replied to Bart, using mod_delayed_work_on() and
> returning BLK_STS_NO_DEV_RESOURCE (or some such name) for the scarce
> resource should fix this issue.

It'll fix the forever stall, but it won't really fix the underlying
problem, as we'll slow down the dm device by some arbitrary amount.
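
For contrast with a delayed rerun, the "try, add ourselves to the wait
queue, retry, bail" pattern from option #1 quoted above can be sketched
roughly as follows. This is a single-threaded userspace model, not
kernel code: tag_pool, waiter, pool_try_get(), pool_add_waiter(),
pool_del_waiter(), pool_put() and get_tag_or_wait() are all hypothetical
names made up for illustration, and a real implementation would hook
into the sbitmap wait queue and need proper locking and memory ordering.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace model; none of these names are kernel APIs. */
#define POOL_DEPTH 1            /* models a queue depth of one */

struct waiter {
    bool queued;                /* currently on the pool's wait list */
    int wakeups;                /* number of notifications received */
    struct waiter *next;
};

struct tag_pool {
    int free_tags;              /* the scarce resource */
    struct waiter *waiters;     /* singly linked wait list */
};

/* Non-blocking attempt to take one tag. */
static bool pool_try_get(struct tag_pool *p)
{
    if (p->free_tags == 0)
        return false;
    p->free_tags--;
    return true;
}

/* Put the consumer on the wait list (no-op if already queued). */
static void pool_add_waiter(struct tag_pool *p, struct waiter *w)
{
    if (w->queued)
        return;
    w->queued = true;
    w->next = p->waiters;
    p->waiters = w;
}

/* Take the consumer off the wait list if it is there. */
static void pool_del_waiter(struct tag_pool *p, struct waiter *w)
{
    struct waiter **pp;

    for (pp = &p->waiters; *pp; pp = &(*pp)->next) {
        if (*pp == w) {
            *pp = w->next;
            w->next = NULL;
            w->queued = false;
            return;
        }
    }
}

/* Release a tag and notify one waiter so it can rerun its queue. */
static void pool_put(struct tag_pool *p)
{
    struct waiter *w = p->waiters;

    p->free_tags++;
    if (w) {
        p->waiters = w->next;
        w->next = NULL;
        w->queued = false;
        w->wakeups++;
    }
}

/*
 * Try, register as a waiter, retry, then bail.
 * Returns true if a tag was obtained. On false, the caller stays on
 * the wait list and is notified by the next pool_put().
 */
static bool get_tag_or_wait(struct tag_pool *p, struct waiter *w)
{
    if (pool_try_get(p))
        return true;

    pool_add_waiter(p, w);

    /* Retry: a tag may have been freed before we got on the list. */
    if (pool_try_get(p)) {
        pool_del_waiter(p, w);
        return true;
    }
    return false;
}
```

The retry after registering is what closes the window where the tag is
freed between the first failed attempt and getting on the wait list.
The consumer is then rerun exactly when the resource frees up, rather
than after a fixed delay.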

A simple test case would be to have a null_blk device with a queue depth
of one, and dm on top of that. Start a fio job that runs two jobs: one
that does IO to the underlying device, and one that does IO to the dm
device. If the job on the dm device runs substantially slower than the
one to the underlying device, then the problem isn't really fixed.

That said, I'm fine with first ensuring that we always make forward
progress, and then coming up with a proper solution to the issue. The
forward progress guarantee will be needed for the rarer failure
cases, like allocation failures. nvme needs that too, for instance, for
the discard range struct allocation.

--
Jens Axboe
