Re: [RFC PATCH] blk-mq: fixup RESTART when queue becomes idle
From: Mike Snitzer
Date: Thu Jan 18 2018 - 15:49:18 EST
On Thu, Jan 18 2018 at 3:11pm -0500,
Jens Axboe <axboe@xxxxxxxxx> wrote:
> On 1/18/18 11:47 AM, Bart Van Assche wrote:
> >> This is all very tiresome.
> >
> > Yes, this is tiresome. It is very annoying to me that others keep
> > introducing so many regressions in such important parts of the kernel.
> > It is also annoying to me that I get blamed if I report a regression
> > instead of seeing that the regression gets fixed.
>
> I agree, it sucks that any change there introduces the regression. I'm
> fine with doing the delay insert again until a new patch is proven to be
> better.
>
> From the original topic of this email, we have conditions that can cause
> the driver to not be able to submit an IO. A set of those conditions can
> only happen if IO is in flight, and those cases we have covered just
> fine. Another set can potentially trigger without IO being in flight.
> These are cases where a non-device resource is unavailable at the time
> of submission. This might be iommu running out of space, for instance,
> or it might be a memory allocation of some sort. For these cases, we
> don't get any notification when the shortage clears. All we can do is
> ensure that we restart operations at some point in the future. We're SOL
> at that point, but we have to ensure that we make forward progress.
>
> That last set of conditions better not be a a common occurence, since
> performance is down the toilet at that point. I don't want to introduce
> hot path code to rectify it. Have the driver return if that happens in a
> way that is DIFFERENT from needing a normal restart. The driver knows if
> this is a resource that will become available when IO completes on this
> device or not. If we get that return, we have a generic run-again delay.
>
> This basically becomes the same as doing the delay queue thing from DM,
> but just in a generic fashion.
This is a bit confusing for me (as I see it we have 2 blk-mq drivers
trying to collaborate, so your refering to "driver" lacks precision; but
I could just be missing something)...
For Bart's test the underlying scsi-mq driver is what is regularly
hitting this case in __blk_mq_try_issue_directly():
if (blk_mq_hctx_stopped(hctx) || blk_queue_quiesced(q))
It certainly better not be the norm (Bart's test hammering on this aside).
For starters, it'd be very useful to know if Bart is hitting the
blk_mq_hctx_stopped() or blk_queue_quiesced() for this case that is
triggering the use of blk_mq_sched_insert_request() -- I'd wager it is
due to blk_queue_quiesced() but Bart _please_ try to figure it out.
Anyway, in response to this case Bart would like the upper layer dm-mq
driver to blk_mq_delay_run_hw_queue(). Certainly is quite the hammer.
But that hammer aside, in general for this case, I'm concerned about: is
it really correct to allow an already stopped/quiesced underlying queue
to retain responsibility for processing the request? Or would the
upper-layer dm-mq benefit from being able to retry the request on its
terms (via a "DIFFERENT" return from blk-mq core)?
Like this? The (ab)use of BLK_STS_DM_REQUEUE certainly seems fitting in
this case but...
(Bart please note that this patch applies on linux-dm.git's 'for-next';
which is just a merge of Jens' 4.16 tree and dm-4.16)
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 74a4f237ba91..371a1b97bf56 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1781,16 +1781,11 @@ static blk_status_t __blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx,
struct request_queue *q = rq->q;
bool run_queue = true;
- /*
- * RCU or SRCU read lock is needed before checking quiesced flag.
- *
- * When queue is stopped or quiesced, ignore 'bypass_insert' from
- * blk_mq_request_direct_issue(), and return BLK_STS_OK to caller,
- * and avoid driver to try to dispatch again.
- */
+ /* RCU or SRCU read lock is needed before checking quiesced flag */
if (blk_mq_hctx_stopped(hctx) || blk_queue_quiesced(q)) {
run_queue = false;
- bypass_insert = false;
+ if (bypass_insert)
+ return BLK_STS_DM_REQUEUE;
goto insert;
}
diff --git a/drivers/md/dm-rq.c b/drivers/md/dm-rq.c
index d8519ddd7e1a..2f554ea485c3 100644
--- a/drivers/md/dm-rq.c
+++ b/drivers/md/dm-rq.c
@@ -408,7 +408,7 @@ static blk_status_t dm_dispatch_clone_request(struct request *clone, struct requ
clone->start_time = jiffies;
r = blk_insert_cloned_request(clone->q, clone);
- if (r != BLK_STS_OK && r != BLK_STS_RESOURCE)
+ if (r != BLK_STS_OK && r != BLK_STS_RESOURCE && r != BLK_STS_DM_REQUEUE)
/* must complete clone in terms of original request */
dm_complete_request(rq, r);
return r;
@@ -472,6 +472,7 @@ static void init_tio(struct dm_rq_target_io *tio, struct request *rq,
* Returns:
* DM_MAPIO_* : the request has been processed as indicated
* DM_MAPIO_REQUEUE : the original request needs to be immediately requeued
+ * DM_MAPIO_DELAY_REQUEUE : the original request needs to be requeued after delay
* < 0 : the request was completed due to failure
*/
static int map_request(struct dm_rq_target_io *tio)
@@ -500,11 +501,11 @@ static int map_request(struct dm_rq_target_io *tio)
trace_block_rq_remap(clone->q, clone, disk_devt(dm_disk(md)),
blk_rq_pos(rq));
ret = dm_dispatch_clone_request(clone, rq);
- if (ret == BLK_STS_RESOURCE) {
+ if (ret == BLK_STS_RESOURCE || ret == BLK_STS_DM_REQUEUE) {
blk_rq_unprep_clone(clone);
tio->ti->type->release_clone_rq(clone);
tio->clone = NULL;
- if (!rq->q->mq_ops)
+ if (ret == BLK_STS_DM_REQUEUE || !rq->q->mq_ops)
r = DM_MAPIO_DELAY_REQUEUE;
else
r = DM_MAPIO_REQUEUE;
@@ -741,6 +742,7 @@ static int dm_mq_init_request(struct blk_mq_tag_set *set, struct request *rq,
static blk_status_t dm_mq_queue_rq(struct blk_mq_hw_ctx *hctx,
const struct blk_mq_queue_data *bd)
{
+ int r;
struct request *rq = bd->rq;
struct dm_rq_target_io *tio = blk_mq_rq_to_pdu(rq);
struct mapped_device *md = tio->md;
@@ -768,10 +770,13 @@ static blk_status_t dm_mq_queue_rq(struct blk_mq_hw_ctx *hctx,
tio->ti = ti;
/* Direct call is fine since .queue_rq allows allocations */
- if (map_request(tio) == DM_MAPIO_REQUEUE) {
+ r = map_request(tio);
+ if (r == DM_MAPIO_REQUEUE || r == DM_MAPIO_DELAY_REQUEUE) {
/* Undo dm_start_request() before requeuing */
rq_end_stats(md, rq);
rq_completed(md, rq_data_dir(rq), false);
+ if (r == DM_MAPIO_DELAY_REQUEUE)
+ blk_mq_delay_run_hw_queue(hctx, 100/*ms*/);
return BLK_STS_RESOURCE;
}