Re: [PATCH v2 2/2] blk-mq: Rerun dispatching in the case of budget contention

From: Ming Lei
Date: Thu Apr 02 2020 - 21:34:27 EST


On Thu, Apr 02, 2020 at 08:51:30AM -0700, Douglas Anderson wrote:
> It is possible for two threads to be running
> blk_mq_do_dispatch_sched() at the same time with the same "hctx".
> This is because there can be more than one caller to
> __blk_mq_run_hw_queue() with the same "hctx" and hctx_lock() doesn't
> prevent more than one thread from entering.
>
> If more than one thread is running blk_mq_do_dispatch_sched() at the
> same time with the same "hctx", they may have contention acquiring
> budget. The blk_mq_get_dispatch_budget() can eventually translate
> into scsi_mq_get_budget(). If the device's "queue_depth" is 1 (not
> uncommon) then only one of the two threads will be the one to
> increment "device_busy" to 1 and get the budget.
>
> The losing thread will break out of blk_mq_do_dispatch_sched() and
> will stop dispatching requests. The assumption is that when more
> budget is available later (when existing transactions finish) the
> queue will be kicked again, perhaps in scsi_end_request().
>
> The winning thread now has budget and can go on to call
> dispatch_request(). If dispatch_request() returns NULL here then we
> have a potential problem. Specifically we'll now call

As I mentioned before, it is a BFQ specific issue, it tells blk-mq
that it has work to do, and now the budget is assigned to the only
winning thread, however, dispatch_request() still returns NULL.

> blk_mq_put_dispatch_budget() which translates into
> scsi_mq_put_budget(). That will mark the device as no longer busy but
> doesn't do anything to kick the queue. This violates the assumption
> that the queue would be kicked when more budget was available.

The queue is still kicked in by BFQ in its idle timer, however that
timer doesn't make forward progress.

Without this idle timer, it is simply a BFQ issue.

>
> Pictorially:
>
> Thread A Thread B
> ================================= ==================================
> blk_mq_get_dispatch_budget() => 1
> dispatch_request() => NULL
> blk_mq_get_dispatch_budget() => 0
> // because Thread A marked
> // "device_busy" in scsi_device
> blk_mq_put_dispatch_budget()

What if there is only thread A? You need to mention that thread B
is from BFQ.

>
> The above case was observed in reboot tests and caused a task to hang
> forever waiting for IO to complete. Traces showed that in fact two
> tasks were running blk_mq_do_dispatch_sched() at the same time with
> the same "hctx". The task that got the budget did in fact see
> dispatch_request() return NULL. Both tasks returned and the system
> went on for several minutes (until the hung task delay kicked in)
> without the given "hctx" showing up again in traces.
>
> Let's attempt to fix this problem by detecting if there was contention
> for the budget in the case where we put the budget without dispatching
> anything. If we saw contention we kick all hctx's associated with the
> queue where there was contention. We do this without any locking by
> adding a double-check for budget and accepting a small amount of faux
> contention if the 2nd check gives us budget but then we don't dispatch
> anything (we'll look like we contended with ourselves).
>
> A few extra notes:
>
> - This whole thing is only a problem due to the inexact nature of
> has_work(). Specifically if has_work() always guaranteed that a
> "true" return meant that dispatch_request() would return non-NULL
> then we wouldn't have this problem. That's because we only grab the
> budget if has_work() returned true. If we had the non-NULL
> guarantee then at least one of the threads would actually dispatch
> work and when the work was done then queues would be kicked
> normally.
>
> - One specific I/O scheduler that trips this problem quite a bit is
> BFQ which definitely returns "true" for has_work() in cases where it
> wouldn't dispatch. Making BFQ's has_work() more exact requires that
> has_work() becomes a much heavier function, including figuring out
> how to acquire spinlocks in has_work() without tripping circular
> lock dependencies. This is prototyped but it's unclear if it's
> really the way to go when the problem can be solved with a
> relatively lightweight contention detection mechanism.
>
> - Because this problem only trips with inexact has_work() it's
> believed that blk_mq_do_dispatch_ctx() does not need a similar
> change.

Right, I prefer to fix it in BFQ, given it isn't a generic issue,
not worth of generic solution.

>
> Signed-off-by: Douglas Anderson <dianders@xxxxxxxxxxxx>
> ---
>
> Changes in v2:
> - Replace ("scsi: core: Fix stall...") w/ ("blk-mq: Rerun dispatch...")
>
> block/blk-mq-sched.c | 26 ++++++++++++++++++++++++--
> block/blk-mq.c | 3 +++
> include/linux/blkdev.h | 2 ++
> 3 files changed, 29 insertions(+), 2 deletions(-)
>
> diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
> index 74cedea56034..0195d75f5f96 100644
> --- a/block/blk-mq-sched.c
> +++ b/block/blk-mq-sched.c
> @@ -97,12 +97,34 @@ static void blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx)
> if (e->type->ops.has_work && !e->type->ops.has_work(hctx))
> break;
>
> - if (!blk_mq_get_dispatch_budget(hctx))
> - break;
> +
> + if (!blk_mq_get_dispatch_budget(hctx)) {
> + /*
> + * We didn't get budget so set contention. If
> + * someone else had the budget but didn't dispatch
> + * they'll kick everything. NOTE: we check one
> + * extra time _after_ setting contention to fully
> + * close the race. If we don't actually dispatch
> + * we'll detext faux contention (with ourselves)
> + * but that should be rare.
> + */
> + atomic_set(&q->budget_contention, 1);
> +
> + if (!blk_mq_get_dispatch_budget(hctx))

scsi_mq_get_budget() implies a smp_mb(), so the barrier can order
between blk_mq_get_dispatch_budget() and atomic_set(&q->budget_contention, 0|1).

> + break;
> + }
>
> rq = e->type->ops.dispatch_request(hctx);
> if (!rq) {
> blk_mq_put_dispatch_budget(hctx);
> +
> + /*
> + * We've released the budget but us holding it might
> + * have prevented someone else from dispatching.
> + * Detect that case and run all queues again.
> + */
> + if (atomic_read(&q->budget_contention))

scsi_mq_put_budget() doesn't imply smp_mb(), so one smp_mb__after_atomic()
is needed between the above two op.

> + blk_mq_run_hw_queues(q, true);
> break;
> }
>
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 2cd8d2b49ff4..6163c43ceca5 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -1528,6 +1528,9 @@ void blk_mq_run_hw_queues(struct request_queue *q, bool async)
> struct blk_mq_hw_ctx *hctx;
> int i;
>
> + /* We're running the queues, so clear the contention detector */
> + atomic_set(&q->budget_contention, 0);
> +

You add extra cost in fast path.

> queue_for_each_hw_ctx(q, hctx, i) {
> if (blk_mq_hctx_stopped(hctx))
> continue;
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index f629d40c645c..07f21e45d993 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -583,6 +583,8 @@ struct request_queue {
>
> #define BLK_MAX_WRITE_HINTS 5
> u64 write_hints[BLK_MAX_WRITE_HINTS];
> +
> + atomic_t budget_contention;

It needn't to be a atomic variable, and simple 'unsigned'
int should be fine, what matters is that the order between
R/W this flag and get/put budget.


thanks,
Ming