Re: [PATCH v2 2/2] blk-mq: Rerun dispatching in the case of budget contention

From: Doug Anderson
Date: Fri Apr 03 2020 - 11:50:13 EST


Hi,

On Fri, Apr 3, 2020 at 8:10 AM Doug Anderson <dianders@xxxxxxxxxxxx> wrote:
>
> Correct that it only happens with BFQ, but whether it's a BFQ bug or
> not just depends on how you define the has_work() API. If has_work()
> is allowed to be in-exact then it's either a blk-mq bug or a SCSI bug
> depending on how you cut it. If has_work() must be exact then it is
> certainly a BFQ bug. If has_work() doesn't need to be exact then it's
> not a BFQ bug. I believe that a sane API could be defined either way.
> Either has_work() can be defined as a lightweight hint to trigger
> heavier code or it can be defined as something exact. It's really up
> to blk-mq to say how they define it.
>
> From his response on the SCSI patch [1], it sounded like Jens was OK
> with has_work() being a lightweight hint as long as BFQ ensures that
> the queues run later. ...but, as my investigation found, I believe
> that BFQ _does_ try to ensure that the queue is run at a later time by
> calling blk_mq_run_hw_queues(). The irony is that due to the race
> we're talking about here, blk_mq_run_hw_queues() isn't guaranteed to
> be reliable if has_work() is inexact. :( One way to address this is
> to make blk_mq_run_hw_queues() reliable even if has_work() is inexact.
>
> ...so Jens: care to clarify how you'd like has_work() to be defined?

Sorry to reply so quickly after my prior reply, but I might have found
an extreme corner case where we can still run into the same race even
if has_work() is exact. This is all theoretical from code analysis.
Maybe you can poke a hole in my scenario or tell me it's so
implausible that we don't care, but it seems like it's theoretically
possible. For this example I'll assume a budget of 1 (AKA only one
thread can get budget for a given queue):

* Threads A and B both run has_work() at the same time with the same
"hctx". has_work() is exact but there's no lock, so it's OK if
Thread A and B both get back true.

* Thread B gets interrupted for a long time right after it decides
that there is work. Maybe its CPU gets an interrupt and the
interrupt handler is slow.

* Thread A runs, get budget, dispatches work.

* Thread A's work finishes and budget is released.

* Thread B finally runs again and gets budget.

* Since Thread A already took care of the work and no new work has
come in, Thread B will get NULL from dispatch_request(). I believe
this is specifically why dispatch_request() is allowed to return
NULL in the first place if has_work() must be exact.

* Thread B will now be holding the budget and is about to call
put_budget(), but hasn't called it yet.

* Thread B gets interrupted for a long time (again). Dang interrupts.

* Now Thread C (with a different "hctx" but the same queue) comes
along and runs blk_mq_do_dispatch_sched().

* Thread C won't do anything because it can't get budget.

* Finally Thread B will run again and put the budget without kicking
any queues.

Now we have a potential I/O stall because nobody will ever kick the
queues.


I think the above example could happen even on non-BFQ systems and I
think it would also be fixed by an approach like the one in my patch.

-Doug