Re: [RFC PATCH] blk-mq: fixup RESTART when queue becomes idle

From: Bart Van Assche
Date: Thu Jan 18 2018 - 17:20:30 EST


On Thu, 2018-01-18 at 17:01 -0500, Mike Snitzer wrote:
> And yet Laurence cannot reproduce any such lockups with your test...

Hmm ... maybe I misunderstood Laurence but I don't think that Laurence has
already succeeded at running an unmodified version of my tests. In one of the
e-mails Laurence sent me this morning I read that he modified these scripts
to get past a kernel module unload failure that was reported while starting
these tests. So the next step is to check which changes were made to the test
scripts and also whether the test results are still valid.

> Are you absolutely certain this patch doesn't help you?
> https://patchwork.kernel.org/patch/10174037/
>
> If it doesn't then that is actually very useful to know.

The first thing I tried this morning was to run the srp-test software against a
merge of Jens' for-next branch and your dm-4.16 branch. Since I noticed that the
dm queue locked up, I reinserted a blk_mq_delay_run_hw_queue() call in the dm
code. Since even that was not sufficient, I tried to kick the queues via debugfs
(for s in /sys/kernel/debug/block/*/state; do echo kick >$s; done). Since that
was not sufficient to resolve the queue stall either, I reverted the following
three patches that are in Jens' tree:
* "blk-mq: improve DM's blk-mq IO merging via blk_insert_cloned_request feedback"
* "blk-mq-sched: remove unused 'can_block' arg from blk_mq_sched_insert_request"
* "blk-mq: don't dispatch request in blk_mq_request_direct_issue if queue is busy"

Only after I had done this did the srp-test software run again without
triggering dm queue lockups. Sorry, but I have not yet had the time to test the
patch "[RFC] blk-mq: fixup RESTART when queue becomes idle".
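
For reference, the debugfs kick from the one-liner above can be written out as a
small shell function. This is only a sketch: the name kick_queues and the
optional base-directory argument are made up for illustration, and it assumes
that debugfs is mounted at /sys/kernel/debug and that each queue directory has a
writable "state" file that accepts "kick", as in the one-liner.

```shell
#!/bin/sh
# kick_queues [BASE]: write "kick" to every <BASE>/*/state file.
# BASE defaults to /sys/kernel/debug/block (assumes debugfs is mounted there).
# The function name and the BASE argument are illustrative, not an existing tool.
kick_queues() {
    base="${1:-/sys/kernel/debug/block}"
    for s in "$base"/*/state; do
        # If nothing matched, the glob stays unexpanded; skip it.
        [ -e "$s" ] || continue
        echo kick > "$s" || echo "failed to kick $s" >&2
    done
}
```

Calling kick_queues without an argument as root would run it against the live
queues; passing a directory makes it possible to try the loop on scratch files
first.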

> Please just focus on helping Laurence get his very capable testbed to
> reproduce this issue. Once we can reproduce these "unkillable" "stalls"
> in-house it'll be _much_ easier to analyze and fix.

OK, I will work with Laurence on this. Maybe Laurence and I should work on this
before further analyzing the lockup mentioned above?

Bart.