Re: blk-mq: takes hours for scsi scanning finish when thousands of LUNs

From: Jens Axboe
Date: Thu Oct 22 2015 - 12:06:22 EST


On 10/22/2015 09:53 AM, Jeff Moyer wrote:
> Jens Axboe <axboe@xxxxxxxxx> writes:
>
>>> I agree with optimizing hot paths via the cheaper percpu operation,
>>> but how much does it affect the performance?
>>
>> A lot, since the queue referencing happens twice per IO. The switch to
>> percpu was done to use shared/common code for this; the previous
>> version was a hand-rolled variant of it.
>>
>>> As you know, the switching causes delay, and as the number of LUNs
>>> increases the delay grows, so do you have any idea about the problem?
>>
>> Tejun already outlined a good solution to the problem:
>>
>> "If percpu freezing is
>> happening during that, the right solution is moving finish_init to
>> late enough point so that percpu switching happens only after it's
>> known that the queue won't be abandoned."
>
> I'm sure I'm missing something, but I don't think that will work.
> blk_mq_update_tag_depth is freezing every single queue. Those queues
> are already set up and will not go away. So how will moving finish_init
> later in the queue setup fix this? The patch Jason provided most likely
> works because __percpu_ref_switch_to_atomic doesn't do anything. The
> most important things it doesn't do are:
> 1) percpu_ref_get(mq_usage_counter), followed by
> 2) call_rcu_sched()
>
> It seems likely to me that forcing an rcu grace period for every single
> LUN attached to a particular host is what's causing the delay.
>
> And now you'll tell me how I've got that all wrong. ;-)

Haha, no, I think that is absolutely right. We've seen this class of bug
before: thousands of serialized rcu grace period waits, and this is just
one more instance. The patch Jason sent simply bypassed the percpu
switch, which we can't do.

> Anyway, I think what Jason had initially suggested would work:
>
> "if this thing must be done, as the code below shows just changing
> flags depending on 'shared' variable why shouldn't we store the
> previous result of 'shared' and compare with the current result, if
> it's unchanged, nothing will be done and avoid looping all queues in
> list."
>
> I think that percolating BLK_MQ_F_TAG_SHARED up to the tag set would
> allow newly created hctxs to simply inherit the shared state (in
> blk_mq_init_hctx), and you won't need to freeze every queue in order to
> guarantee that.
>
> I was writing a patch to that effect. I've now stopped as I want to
> make sure I'm not off in the weeds. :)

If that is where the delay comes from, then yes, that should fix it,
and it would be a trivial patch.

--
Jens Axboe
