Re: [PATCH] blk-mq: rationalize plug

From: Shaohua Li
Date: Wed Apr 29 2015 - 17:09:41 EST


On Mon, Apr 20, 2015 at 01:34:24PM -0400, Jeff Moyer wrote:
> Shaohua Li <shli@xxxxxx> writes:
>
> > Previous post of the patch is lost, so I repost. This is helpful, for
> > example, using scsi-mq for a sata drive.
> >
> > plug is still helpful for workload with IO merge, but it can be harmful
> > otherwise especially with multiple hardware queues, as there is (supposed) no
> > lock contention in this case and plug can introduce latency.
> >
> > For single queue, we always do plug. Reducing lock contention is still a win.
> > For multiple queues, we do a limited plug, eg plug only if there is merge. If a
> > request doesn't have merge with following request, the requet will be
> > dispatched immediately.
> >
> > An example workload here is fsync write a block device. Without plug
> > merge, sequential write (fsync makes it sync IO) will dispatch 4k IO.
>
> I like the general approach. I do wonder whether we should hold back a
> single I/O at all times, or whether we should do something similar to
> what Mike Snitzer did in the dm-mpath driver, where he keeps track of
> the end position of the last I/O (without holding the I/O back) to
> determine whether we're doing sequential I/O. With your approach, we
> will be able to get merges from the very start of a sequential I/O
> stream. With Mike's approach, you don't get merges until the second and
> third I/O. Mike's way you get potentially better latency for random I/O
> at the cost of some extra fields in a data structure. Something to
> think about.

Sorry for the later reply, been busy these days.

Not sure about this. If latency is sensitive, the caller really should
flush plug immediately.

> I have specfic comments inlined below. I think this patch would be
> better layered on top of my fix for single queue merging, but the two
> are basically independent, so I could go either way on that.
>
> > @@ -1230,8 +1269,12 @@ static void blk_mq_make_request(struct request_queue *q, struct bio *bio)
> > {
> > const int is_sync = rw_is_sync(bio->bi_rw);
> > const int is_flush_fua = bio->bi_rw & (REQ_FLUSH | REQ_FUA);
> > + unsigned int use_plug, request_count = 0;
> > struct blk_map_ctx data;
> > struct request *rq;
> > + struct blk_plug *plug;
> > +
> > + use_plug = !is_flush_fua;
>
> We can get rid of the use_plug variable. If is_flush_fua is set, then
> we'll never get to the point where it is tested. The patched code looks
> like this:
>
> use_plug = !is_flush_fua;
> ...
> if (unlikely(is_flush_fua)) {
> blk_mq_bio_to_request(rq, bio);
> blk_insert_flush(rq);
> goto run_queue;
> }
>
> See? Here we just skip out to run the queue. The check comes later:
>
> /*
> * we do limited pluging. If bio can be merged, do merge. Otherwise the
> * existing request in the plug list will be issued. So the plug list
> * will have one request at most
> */
> plug = current->plug;
> if (use_plug && plug) {
>
> So that if statement could be just 'if (plug)'.

Yep, this is more clear.

> > blk_queue_bounce(q, &bio);
> >
> > @@ -1240,6 +1283,10 @@ static void blk_mq_make_request(struct request_queue *q, struct bio *bio)
> > return;
> > }
> >
> > + if (use_plug && !blk_queue_nomerges(q) &&
> > + blk_attempt_plug_merge(q, bio, &request_count))
> > + return;
> > +
>
> You've just added merging for flush/fua requets. Was that intentional?
> We don't attempt them in the sq case, and we should probably be
> consistent.

FUA/FLUSH is in the unmerge bio flag list, so the merge will not happen.
But I agree checking it is faster.

> > rq = blk_mq_map_request(q, bio, &data);
>
> > if (unlikely(!rq))
> > return;
> > @@ -1251,37 +1298,34 @@ static void blk_mq_make_request(struct request_queue *q, struct bio *bio)
> > }
> >
> > /*
> > - * If the driver supports defer issued based on 'last', then
> > - * queue it up like normal since we can potentially save some
> > - * CPU this way.
> > + * we do limited pluging. If bio can be merged, do merge. Otherwise the
> > + * existing request in the plug list will be issued. So the plug list
> > + * will have one request at most
> > */
> > - if (is_sync && !(data.hctx->flags & BLK_MQ_F_DEFER_ISSUE)) {
> > - struct blk_mq_queue_data bd = {
> > - .rq = rq,
> > - .list = NULL,
> > - .last = 1
> > - };
> > - int ret;
> > + plug = current->plug;
> > + if (use_plug && plug) {
> > + struct request *old_rq = NULL;
> >
> > blk_mq_bio_to_request(rq, bio);
> > + if (!list_empty(&plug->mq_list)) {
> > + old_rq = list_first_entry(&plug->mq_list,
> > + struct request, queuelist);
> > + list_del_init(&old_rq->queuelist);
> > + }
> > + list_add_tail(&rq->queuelist, &plug->mq_list);
> > + blk_mq_put_ctx(data.ctx);
> > + if (!old_rq)
> > + return;
> > + if (!blk_mq_direct_issue_request(old_rq))
> > + return;
> > + blk_mq_insert_request(old_rq, false, true, true);
> > + return;
> > + }
> >
> > - /*
> > - * For OK queue, we are done. For error, kill it. Any other
> > - * error (busy), just add it to our list as we previously
> > - * would have done
> > - */
> > - ret = q->mq_ops->queue_rq(data.hctx, &bd);
> > - if (ret == BLK_MQ_RQ_QUEUE_OK)
> > + if (is_sync && !(data.hctx->flags & BLK_MQ_F_DEFER_ISSUE)) {
> > + blk_mq_bio_to_request(rq, bio);
> > + if (!blk_mq_direct_issue_request(rq))
>
> I think the BLK_MQ_F_DEFER_ISSUE suport isn't broken by this patch, but
> it's not entirely clear to me. You've essentially split out that
> particular code path into two different branches, where before it was
> much easier to follow. Prior to your patch, if the flag was set, we
> skip to the blk_mq_merge_queue_io call, bypassing the direct queue_rq.
> After your patch, blk_mq_insert_request might be called, or
> blk_mq_merge_queue_io might be. Like I said, I *think* the end result
> is correct, but I'm not completely sure. Jens, do you want to check
> that?

I'll try to make it more clear in next post.

> > @@ -1314,7 +1358,7 @@ static void blk_sq_make_request(struct request_queue *q, struct bio *bio)
> > * If we have multiple hardware queues, just go directly to
> > * one of those for sync IO.
> > */
> > - use_plug = !is_flush_fua && !is_sync;
> > + use_plug = !is_flush_fua;
> >
> > blk_queue_bounce(q, &bio);
>
> For this part I'd rather use my patch. Would you mind rebasing your
> patch on top of the sq plugging patch I submitted?

I'm ok, your patch looks better. I'll post some plug related patches out
soon and add your patch in them if you don't mind. If you'd like to post
by yourself, you can add my review in your patch:

Reviewed-by: Shaohua Li <shli@xxxxxx>

> One final question, sort of related to this patch. At the end of
> blk_mq_make_request, we have this:
>
> run_queue:
> blk_mq_run_hw_queue(data.hctx, !is_sync || is_flush_fua);
>
> So, we don't run the queue synchronously for flush/fua. Is that logic
> right? Why don't we want to push that out immediately?

I'm not sure. My theory is delaying flush/fua can increase the chance
the flush logic merges multiple flush into one flush, please see the
logic of flush handling. But this is just my thought, might be
completely wrong.

Thanks,
Shaohua
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/