Re: [PATCH V3] NVMe: basic conversion to blk-mq

From: Jens Axboe
Date: Thu May 29 2014 - 10:26:07 EST


On 2014-05-28 21:07, Keith Busch wrote:
> On Wed, 28 May 2014, Matias Bjørling wrote:
>> This converts the current NVMe driver to utilize the blk-mq layer.
>
> I am concerned about device hot removal since the h/w queues can be
> freed at any time. I *think* blk-mq helps with this in that the driver
> will not see a new request after calling blk_cleanup_queue. If you can
> confirm that's true and that blk-mq waits for all requests active in the
> driver to return to the block layer, then we're probably okay in this
> path. That wasn't true as a bio-based driver, which is why we are
> cautious with all the locking and barriers. But what about the IOCTL
> paths?

Barring any bugs in the code, yes, this should work. On the scsi-mq side, extensive error injection and device pulling have been done, and it seems to hold up fine now. The ioctl path would need to be audited.
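
To spell out the ordering that relies on, here is roughly the removal path a blk-mq driver gets to lean on. This is just a sketch: the function name, the namespaces list walk, and the tagset field are illustrative, not lifted from the patch.

static void nvme_remove_sketch(struct nvme_dev *dev)
{
        struct nvme_ns *ns;

        list_for_each_entry(ns, &dev->namespaces, list) {
                del_gendisk(ns->disk);
                /*
                 * The queue is marked dying, so nothing new reaches
                 * ->queue_rq(), and requests already handed to the
                 * driver are waited on before this returns. That is
                 * the guarantee the hot removal path relies on.
                 */
                blk_cleanup_queue(ns->queue);
        }
        /* only after the drain is it safe to tear the hw queues down */
        blk_mq_free_tag_set(&dev->tagset);
}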


> It also doesn't look like we're handling the case where the SGL can't
> map to a PRP.

>> +static void req_completion(struct nvme_queue *nvmeq, void *ctx,
>>  						struct nvme_completion *cqe)
>>  {
>>  	struct nvme_iod *iod = ctx;
>> -	struct bio *bio = iod->private;
>> +	struct request *req = iod->private;
>> +
>>  	u16 status = le16_to_cpup(&cqe->status) >> 1;
>>
>> -	if (unlikely(status)) {
>> -		if (!(status & NVME_SC_DNR ||
>> -				bio->bi_rw & REQ_FAILFAST_MASK) &&
>> -				(jiffies - iod->start_time) < IOD_TIMEOUT) {
>> -			if (!waitqueue_active(&nvmeq->sq_full))
>> -				add_wait_queue(&nvmeq->sq_full,
>> -						&nvmeq->sq_cong_wait);
>> -			list_add_tail(&iod->node, &nvmeq->iod_bio);
>> -			wake_up(&nvmeq->sq_full);
>> -			return;
>> -		}
>> -	}
>>  	if (iod->nents) {
>> -		dma_unmap_sg(nvmeq->q_dmadev, iod->sg, iod->nents,
>> -			bio_data_dir(bio) ? DMA_TO_DEVICE : DMA_FROM_DEVICE);
>> -		nvme_end_io_acct(bio, iod->start_time);
>> +		dma_unmap_sg(&nvmeq->dev->pci_dev->dev, iod->sg, iod->nents,
>> +			rq_data_dir(req) ? DMA_TO_DEVICE : DMA_FROM_DEVICE);
>>  	}
>>  	nvme_free_iod(nvmeq->dev, iod);
>> -	if (status)
>> -		bio_endio(bio, -EIO);
>> +
>> +	if (unlikely(status))
>> +		req->errors = -EIO;
>>  	else
>> -		bio_endio(bio, 0);
>> +		req->errors = 0;
>> +
>> +	blk_mq_complete_request(req);
>>  }

> Is blk-mq going to retry intermittently failed commands for me? It
> doesn't look like it will.

Not sure what kind of behavior you are looking for here. If you can expand on the above a bit, I'll gladly help sort it out. Only the driver really knows if a particular request should be failed hard or retried. So you'd probably have to track retry counts in the request and reinsert/end as appropriate.
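
Roughly what I mean, as a sketch only: the retries counter in the iod, NVME_MAX_RETRIES, and nvme_requeue_req() are made up for illustration, and the DMA unmap/iod free from the real completion path is left out.

static void req_completion_retry_sketch(struct nvme_queue *nvmeq, void *ctx,
					struct nvme_completion *cqe)
{
	struct nvme_iod *iod = ctx;
	struct request *req = iod->private;
	u16 status = le16_to_cpup(&cqe->status) >> 1;

	if (unlikely(status) && !(status & NVME_SC_DNR) &&
	    !(req->cmd_flags & REQ_FAILFAST_MASK) &&
	    iod->retries++ < NVME_MAX_RETRIES) {
		/* driver's call: reinsert for another attempt */
		nvme_requeue_req(nvmeq, req);
		return;
	}

	/* out of retries, or a hard error: complete it */
	req->errors = status ? -EIO : 0;
	blk_mq_complete_request(req);
}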


>> +static int nvme_submit_flush_sync(struct nvme_queue *nvmeq, struct nvme_ns *ns)
>> +{
>> +	struct request *req;
>> +	struct nvme_command cmnd;
>> +
>> +	req = blk_mq_alloc_request(ns->queue, WRITE, GFP_KERNEL, false);
>> +	if (!req)
>> +		return -ENOMEM;
>> +
>> +	nvme_setup_flush(&cmnd, ns, req->tag);
>> +	nvme_submit_sync_cmd(req, &cmnd, NULL, NVME_IO_TIMEOUT);
>> +
>> +	return 0;
>> +}

> It looks like this function above is being called from an interrupt
> context where we are already holding a spinlock. The sync command will
> try to take that same lock.

Yes, that code still looks very buggy. If the initial flush_cmd_info allocation fails, that should be retried as well, not failed hard. For the reinsert part, Matias, you'll want to look at the flush code in blk-mq and how that handles it.
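
To expand on that: if the flush capability is advertised on the namespace queue with blk_queue_flush() (assuming the device reports a volatile write cache), blk-mq sequences the flush itself and hands it to ->queue_rq() as an ordinary request, so there should be no need to build and submit a separate sync flush from completion context. Sketch only; nvme_submit_cmd() here just stands in for however the command ends up in the submission queue.

static int nvme_queue_flush_sketch(struct nvme_queue *nvmeq,
				   struct nvme_ns *ns, struct request *req)
{
	struct nvme_command cmnd;

	/* blk-mq hands us the flush as an ordinary request */
	nvme_setup_flush(&cmnd, ns, req->tag);

	/* copy into the submission queue and ring the doorbell,
	 * same as for any read/write command */
	return nvme_submit_cmd(nvmeq, &cmnd);
}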

>> +	if ((req->cmd_flags & REQ_FLUSH) && psegs) {
>> +		struct flush_cmd_info *flush_cmd = kmalloc(
>> +				sizeof(struct flush_cmd_info), GFP_KERNEL);

> The comment above says "may not sleep", but this is using GFP_KERNEL. I
> actually think it is safe to sleep, though, since you're not taking a
> lock until later, so maybe you can change all the allocs to GFP_KERNEL?

It might be safe, and it might not be; it depends on the CPU-to-queue mapping. Preemption _could_ be off here (if needed), so GFP_KERNEL cannot be used. Additionally, see the above comment on what to do on alloc failure.
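
Concretely, if an allocation has to stay in the submission path at all, it needs to be atomic, and a failure should push back rather than error the request. A sketch reusing the flush_cmd_info bits from the patch; BLK_MQ_RQ_QUEUE_BUSY is the blk-mq way of asking for the request to be retried later.

	if ((req->cmd_flags & REQ_FLUSH) && psegs) {
		struct flush_cmd_info *flush_cmd;

		/* ->queue_rq() may run with preemption disabled */
		flush_cmd = kmalloc(sizeof(*flush_cmd), GFP_ATOMIC);
		if (!flush_cmd)
			/* push back; blk-mq will retry the request */
			return BLK_MQ_RQ_QUEUE_BUSY;

		/* ... rest of the flush setup as in the patch ... */
	}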

--
Jens Axboe
