Re: [PATCH v2 1/1] block: nvme-core: Scatter gather list support in the NVMe Block Driver

From: Matthew Wilcox
Date: Tue Dec 17 2013 - 10:08:31 EST


On Tue, Dec 17, 2013 at 05:55:14PM +0530, Rajiv Shanmugam Madeswaran wrote:
> Hi Matthew, thanks for your suggestion. I have made the following changes in the block driver.
>
> 1. With SGLs I could see a slight increase in IOPS compared to PRPs. For example, consider a scatter-gather
> list entry that points to a contiguous segment whose size is a multiple of 4k.
> With PRPs this may take more than one entry for a single I/O transfer, since each PRP entry covers only 4k.
> With SGLs it can be described by a single data block descriptor, so the SGL processing logic becomes
> simpler than PRP. Such entries from the block layer are rare, though, so SGLs only help in those scenarios.

You're speculating, or you've measured? Processing an SGL is more
complex for the device than processing a PRP list. I would expect a
device to be able to process more IOPS using PRPs than SGLs.

I can see SGLs being faster for:

- The case where we would currently send more than one command (ie we hit
the nvme_split_and_submit() case in nvme_map_bio(), which is the case
when a buffer is not virtually contiguous)
- The case where we can describe the user buffer using a single SGL but would
require more than 2 PRPs (so physically contiguous, but crosses more than
one page boundary). In this case, we can describe everything with a single
command and we don't need a PRP list.
- There is probably a case when the SGL is significantly shorter than
the equivalent PRP list. That's going to vary depending on the drive.
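
To make the second case concrete, here's a minimal userspace sketch (not driver
code; the helper names, address and length are made up for illustration) that
counts how many descriptors each scheme needs for one physically contiguous buffer:

#include <stdio.h>
#include <stdint.h>

#define PAGE_SIZE 4096u

/* PRP entries needed for one contiguous buffer: one per page touched. */
static unsigned prp_entries(uint64_t addr, uint32_t len)
{
	uint64_t first = addr & ~(uint64_t)(PAGE_SIZE - 1);
	uint64_t last  = (addr + len - 1) & ~(uint64_t)(PAGE_SIZE - 1);

	return (unsigned)((last - first) / PAGE_SIZE) + 1;
}

/* A single SGL data block descriptor covers any contiguous range. */
static unsigned sgl_descriptors(uint64_t addr, uint32_t len)
{
	(void)addr;
	(void)len;
	return 1;
}

int main(void)
{
	uint64_t addr = 0x100000;	/* page aligned, hypothetical */
	uint32_t len  = 16 * 1024;	/* 16k, crosses three page boundaries */

	printf("PRP entries: %u, SGL descriptors: %u\n",
	       prp_entries(addr, len), sgl_descriptors(addr, len));
	return 0;
}

A page-aligned 16k buffer needs four PRP entries (so PRP2 has to point at a PRP
list) but only one SGL data block descriptor in the command itself.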

Really though, it needs someone with some hardware to measure before I'm
going to take a patch to enable SGLs. And I'd dearly love to see macro
benchmarks (eg a kernel compile or a MySQL run) more than "I did this dd
operation and ..."

> 2. I have made the driver work with multiple controllers by adding the following members to the nvme_dev structure.
>
> /* The following function pointers select the type of data transfer
> + * based on controller support: either SGL or PRP.
> + * These function pointers are assigned during initialization.
> + */
> +
> + struct nvme_iod *
> + (*alloc_iod)(unsigned nseg, unsigned nbytes, gfp_t gfp);
> +
> + int (*setup_xfer)(struct nvme_dev *dev, struct nvme_common_command *cmd,
> + struct nvme_iod *iod, int total_len, gfp_t gfp);
> +
> + void (*release_iod)(struct nvme_dev *dev, struct nvme_iod *iod);
>
> So, depending on the driver instance, the corresponding transfer routines are called during I/O.

As I've said, I think this decision needs to be taken on a per-command
basis, not a per-device basis.
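
Something like the following sketch is what I mean; the struct, helper name and
heuristic are hypothetical, not the real driver's, but the point is that the
SGL/PRP decision looks at each transfer rather than being fixed at probe time:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define PAGE_SIZE 4096u

/* Hypothetical per-command view of a transfer, not the real iod/cmnd. */
struct xfer {
	unsigned nseg;		/* DMA segments after mapping */
	uint32_t len;		/* total bytes in this command */
	bool ctrl_has_sgl;	/* from the Identify Controller SGLS field */
};

/* Choose SGLs only when the controller supports them and this transfer
 * needs fewer SGL descriptors than PRP entries. */
static bool cmd_wants_sgl(const struct xfer *x)
{
	unsigned prps = (x->len + PAGE_SIZE - 1) / PAGE_SIZE;

	return x->ctrl_has_sgl && x->nseg < prps;
}

int main(void)
{
	struct xfer big  = { .nseg = 1, .len = 128 * 1024, .ctrl_has_sgl = true };
	struct xfer tiny = { .nseg = 1, .len = 4096,       .ctrl_has_sgl = true };

	printf("128k contiguous: %s\n", cmd_wants_sgl(&big)  ? "SGL" : "PRP");
	printf("4k:              %s\n", cmd_wants_sgl(&tiny) ? "SGL" : "PRP");
	return 0;
}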

> + dev->sgls = le32_to_cpup(&ctrl->sgls);
> +
> + if (use_sgl) {
> + if (NVME_CTRL_SUPP_SGL(dev->sgls)) {
> + dev->alloc_iod = nvme_alloc_iod_sgl;
> + dev->setup_xfer = nvme_setup_sgls;
> + dev->release_iod = nvme_free_iod_sgl;
> + } else
> + goto enable_prp;
> + } else {
> + enable_prp:
> + dev->alloc_iod = nvme_alloc_iod;
> + dev->setup_xfer = nvme_setup_prps;
> + dev->release_iod = nvme_free_iod;
> + }

Why not simply:

+ if (use_sgl && NVME_CTRL_SUPP_SGL(dev->sgls)) {
+ dev->alloc_iod = nvme_alloc_iod_sgl;
...
+ } else {
...

> @@ -68,7 +68,9 @@ struct nvme_id_ctrl {
> __u8 vwc;
> __le16 awun;
> __le16 awupf;
> - __u8 rsvd530[1518];
> + __u8 rsvd530[6];
> + __le32 sgls;
> + __u8 rsvd540[1518];

That can't be right. Maybe it's 1508 bytes long?

> struct nvme_id_power_state psd[32];
> __u8 vs[1024];
> };
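
A quick standalone check of that arithmetic (plain userspace C, not the kernel
header): the 6 reserved bytes plus the 4-byte sgls field consume 10 of the
original 1518, so a 1508-byte pad is what keeps psd[] at byte offset 2048:

#include <stdio.h>

int main(void)
{
	unsigned off = 530;	/* start of the old reserved region */

	off += 6;		/* rsvd530[6]    -> sgls at byte 536 */
	off += 4;		/* __le32 sgls   -> next field at 540 */
	off += 1508;		/* rsvd540[1508] -> psd[] at 2048 */

	printf("psd[] offset: %u (expect 2048)\n", off);
	return 0;
}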