On Thu, Aug 20, 2020 at 3:22 AM Keith Busch <kbusch@xxxxxxxxxx> wrote:
On Wed, Aug 19, 2020 at 01:11:58PM -0600, David Fugate wrote:
> Intel does not support making *optional* NVMe spec features *required*
> by the NVMe driver.
This is inaccurate. As another example, the spec optionally allows a
zone size to be a power of 2, but linux requires a power of 2 if you
want to use it with this driver.
> Provided there's no glaring technical issues
There are many. Some may be improved through a serious review process,
but the mess it makes out of the fast path is measurably harmful to
devices that don't subscribe to this. That issue is not so easily
remedied.
Since this patch is a copy of the scsi implementation, the reasonable
thing is take this fight to the generic block layer for a common
solution. We're not taking this in the nvme driver.
I sincerely want to minimize any adverse impact to the fast-path of
non-zoned devices.
My understanding of that aspect is (I feel it is good to confirm,
irrespective of the future of this patch):
1. Submission path:
There is no extra code for non-zoned device IO. For append, there is
this "ns->append_emulate = 1" check.
Snippet -
case REQ_OP_ZONE_APPEND:
- ret = nvme_setup_rw(ns, req, cmd, nvme_cmd_zone_append);
+ if (!nvme_is_append_emulated(ns))
+ ret = nvme_setup_rw(ns, req, cmd, nvme_cmd_zone_append);
+ else {
+ /* prepare append like write, and adjust lba
afterwards */
2. Completion:
Not as clean as submission for sure.
The extra check is "ns && ns->append_emulate == 1" for completions if
CONFIG_ZONED is enabled.
A slightly better completion-handling version (than the submitted patch) is -
- } else if (IS_ENABLED(CONFIG_BLK_DEV_ZONED) &&
- req_op(req) == REQ_OP_ZONE_APPEND) {
- req->__sector = nvme_lba_to_sect(req->q->queuedata,
- le64_to_cpu(nvme_req(req)->result.u64));
+ } else if (IS_ENABLED(CONFIG_BLK_DEV_ZONED)) {
+ struct nvme_ns *ns = req->q->queuedata;
+ /* append-emulation requires wp update for some cmds*/
+ if (ns && nvme_is_append_emulated(ns)) {
+ if (nvme_need_zone_wp_update(req))
+ nvme_zone_wp_update(ns, req, status);
+ }
+ else if (req_op(req) == REQ_OP_ZONE_APPEND)
+ req->__sector = nvme_lba_to_sect(ns,
+ le64_to_cpu(nvme_req(req)->result.u64));
Am I missing any other fast-path pain-points?
A quick 1 minute 4K randwrite run (QD 64, 4 jobs,libaio) shows :
before: IOPS=270k, BW=1056MiB/s (1107MB/s)(61.9GiB/60002msec)
after: IOPS=270k, BW=1055MiB/s (1106MB/s)(61.8GiB/60005msec)
This maynot be the best test to see the cost, and I am happy to
conduct more and improvise.
As for the volume of the code - it is same as SCSI emulation. And I
can make efforts to reduce that by moving common code to blk-zone, and
reuse in SCSI/NVMe emulation.
In the patch I tried to isolate append-emulation by keeping everything
into "nvme_za_emul". It contains nothing nvme'ish except nvme_ns,
which can be turned into "driver_data".
+#ifdef CONFIG_BLK_DEV_ZONED
+struct nvme_za_emul {
+ unsigned int nr_zones;
+ spinlock_t zones_wp_offset_lock;
+ u32 *zones_wp_offset;
+ u32 *rev_wp_offset;
+ struct work_struct zone_wp_offset_work;
+ char *zone_wp_update_buf;
+ struct mutex rev_mutex;
+ struct nvme_ns *ns;
+};
+#endif
Will that be an acceptable line of further work/discussions?