Re: NVMe over Fabrics target implementation
From: Nicholas A. Bellinger
Date: Wed Jun 08 2016 - 01:21:48 EST
On Tue, 2016-06-07 at 12:55 +0200, Christoph Hellwig wrote:
> There is absolutely no point in dragging in an overcomplicated configfs
> structure for a very simple protocol which also is very different from
> SCSI in it's nitty gritty details.
Please be more specific wrt the two individual points that have been
raised.
> Keeping the nvme target self contains
> allows it to be both much simpler and much easier to understand, as well
> as much better testable - see the amount of test coverage we could easily
> add for example.
I disagree.
>
> Or to put it the other way around - if there was any major synergy in
> reusing the SCSI target code that just shows we're missing functionality
> in the block layer or configfs.
>
To reiterate the points:
*) Extensible to multiple types of backend drivers.
nvme-target needs a way to absorb new backend drivers that
does not affect existing configfs group layout or attributes.
Looking at the nvmet/configfs layout as-is, there are no multiple
backend types defined, nor a way to control backend feature bits
exposed to nvme namespaces at runtime.
What is being proposed is a way to share target-core backends via
existing configfs symlinks across SCSI and NVMe targets.
Which means:
- All I/O state + memory submission is done at the RCU-protected
se_device level via sbc_ops
- percpu reference counting is done outside of target-core
- Absorb all nvmet/io-cmd optimizations into target_core_iblock.c
- Base starting point for features in SCSI + NVMe that span
across multiple endpoints and instances (reservations + APTPL,
multipath, copy-offload across fabric types)
Using target-core backends means we get features like T10-PI and
sbc_ops->write_same for free that don't exist in nvmet, and can
utilize a common set of backend drivers for SCSI and NVMe via an
existing configfs ABI and python userspace community.
And to the second, and more important, point: defining a configfs ABI
that works both for today's requirements and well into the 2020s
without breaking user-space compatibility.
As-is, the initial design using top level nvmet configfs symlinks of
subsystem groups into individual port + host groups does not scale.
That is, it currently does:
- Sequential list lookup under global rw_mutex of top-level nvmet_port
and nvmet_host symlink ->allow_link() and ->drop_link() configfs
callbacks.
- nvmet_fabrics_ops->add_port() callback invoked under the same global
rw_mutex.
This is very bad for several reasons.
As-is, this blocks all other configfs port + host operations from
occurring even during normal operation, which makes it quite useless for
any type of multi-tenant target environment where the individual target
endpoints *must* be able to operate independently.
Seriously, there is never a good reason why configfs group or item
callbacks should be performing list lookup under a global lock at
this level.
Why does it ever make sense for $SUBSYSTEM_NQN_0 with $PORT_DRIVER_FOO
to block operation of $SUBSYSTEM_NQN_1 with $PORT_DRIVER_BAR..?
A simple example where this design breaks down quickly is an NVMf
ops->add_port() call that requires a HW reset, or say reloading of
firmware that can take multiple seconds. (qla2xxx comes to mind).
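The effect can be modelled in userspace. The sketch below is only a simulation under stated assumptions: a plain `threading.Lock` stands in for the global lock (the real one is a reader/writer lock in the kernel), `slow_add_port()` is a hypothetical stand-in for a driver callback that takes a long time, and the per-port lock variant is just one possible alternative, not a proposed patch.

```python
import threading
import time

ADD_PORT_DELAY = 0.2  # seconds; stand-in for a HW reset or firmware reload


def slow_add_port(delay: float) -> None:
    """Hypothetical slow driver callback."""
    time.sleep(delay)


def run(ports, use_global_lock: bool, delay: float = ADD_PORT_DELAY) -> float:
    """Enable every port concurrently; return wall-clock seconds taken."""
    ports = list(ports)
    global_lock = threading.Lock()
    per_port = {p: threading.Lock() for p in ports}

    def enable(port) -> None:
        # Either every port contends on one lock, or each has its own.
        lock = global_lock if use_global_lock else per_port[port]
        with lock:
            slow_add_port(delay)

    threads = [threading.Thread(target=enable, args=(p,)) for p in ports]
    start = time.monotonic()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.monotonic() - start


if __name__ == "__main__":
    print("global lock   :", round(run(range(4), True), 2), "s")
    print("per-port locks:", round(run(range(4), False), 2), "s")
```

With one shared lock the total time grows with the number of ports, because each slow callback waits for the previous one. With independent locks the callbacks overlap.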
There is a simple test to highlight this limitation. Take any
nvme-target driver that is capable of multiple ports, and introduce
a sleep(5) into each ops->add_port() call.
Now create 256 different subsystem NQNs with 256 different ports
across four different user-space processes.
What happens to the configfs symlinks of other subsystems, ports and
host groups when this occurs..?
What happens to the other user-space processes..?
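A user-space driver for this test could look roughly like the sketch below. It assumes a configfs-style tree with `subsystems/` and `ports/<id>/subsystems/` directories under a root such as `/sys/kernel/config/nvmet`; the `link_subsystems()` helper and the NQN strings are hypothetical. A real target would likely also need each port's address attributes written before the link is accepted, which is omitted here.

```python
import os
import time


def link_subsystems(root: str, count: int, first: int = 0) -> list:
    """Create `count` subsystem/port pairs under `root`; time each symlink()."""
    timings = []
    for i in range(first, first + count):
        nqn = f"nqn.2016-06.example:subsys{i}"       # hypothetical NQN
        subsys = os.path.join(root, "subsystems", nqn)
        port = os.path.join(root, "ports", str(i + 1))
        os.makedirs(subsys, exist_ok=True)
        os.makedirs(os.path.join(port, "subsystems"), exist_ok=True)
        start = time.monotonic()
        # On a real target this symlink is where ->allow_link() runs.
        os.symlink(subsys, os.path.join(port, "subsystems", nqn))
        timings.append(time.monotonic() - start)
    return timings
```

To approximate the scenario described, run four copies in parallel with `count=64` and `first` set to 0, 64, 128 and 192, then compare the per-link timings across the processes.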