Re: [PATCH v4 00/14] Copy Offload in NVMe Fabrics with P2P PCI Memory

From: Alex Williamson
Date: Tue May 08 2018 - 17:40:49 EST


On Tue, 8 May 2018 17:25:24 -0400
Don Dutile <ddutile@xxxxxxxxxx> wrote:

> On 05/08/2018 12:57 PM, Alex Williamson wrote:
> > On Mon, 7 May 2018 18:23:46 -0500
> > Bjorn Helgaas <helgaas@xxxxxxxxxx> wrote:
> >
> >> On Mon, Apr 23, 2018 at 05:30:32PM -0600, Logan Gunthorpe wrote:
> >>> Hi Everyone,
> >>>
> >>> Here's v4 of our series to introduce P2P based copy offload to NVMe
> >>> fabrics. This version has been rebased onto v4.17-rc2. A git repo
> >>> is here:
> >>>
> >>> https://github.com/sbates130272/linux-p2pmem pci-p2p-v4
> >>> ...
> >>
> >>> Logan Gunthorpe (14):
> >>> PCI/P2PDMA: Support peer-to-peer memory
> >>> PCI/P2PDMA: Add sysfs group to display p2pmem stats
> >>> PCI/P2PDMA: Add PCI p2pmem dma mappings to adjust the bus offset
> >>> PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
> >>> docs-rst: Add a new directory for PCI documentation
> >>> PCI/P2PDMA: Add P2P DMA driver writer's documentation
> >>> block: Introduce PCI P2P flags for request and request queue
> >>> IB/core: Ensure we map P2P memory correctly in
> >>> rdma_rw_ctx_[init|destroy]()
> >>> nvme-pci: Use PCI p2pmem subsystem to manage the CMB
> >>> nvme-pci: Add support for P2P memory in requests
> >>> nvme-pci: Add a quirk for a pseudo CMB
> >>> nvmet: Introduce helper functions to allocate and free request SGLs
> >>> nvmet-rdma: Use new SGL alloc/free helper for requests
> >>> nvmet: Optionally use PCI P2P memory
> >>>
> >>> Documentation/ABI/testing/sysfs-bus-pci | 25 +
> >>> Documentation/PCI/index.rst | 14 +
> >>> Documentation/driver-api/index.rst | 2 +-
> >>> Documentation/driver-api/pci/index.rst | 20 +
> >>> Documentation/driver-api/pci/p2pdma.rst | 166 ++++++
> >>> Documentation/driver-api/{ => pci}/pci.rst | 0
> >>> Documentation/index.rst | 3 +-
> >>> block/blk-core.c | 3 +
> >>> drivers/infiniband/core/rw.c | 13 +-
> >>> drivers/nvme/host/core.c | 4 +
> >>> drivers/nvme/host/nvme.h | 8 +
> >>> drivers/nvme/host/pci.c | 118 +++--
> >>> drivers/nvme/target/configfs.c | 67 +++
> >>> drivers/nvme/target/core.c | 143 ++++-
> >>> drivers/nvme/target/io-cmd.c | 3 +
> >>> drivers/nvme/target/nvmet.h | 15 +
> >>> drivers/nvme/target/rdma.c | 22 +-
> >>> drivers/pci/Kconfig | 26 +
> >>> drivers/pci/Makefile | 1 +
> >>> drivers/pci/p2pdma.c | 814 +++++++++++++++++++++++++++++
> >>> drivers/pci/pci.c | 6 +
> >>> include/linux/blk_types.h | 18 +-
> >>> include/linux/blkdev.h | 3 +
> >>> include/linux/memremap.h | 19 +
> >>> include/linux/pci-p2pdma.h | 118 +++++
> >>> include/linux/pci.h | 4 +
> >>> 26 files changed, 1579 insertions(+), 56 deletions(-)
> >>> create mode 100644 Documentation/PCI/index.rst
> >>> create mode 100644 Documentation/driver-api/pci/index.rst
> >>> create mode 100644 Documentation/driver-api/pci/p2pdma.rst
> >>> rename Documentation/driver-api/{ => pci}/pci.rst (100%)
> >>> create mode 100644 drivers/pci/p2pdma.c
> >>> create mode 100644 include/linux/pci-p2pdma.h
> >>
> >> How do you envision merging this? There's a big chunk in drivers/pci, but
> >> really no opportunity for conflicts there, and there's significant stuff in
> >> block and nvme that I don't really want to merge.
> >>
> >> If Alex is OK with the ACS situation, I can ack the PCI parts and you could
> >> merge it elsewhere?
> >
> > AIUI from previously questioning this, the change is hidden behind a
> > build-time config option, and only custom kernels or distros optimized
> > for this sort of support would enable that build option. I'm more than
> > a little dubious, though, that we're not going to have a wave of distros
> > enabling this only to get user complaints that they can no longer make
> > effective use of their devices for assignment due to the resulting span
> > of the IOMMU groups. Nor is there any sort of compromise: configure
> > the kernel for p2p or for device assignment, not both. Is this really such
> > a unique feature that distro users aren't going to be asking for both
> > features? Thanks,
> >
> > Alex
> At least half the cases presented to me by existing customers want it in a tunable kernel,
> and tunable between two points, if the hw allows it to be 'contained' in that manner, which
> a (layer of) switch(ing) provides.
> To me, that means a kernel cmdline parameter to _enable_ it, and another sysfs (configfs? ... I'm not enough of a configfs aficionado to say which is best)
> method to make two points p2p DMA capable.
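[For illustration only: the interface Don describes does not exist in the posted series. The parameter name and sysfs paths below are invented to make the shape of the proposal concrete.]

```shell
# Hypothetical interface sketch -- nothing below exists in the v4 series.

# 1. Opt in at boot via a kernel command-line parameter:
#        pci=p2pdma_enable

# 2. At run time, nominate exactly two endpoints as a P2P DMA pair
#    (invented sysfs paths):
echo 0000:01:00.0 > /sys/bus/pci/p2pdma/peer_a
echo 0000:02:00.0 > /sys/bus/pci/p2pdma/peer_b
```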

That's not what's done here, AIUI. There are also some complications in
making IOMMU groups dynamic; for instance, could a downstream endpoint
already be in use by a userspace tool while ACS is being twiddled in
sysfs? Probably the easiest solution would be to soft-unplug all devices
affected by the ACS change beforehand and re-add them afterwards. Note
that "affected" does not necessarily mean only the downstream devices if
the downstream port whose ACS we're changing is part of a multifunction
device. Thanks,
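[Editorial aside: the minimum "affected" set Alex refers to can be enumerated from sysfs, since every device exposes an iommu_group link whose devices/ directory lists all group members. A minimal sketch; the SYSFS override is only there so the function can be exercised against a mock tree rather than real hardware:]

```shell
# List every PCI device sharing an IOMMU group with the given device --
# i.e. the set that would need to be soft-unplugged around an ACS change.
# SYSFS defaults to /sys; override it to point at a mock tree for testing.
iommu_group_peers() {
    local dev="$1" root="${SYSFS:-/sys}"
    ls "$root/bus/pci/devices/$dev/iommu_group/devices"
}
```

Usage on a real system: `iommu_group_peers 0000:01:00.0`.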

Alex