Re: [PATCH v2 00/10] Copy Offload in NVMe Fabrics with P2P PCI Memory

From: Benjamin Herrenschmidt
Date: Wed Feb 28 2018 - 22:56:28 EST


On Thu, 2018-03-01 at 14:54 +1100, Benjamin Herrenschmidt wrote:
> On Wed, 2018-02-28 at 16:39 -0700, Logan Gunthorpe wrote:
> > Hi Everyone,
>
>
> So Oliver (CC) was having issues getting any of that to work for us.
>
> The problem is that according to him (I didn't double check the latest
> patches) you effectively hotplug the PCIe memory into the system when
> creating struct pages.
>
> This cannot possibly work for us. First we cannot map PCIe memory as
> cacheable. (Note that doing so is a bad idea if you are behind a PLX
> switch anyway since you'd have to manage cache coherency in SW).

Note: I think the above means it won't work behind a switch on x86
either, will it?

> Then our MMIO space is so far away from our memory space that there is
> not enough vmemmap virtual space to be able to do that.
>
> So this can only work across architectures by using something like HMM
> to create special device struct pages.
>
> Ben.
>
>
> > Here's v2 of our series to introduce P2P based copy offload to NVMe
> > fabrics. This version has been rebased onto v4.16-rc3, which already
> > includes Christoph's devpagemap work that the previous version was
> > based on, as well as a couple of the cleanup patches that were in v1.
> >
> > Additionally, we've made the following changes based on feedback:
> >
> > * Renamed everything to 'p2pdma' per the suggestion from Bjorn as well
> > as a bunch of cleanup and spelling fixes he pointed out in the last
> > series.
> >
> > * To address Alex's ACS concerns, we change to a simpler method of
> > just disabling ACS behind switches for any kernel that has
> > CONFIG_PCI_P2PDMA.
> >
> > * We also reject using devices that employ 'dma_virt_ops' which should
> > fairly simply handle Jason's concerns that this work might break with
> > the HFI, QIB and rxe drivers that use the virtual ops to implement
> > their own special DMA operations.
> >
> > Thanks,
> >
> > Logan
> >
> > --
> >
> > This is a continuation of our work to enable using Peer-to-Peer PCI
> > memory in NVMe fabrics targets. Many thanks go to Christoph Hellwig who
> > provided valuable feedback to get these patches to where they are today.
> >
> > The concept here is to use memory that's exposed on a PCI BAR as
> > data buffers in the NVMe target code such that data can be transferred
> > from an RDMA NIC to the special memory and then directly to an NVMe
> > device avoiding system memory entirely. The upside of this is better
> > QoS for applications running on the CPU utilizing memory and lower
> > PCI bandwidth required to the CPU (such that systems could be designed
> > with fewer lanes connected to the CPU). However, presently, the
> > trade-off is a reduction in overall throughput (largely due to
> > hardware issues that would certainly improve in the future).
> >
> > Due to these trade-offs we've designed the system to only enable using
> > the PCI memory in cases where the NIC, NVMe devices and memory are all
> > behind the same PCI switch. This will mean many setups that could likely
> > work well will not be supported so that we can be more confident it
> > will work and not place any responsibility on the user to understand
> > their topology. (We chose to go this route based on feedback we
> > received at the last LSF). Future work may enable these transfers behind
> > a fabric of PCI switches or perhaps using a white list of known good
> > root complexes.
> >
> > In order to enable this functionality, we introduce a few new PCI
> > functions such that a driver can register P2P memory with the system.
> > Struct pages are created for this memory using devm_memremap_pages()
> > and the PCI bus offset is stored in the corresponding pagemap structure.
> >
> > Another set of functions allows a client driver to create a list of
> > client devices that will be used in a given P2P transaction and then
> > use that list to find any P2P memory that is supported by all the
> > client devices. This list is then also used to selectively disable the
> > ACS bits for the downstream ports behind these devices.
> >
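To make the client-list idea above concrete, here is a minimal user-space
sketch of the selection rule as described in the cover letter: a provider of
P2P memory is only usable if every client device sits behind the same switch
as the provider. This is a toy model, not the code from the series. The
struct, the `upstream_switch` identifier and both function names are
hypothetical and exist only for illustration.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a PCI device; not the kernel's struct pci_dev. */
struct pci_dev_model {
	const char *name;
	int upstream_switch;	/* assumed id of the switch above; -1 if none */
	bool has_p2pmem;	/* device has registered P2P memory */
};

/* True if the provider's memory may be used by every client in the list. */
static inline bool p2pmem_usable_by_all(const struct pci_dev_model *provider,
					const struct pci_dev_model *const *clients,
					size_t nclients)
{
	size_t i;

	if (!provider->has_p2pmem || provider->upstream_switch < 0)
		return false;

	for (i = 0; i < nclients; i++) {
		/* Every client must share the provider's switch. */
		if (clients[i]->upstream_switch != provider->upstream_switch)
			return false;
	}
	return true;
}

/* Return the first provider usable by all clients, or NULL if none is. */
static inline const struct pci_dev_model *
p2pmem_find_model(const struct pci_dev_model *providers, size_t nproviders,
		  const struct pci_dev_model *const *clients, size_t nclients)
{
	size_t i;

	for (i = 0; i < nproviders; i++) {
		if (p2pmem_usable_by_all(&providers[i], clients, nclients))
			return &providers[i];
	}
	return NULL;
}
```

With a NIC and an NVMe device behind switch 1, a provider behind switch 1
would be selected; adding a client behind a different switch makes the lookup
return NULL, which matches the "same PCI switch" restriction stated earlier.
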
> > In the block layer, we also introduce a P2P request flag to indicate a
> > given request targets P2P memory as well as a flag for a request queue
> > to indicate a given queue supports targeting P2P memory. P2P requests
> > will only be accepted by queues that support it. Also, P2P requests
> > are marked to not be merged, since a non-homogeneous request would
> > complicate the DMA mapping requirements.
> >
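The two block-layer rules above (a P2P request is only accepted by a queue
that advertises P2P support, and P2P requests are never merged) can be
sketched as a pair of predicates. This is a simplified stand-alone model; the
flag names and structs below are hypothetical and are not the identifiers the
patches add to blk_types.h or blkdev.h.

```c
#include <stdbool.h>

/* Hypothetical flag bits, for illustration only. */
#define REQ_P2P_MODEL	(1u << 0)	/* request targets P2P memory */
#define QUEUE_P2P_MODEL	(1u << 0)	/* queue can handle P2P memory */

struct request_model {
	unsigned int flags;
};

struct queue_model {
	unsigned int flags;
};

/* A P2P request needs a P2P-capable queue; other requests go anywhere. */
static inline bool queue_accepts(const struct queue_model *q,
				 const struct request_model *rq)
{
	if (rq->flags & REQ_P2P_MODEL)
		return (q->flags & QUEUE_P2P_MODEL) != 0;
	return true;
}

/* Refuse to merge if either side is P2P, so a request stays homogeneous. */
static inline bool requests_mergeable(const struct request_model *a,
				      const struct request_model *b)
{
	return !((a->flags | b->flags) & REQ_P2P_MODEL);
}
```

Refusing the merge whenever either request is P2P is the simplest policy that
keeps every request entirely in P2P memory or entirely in system memory, so
the DMA mapping path only has to deal with one kind of address per request.
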
> > In the PCI NVMe driver, we modify the existing CMB support to utilize
> > the new PCI P2P memory infrastructure and also add support for P2P
> > memory in its request queue. When a P2P request is received it uses the
> > pci_p2pmem_map_sg() function which applies the necessary transformation
> > to get the correct pci_bus_addr_t for the DMA transactions.
> >
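The "necessary transformation" mentioned above is, as far as the cover letter
describes it, a subtraction of the bus offset recorded in the pagemap when the
memory was registered. Below is a minimal user-space sketch of that
arithmetic, assuming the offset is defined as the CPU physical base of the BAR
minus its PCI bus address. All type, field and function names are
hypothetical; this is not pci_p2pmem_map_sg() itself.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for phys_addr_t and pci_bus_addr_t. */
typedef uint64_t phys_addr_model_t;
typedef uint64_t bus_addr_model_t;

/* Model of the per-BAR data kept alongside the struct pages. */
struct p2p_pagemap_model {
	uint64_t bus_offset;	/* CPU physical base minus PCI bus base */
};

/* Model of one scatterlist entry. */
struct sg_entry_model {
	phys_addr_model_t phys;		/* CPU physical address of the buffer */
	size_t len;
	bus_addr_model_t dma_addr;	/* filled in by the mapping step */
};

/* Record the offset once, when the P2P memory is registered. */
static inline void p2p_pagemap_init(struct p2p_pagemap_model *pm,
				    phys_addr_model_t cpu_base,
				    bus_addr_model_t bus_base)
{
	/* Unsigned arithmetic wraps, so this also works if bus_base > cpu_base. */
	pm->bus_offset = cpu_base - bus_base;
}

/* Translate each entry's CPU physical address into a PCI bus address. */
static inline int p2p_map_sg_model(const struct p2p_pagemap_model *pm,
				   struct sg_entry_model *sg, int nents)
{
	int i;

	for (i = 0; i < nents; i++)
		sg[i].dma_addr = sg[i].phys - pm->bus_offset;
	return nents;
}
```

On platforms where the BAR appears at the same address on the bus as in CPU
physical space, the offset is zero and the mapping is the identity. On systems
where MMIO space is translated, the offset is what lets a peer device be given
an address it can actually reach.
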
> > In the RDMA core, we also adjust rdma_rw_ctx_init() and
> > rdma_rw_ctx_destroy() to take a flags argument which indicates whether
> > to use the PCI P2P mapping functions or not.
> >
> > Finally, in the NVMe fabrics target port we introduce a new
> > configuration boolean: 'allow_p2pmem'. When set, the port will attempt
> > to find P2P memory supported by the RDMA NIC and all namespaces. If
> > supported memory is found, it will be used in all IO transfers. And if
> > a port is using P2P memory, adding new namespaces that are not supported
> > by that memory will fail.
> >
> > Logan Gunthorpe (10):
> >   PCI/P2PDMA: Support peer to peer memory
> >   PCI/P2PDMA: Add sysfs group to display p2pmem stats
> >   PCI/P2PDMA: Add PCI p2pmem dma mappings to adjust the bus offset
> >   PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
> >   block: Introduce PCI P2P flags for request and request queue
> >   IB/core: Add optional PCI P2P flag to rdma_rw_ctx_[init|destroy]()
> >   nvme-pci: Use PCI p2pmem subsystem to manage the CMB
> >   nvme-pci: Add support for P2P memory in requests
> >   nvme-pci: Add a quirk for a pseudo CMB
> >   nvmet: Optionally use PCI P2P memory
> >
> > Documentation/ABI/testing/sysfs-bus-pci | 25 ++
> > block/blk-core.c | 3 +
> > drivers/infiniband/core/rw.c | 21 +-
> > drivers/infiniband/ulp/isert/ib_isert.c | 5 +-
> > drivers/infiniband/ulp/srpt/ib_srpt.c | 7 +-
> > drivers/nvme/host/core.c | 4 +
> > drivers/nvme/host/nvme.h | 8 +
> > drivers/nvme/host/pci.c | 118 ++++--
> > drivers/nvme/target/configfs.c | 29 ++
> > drivers/nvme/target/core.c | 95 ++++-
> > drivers/nvme/target/io-cmd.c | 3 +
> > drivers/nvme/target/nvmet.h | 10 +
> > drivers/nvme/target/rdma.c | 43 +-
> > drivers/pci/Kconfig | 20 +
> > drivers/pci/Makefile | 1 +
> > drivers/pci/p2pdma.c | 713 ++++++++++++++++++++++++++++++++
> > drivers/pci/pci.c | 4 +
> > include/linux/blk_types.h | 18 +-
> > include/linux/blkdev.h | 3 +
> > include/linux/memremap.h | 19 +
> > include/linux/pci-p2pdma.h | 105 +++++
> > include/linux/pci.h | 4 +
> > include/rdma/rw.h | 7 +-
> > net/sunrpc/xprtrdma/svc_rdma_rw.c | 6 +-
> > 24 files changed, 1204 insertions(+), 67 deletions(-)
> > create mode 100644 drivers/pci/p2pdma.c
> > create mode 100644 include/linux/pci-p2pdma.h
> >
> > --
> > 2.11.0