Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory

From: Benjamin Herrenschmidt
Date: Wed Apr 12 2017 - 02:24:27 EST


On Thu, 2017-03-30 at 16:12 -0600, Logan Gunthorpe wrote:
> Hello,
>
> As discussed at LSF/MM we'd like to present our work to enable
> copy offload support in NVMe fabrics RDMA targets. We'd appreciate
> some review and feedback from the community on our direction.
> This series is not intended to go upstream at this point.
>
> The concept here is to use memory that's exposed on a PCI BAR as
> data buffers in the NVME target code such that data can be transferred
> from an RDMA NIC to the special memory and then directly to an NVMe
> device avoiding system memory entirely. The upside of this is better
> QoS for applications running on the CPU utilizing memory and lower
> PCI bandwidth required to the CPU (such that systems could be designed
> with fewer lanes connected to the CPU). However, the trade-off is
> currently a reduction in overall throughput, largely due to hardware
> issues that would certainly improve in the future.

Another issue of course is that not all systems support P2P
between host bridges :-) (Though almost all switches can enable it).

> Due to these trade-offs we've designed the system to only enable using
> the PCI memory in cases where the NIC, NVMe devices and memory are all
> behind the same PCI switch.

Ok. I suppose that's a reasonable starting point. I haven't looked
at the patches in detail yet, but it would be nice if that policy was in
a well-isolated component so it can potentially be influenced by
arch/platform code.
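
FWIW, the sort of check I'd expect such a component to encapsulate is
roughly the following. This is only a sketch, and the function names are
made up for illustration, not taken from the patches:

#include <linux/pci.h>

static struct pci_dev *find_upstream_switch_port(struct pci_dev *pdev)
{
	struct pci_dev *up = pci_upstream_bridge(pdev);

	/* Walk the upstream bridges until we hit a switch upstream port */
	while (up) {
		if (pci_is_pcie(up) &&
		    pci_pcie_type(up) == PCI_EXP_TYPE_UPSTREAM)
			return up;
		up = pci_upstream_bridge(up);
	}
	return NULL;
}

static bool devs_behind_same_switch(struct pci_dev *a, struct pci_dev *b)
{
	struct pci_dev *sw_a = find_upstream_switch_port(a);

	/* Both devices must resolve to the same switch upstream port */
	return sw_a && sw_a == find_upstream_switch_port(b);
}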

Do you handle funky address translation too? I.e. the fact that the PCI
bus addresses aren't necessarily the same as the CPU physical addresses
for a BAR?
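
On our platforms the two can differ, so whatever hands out p2p buffers
needs to track both the CPU physical address (for the struct pages and
mappings) and the bus address (for the peer's DMA). Something along these
lines, purely as an illustration:

#include <linux/pci.h>

static void show_bar_addresses(struct pci_dev *pdev, int bar)
{
	struct resource *res = &pdev->resource[bar];
	struct pci_bus_region region;
	phys_addr_t cpu_addr = res->start;

	/* Translate the CPU-side resource into a PCI bus address */
	pcibios_resource_to_bus(pdev->bus, &region, res);

	/* On many x86 boxes these match; on others they do not */
	dev_info(&pdev->dev, "BAR %d: cpu %pa, bus %#llx\n",
		 bar, &cpu_addr, (unsigned long long)region.start);
}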

> This will mean many setups that could likely
> work well will not be supported so that we can be more confident it
> will work and not place any responsibility on the user to understand
> their topology. (We've chosen to go this route based on feedback we
> received at LSF).
>
> In order to enable this functionality we introduce a new p2pmem device
> which can be instantiated by PCI drivers. The device will register some
> PCI memory as ZONE_DEVICE and provide a genalloc-based allocator for
> users of these devices to get buffers.

I don't completely understand this. Is this actual memory on the PCI
bus? Where does it come from? Or are you just trying to create struct
pages that cover your PCIe DMA target?
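
My reading of the above, purely from the cover letter (I may well be
guessing wrong about the actual implementation), is something like the
following sketch, with error paths and the percpu_ref lifetime elided:

#include <linux/pci.h>
#include <linux/memremap.h>
#include <linux/genalloc.h>

static int p2pmem_sketch_init(struct pci_dev *pdev, int bar,
			      struct gen_pool **pool)
{
	struct resource *res = &pdev->resource[bar];
	void *addr;

	/* Create struct pages covering the BAR (ZONE_DEVICE) */
	addr = devm_memremap_pages(&pdev->dev, res, NULL, NULL);
	if (IS_ERR(addr))
		return PTR_ERR(addr);

	/* Simple allocator carving page-sized chunks out of the region */
	*pool = devm_gen_pool_create(&pdev->dev, PAGE_SHIFT, -1, NULL);
	if (!*pool)
		return -ENOMEM;

	/* Track both the kernel virtual and the physical address */
	return gen_pool_add_virt(*pool, (unsigned long)addr, res->start,
				 resource_size(res), -1);
}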

> We give an example of enabling
> p2p memory with the cxgb4 driver, however currently these devices have
> some hardware issues that prevent their use so we will likely be
> dropping this patch in the future. Ideally, we'd want to enable this
> functionality with NVME CMB buffers, however we don't have any hardware
> with this feature at this time.

So correct me if I'm wrong: you are trying to create struct pages that
map a PCIe BAR, right? I'm trying to understand how that interacts with
what Jerome is doing for HMM.

The reason is that HMM currently creates the struct pages with
"fake" PFNs pointing to a hole in the address space rather than
covering the actual PCIe memory of the GPU. He does that to deal with
the fact that some GPUs have a smaller aperture on PCIe than their
total memory.

However, I have asked him to only apply that policy if the aperture is
indeed smaller, and if not, create struct pages that directly cover the
PCIe BAR of the GPU instead, which will work better on systems or
architectures that don't have a "pinhole" window limitation.

However he was under the impression that this was going to collide with
what you guys are doing, so I'm trying to understand how.

> In nvmet-rdma, we attempt to get an appropriate p2pmem device at
> queue creation time and if a suitable one is found we will use it for
> all the (non-inlined) memory in the queue. An 'allow_p2pmem' configfs
> attribute is also created which must be set before any use of p2pmem
> is attempted.
>
> This patchset also includes a more controversial patch which provides an
> interface for userspace to obtain p2pmem buffers through an mmap call on
> a cdev. This enables userspace to fairly easily use p2pmem with RDMA and
> O_DIRECT interfaces. However, the user would be entirely responsible for
> knowing what they're doing, inspecting sysfs to understand the PCI
> topology, and only using it in sane situations.
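
From the user's point of view I guess that boils down to something like
the below (a sketch only; the "/dev/p2pmem0" node name is my invention,
and it is on the user to have checked sysfs for a sane topology first):

#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	size_t len = 2 * 1024 * 1024;

	int fd = open("/dev/p2pmem0", O_RDWR);
	if (fd < 0)
		return 1;

	/* Map p2p memory; the pages back a PCI BAR, not system RAM */
	void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_SHARED, fd, 0);
	if (buf == MAP_FAILED)
		return 1;

	/*
	 * 'buf' could now be registered as an RDMA memory region or used
	 * as an O_DIRECT buffer, so transfers go peer-to-peer.
	 */

	munmap(buf, len);
	close(fd);
	return 0;
}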
>
> Thanks,
>
> Logan
>
>
> Logan Gunthorpe (6):
>   Introduce Peer-to-Peer memory (p2pmem) device
>   nvmet: Use p2pmem in nvme target
>   scatterlist: Modify SG copy functions to support io memory.
>   nvmet: Be careful about using iomem accesses when dealing with p2pmem
>   p2pmem: Support device removal
>   p2pmem: Added char device user interface
>
> Steve Wise (2):
>   cxgb4: setup pcie memory window 4 and create p2pmem region
>   p2pmem: Add debugfs "stats" file
>
>  drivers/memory/Kconfig                          |   5 +
>  drivers/memory/Makefile                         |   2 +
>  drivers/memory/p2pmem.c                         | 697 ++++++++++++++++++++++++
>  drivers/net/ethernet/chelsio/cxgb4/cxgb4.h      |   3 +
>  drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c |  97 +++-
>  drivers/net/ethernet/chelsio/cxgb4/t4_regs.h    |   5 +
>  drivers/nvme/target/configfs.c                  |  31 ++
>  drivers/nvme/target/core.c                      |  18 +-
>  drivers/nvme/target/fabrics-cmd.c               |  28 +-
>  drivers/nvme/target/nvmet.h                     |   2 +
>  drivers/nvme/target/rdma.c                      | 183 +++++--
>  drivers/scsi/scsi_debug.c                       |   7 +-
>  include/linux/p2pmem.h                          | 120 ++++
>  include/linux/scatterlist.h                     |   7 +-
>  lib/scatterlist.c                               |  64 ++-
>  15 files changed, 1189 insertions(+), 80 deletions(-)
>  create mode 100644 drivers/memory/p2pmem.c
>  create mode 100644 include/linux/p2pmem.h
>
> --
> 2.1.4