On Wed, Jul 12, 2023 at 6:01 AM Jason Gunthorpe <jgg@xxxxxxxx> wrote:
On Tue, Jul 11, 2023 at 08:42:24PM -0700, Mina Almasry wrote:I think maybe you misunderstood the proposal. AFAICT the genalloc
1. The device memory driver would be the p2pdma provider. It wouldThis is not quite right, if you convert any of the GPU drivers to use
expose a user API which allocates a device memory region, calls
pci_p2pdma_add_resource() and pci_p2pmem_publish() on it, and returns
a reference to it to the userspace.
P2PDMA you are going to need to restructure the p2pmem stuff to
seperate the genalloc. The GPU driver must continue to be the owner
and allocator of the MMIO memory it already controls, we can't have
two allocators working in parallel.
The genalloc stuff supports the special NVMe use case, I don't know of
anything else that would like to work that way.
stuff works for us, although there are other issues with p2pdma that I
need to solve.
The proposal was that the uapi in step #1 allocates a region of GPU
memory, and sets up a p2pdma provider for this region of memory. From
the perspective of the GPU, the memory is allocated, and in use by the
user. The p2pdma provider uses genalloc to give out 4K regions with
struct pages to in-kernel allocators from this memory region. Why
would that not work?
Looking at the code, that seems to be how p2pdma
works today. The p2pdma provider does p2pdma_add_resource() on a chunk
of its memory, and the genalloc allocates memory from that chunk?
The actual issues I see with this approach are:
1. There is no way for the p2pdma provider to relinquish back the
memory it has provided via pci_p2pdma_add_resource(), in the case that
the user crashed or would like to free the GPU buffer. I would need to
add a pci_p2pdma_remove_resource(). Would that be acceptable?
2. The p2pdma semantics seem to be global to the pci device. I.e., 1
GPU can export 1 p2pdma resource at a time (the way I'm reading the
API). This is not usable for my use case. I would need multiple users
to be able to use the uapi in step #1 simultaneously. I would need the
same pci device to export different p2pdma resources simultaneously
and the p2pdma clients would need to be able to import some of the
resources. I would likely need to add an api like this:
diff --git a/include/linux/pci-p2pdma.h b/include/linux/pci-p2pdma.h
index 8318a97c9c61..c9d754713fdc 100644
--- a/include/linux/pci-p2pdma.h
+++ b/include/linux/pci-p2pdma.h
@@ -19,6 +19,33 @@ struct scatterlist;
#ifdef CONFIG_PCI_P2PDMA
int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar, size_t size,
u64 offset);
+
+/* Adds a resource similar to pci_p2pdma_add_resource, and returns a file
+ * handle referring to this resource. Multiple such resources can be exported
+ * by the same pci device.
+ */
+struct file *pci_p2pdma_add_resource_with_handle(struct pci_dev *pdev,
+ int bar,
+ size_t size,
+ u64 offset);
+
+/* Remove a resource added via pci_p2pdma_add_resource_with_handle() */
+struct file *pci_p2pdma_remove_resource_with_handle(
+ struct file *p2pdma_resource_file);
+
+/* Allocates memory from a resource created using
+ * pci_p2pdma_add_resource_with_handle()
+ */
+void *pci_alloc_p2pmem_from_handle(struct file *p2pdma_resource_file,
+ size_t size);
+
+/* Frees p2pmem to a resource created using
+ * pci_p2pdma_add_resource_with_handle()
+ */
+void pci_free_p2pmem_to_handle(struct pci_dev *p2pdma_resource_file,
+ void *addr,
+ size_t size);
+
Looking for feedback from anyone knowledgeable, but the p2pdma
maintainers as well if possibl.
I think this comment is maybe a side effect of the misunderstanding.2. The NIC driver would be the p2pdma client and orchestrator. ItThis doesn't fit the programming model for GPUs at all. You don't want
would expose a user API which binds an rxq to a pci device. Prior to
the bind the user API would check that the pci device has published
p2p memory (pci_has_p2pmem()), and check the the p2p mem is accessible
to the driver (pci_p2pdma_distance() I think), etc.
to get packets landing in random GPU memory that a kernel side
allocator selects, you want packets landing in GPU memory owned by a
specific process that owns the TCP connection.
In the proposal, the user allocates a GPU buffer using the API in step
#1, and then binds the memory to the NIC rx queue using the API
specified in step #2. We use flow steering & RSS to steer this user's
TCP traffic to the buffer owned by them.
This is why DMABUF is used here as it gives a handle to the GPU
memory. What you want is to get the P2P pages either directly from the
DMABUF or via pin_user_pages() on the DMABUF's mmap.
AFAICT, all the concerns brought up in this thread are sidestepped byWell, as I said it is going to be a big ask to P2P enable any of the
using p2pdma. I need not allocate struct pages in the core dma-buf
code anymore (or anywhere), and I need not allocate pgmaps. I would
just re-use the p2pdma support.
DRM drivers.
And you still have the netmem vs zone_device struct page conflict to
figure out
But it is alot closer to reasonable than this RFC.
Jason