Re: [RFC PATCH v3 05/12] netdev: netdevice devmem allocator

From: Pavel Begunkov
Date: Fri Nov 10 2023 - 13:19:01 EST


On 11/7/23 23:03, Mina Almasry wrote:
On Tue, Nov 7, 2023 at 2:55 PM David Ahern <dsahern@xxxxxxxxxx> wrote:

On 11/7/23 3:10 PM, Mina Almasry wrote:
On Mon, Nov 6, 2023 at 3:44 PM David Ahern <dsahern@xxxxxxxxxx> wrote:

On 11/5/23 7:44 PM, Mina Almasry wrote:
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index eeeda849115c..1c351c138a5b 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -843,6 +843,9 @@ struct netdev_dmabuf_binding {
};

#ifdef CONFIG_DMA_SHARED_BUFFER
+struct page_pool_iov *
+netdev_alloc_devmem(struct netdev_dmabuf_binding *binding);
+void netdev_free_devmem(struct page_pool_iov *ppiov);

netdev_{alloc,free}_dmabuf?


Can do.

I say that because a dmabuf can be host memory, at least I am not aware
of a restriction that a dmabuf is device memory.


In my limited experience dma-buf is generally device memory, and
that's really its use case. CONFIG_UDMABUF is a driver that mocks
dma-buf with a memfd which I think is used for testing. But I can do
the rename, it's more clear anyway, I think.

config UDMABUF
bool "userspace dmabuf misc driver"
default n
depends on DMA_SHARED_BUFFER
depends on MEMFD_CREATE || COMPILE_TEST
help
A driver to let userspace turn memfd regions into dma-bufs.
Qemu can use this to create host dmabufs for guest framebuffers.


Qemu is just a userspace process; it is no way a special one.

Treating host memory as a dmabuf should radically simplify the io_uring
extension of this set.

I agree actually, and I was about to make that comment to David Wei's
series once I have the time.

David, your io_uring RX zerocopy proposal actually works with devmem
TCP, if you're inclined to do that instead, what you'd do roughly is
(I think):
That would be a Frankenstein's monster api with no good reason for it.
You bind memory via netlink because you don't have a proper context to
work with otherwise, io_uring serves as the context with a separate and
precise abstraction around queues. Same with dmabufs, it totally makes
sense for device memory, but wrapping host memory into a file just to
immediately unwrap it back with no particular benefits from doing so
doesn't seem like a good uapi. Currently, the difference will be
hidden by io_uring.

And we'd still need to have a hook in pp's get page to grab buffers from
the buffer ring instead of refilling via SO_DEVMEM_DONTNEED and a
callback for when skbs are dropped. It's just instead of a new pp ops
it'll be a branch in the devmem path. io_uring might want to use the
added iov format in the future for device memory or even before that,
io_uring doesn't really care whether it's pages or not.

It's also my big concern from how many optimisations it'll fence us off.
With the current io_uring RFC I can get rid of all buffer atomic
refcounting and replace it with a single percpu counting per skb.
Hopefully, that will be doable after we place it on top of pp providers.


- Allocate a memfd,
- Use CONFIG_UDMABUF to create a dma-buf out of that memfd.
- Bind the dma-buf to the NIC using the netlink API in this RFC.
- Your io_uring extensions and io_uring uapi should work as-is almost
on top of this series, I think.

If you do this the incoming packets should land into your memfd, which
may or may not work for you. In the future if you feel inclined to use
device memory, this approach that I'm describing here would be more
extensible to device memory, because you'd already be using dma-bufs
for your user memory; you'd just replace one kind of dma-buf (UDMABUF)
with another.

That the io_uring set needs to dive into
page_pools is just wrong - complicating the design and code and pushing
io_uring into a realm it does not need to be involved in.

I disagree. How does it complicate it? io_uring will be just a yet another
provider implementing the callbacks of the API created for such use cases
and not changing common pp/net bits. The rest of code is in io_uring
implementing interaction with userspace and other usability features, but
there will be anyway some amount of code if we want to have a convenient
and performant api via io_uring.



Most (all?) of this patch set can work with any memory; only device
memory is unreadable.

--
Pavel Begunkov