On Wed, Mar 6, 2024 at 11:14 AM Pavel Begunkov <asml.silence@xxxxxxxxx> wrote:..
On 3/6/24 17:04, Mina Almasry wrote:
On Wed, Mar 6, 2024 at 6:30 AM Pavel Begunkov <asml.silence@xxxxxxxxx> wrote:
On 3/5/24 22:36, Mina Almasry wrote:
To be honest, I think it makes sense for the TCP stack to be
responsible for putting the references that belong to it and the
application. To me, it does not make much sense for the page pool to
be responsible for putting the reference that belongs to the TCP stack
or driver via a page_pool_scrub() function, as those references do not
belong to the page pool really. I'm not sure why there is a diff
between our use cases here because I'm not an io_uring expert. Why do
you need to scrub all the references on page pool destruction? Don't
these belong to non-page pool components like io_uring stack or TCP
stack ol otherwise?
That one is about cleaning buffers that are in b/w 4 and 5, i.e.
owned by the user, which devmem does at sock destruction. io_uring
could get by without scrub, dropping user refs while unregistering
ifq, but then it'd need to wait for all requests to finish so there
is no step 4 in the meantime. Might change, can be useful, but it
was much easier to hook into the pp release loop.
Another concern is who and when can reset ifq / kill pp outside
of io_uring/devmem. I assume it can happen on a whim, which is
hard to handle gracefully.
If this is about dropping application refs in step 4 & step 5, then
from devmem TCP perspective it must be done on socket close & skb
freeing AFAIU, and not delayed until page_pool destruction.
Right, something in the kernel should take care of it. You temporarily
attach it to the socket, which is fine. And you could've also stored
it in the netlink socket or some other object. In case of zcrx io_uring
impl, it's bound to io_uring, io_uring is responsible for cleaning them
up. And we do it before __page_pool_destroy, otherwise there would be
a ref dependency.
AFAIU today the page_pool_release() waits until there are no more
pages in flight before calling __page_pool_destroy(), and in existing
use cases it's common for the page_pool to stay alive after
page_pool_destroy() is called until all the skbs waiting in the
receive queue to be recvmsg()'d are received and the page_pool is
freed. I just didn't modify that behavior.
A side note, attaching to netlink or some other global object sounds
conceptually better, as once you return a buffer to the user, the
socket should not have any further business with the buffer. FWIW,
that better resembles io_uring approach. For example allows to:
recv(sock);
close(sock);
process_rx_buffers();
or to return (i.e. DEVMEM_DONTNEED) buffers from different sockets
in one call. However, I don't think it's important for devmem and
perhaps more implementation dictated.
Hmm I think this may be a sockets vs io_uring difference? For sockets
there is no way to recvmsg() new buffers after the sock close and
there is no way to release buffers to the kernel via the setsockopt()
after the sock close.
But I don't think we need to align on everything here, right? If
page_pool_scrub makes more sense in your use case because the lifetime
of the io_uring buffers is different I don't see any issue with you
extending the ops with page_pool_scrub(), and not define it for the
dmabuf_devmem provider which doesn't need a scrub, right?
Think
about a stupid or malicious user that does something like:
1. Set up dmabuf binding using netlink api.
2. While (100000):
3. create devmem TCP socket.
4. receive some devmem data on TCP socket.
5. close TCP socket without calling DEVMEM_DONTNEED.
6. clean up dmabuf binding using netlink api.
In this case, we need to drop the references in step 5 when the socket
is destroyed, so the memory is freed to the page pool and available
for the next socket in step 3. We cannot delay the freeing until step
6 when the rx queue is recreated and the page pool is destroyed,
otherwise the net_iovs would leak in the loop and eventually the NIC
would fail to find available memory. The same bug would be
By "would leak" you probably mean until step 6, right? There are
Yes, sorry I wasn't clear!
always many ways to shoot yourself in the leg. Even if you clean
up in 5, the user can just leak the socket and get the same result
with pp starvation. I see it not as a requirement but rather a
uapi choice, that's assuming netlink would be cleaned as a normal
socket when the task exits.
Yes, thanks for pointing out. The above was a pathological example
meant to describe the point, but I think this generates a realistic
edge case I may run into production. I don't know if you care about
the specifics, but FWIW we split our userspace into an orchestrator
that allocates dma-bufs and binds them via netlink and the ML
application that creates tcp connections. We do this because then the
orchestrator needs CAP_NET_ADMIN for netlink but the ML applications
do not. If we delay dropping references until page_pool_destroy then
we delay dropping references until the orchestrator exists, i.e. we
risk one ML application crashing, leaving references unfreed, and the
next application (that reuses the buffers) seeing a smaller address
space because the previous application did not get to release them
before crash and so on.
Now of course it's possible to work around this by making sure we
don't reuse bound buffers (when they should be reusable for the same
user), but in general I think in the socket use case it's a bit
unnatural IMO for one socket to leave state behind like this and this
would be a subtlety that the userspace needs to take care of, but like
you said, maybe a uapi or buffer lifetime choice.
reproducible with io_uring unless you're creating a new page pool for
each new io_uring socket equivalent.
Surely we don't, but it's still the user's responsibility to
return buffers back. And in case of io_uring buffers returned
to the user are not attached to a socket, so even the
scope / lifetime is a bit different.
Yes, sorry, without understanding the specifics it seems your lifetime
management is different. IMO it's not an issue if we diverge in this
aspect.