On 5/31/19 1:45 PM, Roman Penyaev wrote:
On 2019-05-31 18:54, Jens Axboe wrote:
On 5/31/19 10:02 AM, Roman Penyaev wrote:
On 2019-05-31 16:48, Jens Axboe wrote:
On 5/16/19 2:57 AM, Roman Penyaev wrote:
This is v3 which introduces pollable epoll from userspace.
- Measurements made, represented below.
- Fix alignment for epoll_uitem structure on all 64-bit archs
x86-64. epoll_uitem should be always 16 bit, proper
is added. (Linus)
- Check pollflags explicitly on 0 inside work callback, and do
- No reallocations, the max number of items (thus size of the
is specified by the caller.
- Interface is simplified: -ENOSPC is returned on attempt to add
epoll item if the number has reached the max, nothing more.
- Alloced pages are accounted using user->locked_vm and limited
- EPOLLONESHOT is handled.
This series introduces pollable epoll from userspace, i.e. user
epfd with a new EPOLL_USERPOLL flag, mmaps epoll descriptor, gets
and ring pointers and then consumes ready events from a ring,
epoll_wait() call. When ring is empty, user has to call
in order to wait for new events. epoll_wait() returns -ESTALE if
ring has events in the ring (kind of indication, that user has to
events from the user ring first, I could not invent anything better
For user header and user ring allocation I used vmalloc_user(). I
that it is much easier to reuse remap_vmalloc_range_partial() instead
dealing with page cache (like aio.c does). What is also nice is
virtual address is properly aligned on SHMLBA, thus there should not
any d-cache aliasing problems on archs with vivt or vipt caches.
Why aren't we just adding support to io_uring for this instead? Then
don't need yet another entirely new ring, that's just a little
different from what we have.
I haven't looked into the details of your implementation, just
if there's anything that makes using io_uring a non-starter for this
Afaict the main difference is that you do not need to recharge an fd
(submit new poll request in terms of io_uring): once fd has been added
epoll with epoll_ctl() - we get events. When you have thousands of
that should matter.
Another interesting question is how difficult it is to modify existing event
in event libraries in order to support recharging (EPOLLONESHOT in
Maybe Azat who maintains libevent can shed light on this (currently I
that libevent does not support "EPOLLONESHOT" logic).
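For readers unfamiliar with what "recharging" costs, here is a minimal sketch using the existing epoll API. The helper name oneshot_wakeups() and the pipe-based setup are made up for illustration and are not part of the series. It keeps one unread byte in a pipe and counts how often epoll_wait() reports the fd, with and without re-arming via EPOLL_CTL_MOD.

```c
#include <assert.h>
#include <sys/epoll.h>
#include <unistd.h>

/*
 * Hypothetical helper: count how many of `rounds` non-blocking epoll_wait()
 * calls report the read end of a pipe that stays readable the whole time.
 * The fd is registered with EPOLLONESHOT; it is re-armed after each event
 * only when `rearm` is non-zero. Returns -1 on setup failure.
 */
int oneshot_wakeups(int rounds, int rearm)
{
    struct epoll_event ev = { .events = EPOLLIN | EPOLLONESHOT };
    struct epoll_event out;
    int pfd[2], epfd, i, seen = -1;

    assert(rounds > 0);
    if (pipe(pfd) < 0)
        return -1;
    epfd = epoll_create1(0);
    if (epfd < 0)
        goto out_pipe;

    ev.data.fd = pfd[0];
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, pfd[0], &ev) < 0)
        goto out_ep;
    /* One byte is written and never read, so the fd stays readable. */
    if (write(pfd[1], "x", 1) != 1)
        goto out_ep;

    seen = 0;
    for (i = 0; i < rounds; i++) {
        if (epoll_wait(epfd, &out, 1, 0) != 1)
            continue;               /* fd is disarmed, nothing reported */
        seen++;
        /* The "recharge": one extra syscall per event per fd. */
        if (rearm && epoll_ctl(epfd, EPOLL_CTL_MOD, pfd[0], &ev) < 0)
            break;
    }
out_ep:
    close(epfd);
out_pipe:
    close(pfd[0]);
    close(pfd[1]);
    return seen;
}
```

Without the re-arm the fd is reported once and then stays silent. With it, every round costs an additional epoll_ctl() call, which is the overhead in question when thousands of fds are involved.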
In terms of existing io_uring poll support, which is what I'm guessing
you're referring to, it is indeed just one-shot.
But there's no reason why we can't have it persist until explicitly
canceled with POLL_REMOVE.
It seems not so easy. The main problem is that with only a ring it is
impossible to figure out on kernel side what event bits have been
seen by the userspace and what bits are new. So every new cqe has to
be added to a completion ring on each wake_up_interruptible() call.
(I mean when fd wants to report that something is ready).
IMO that can lead to many duplicate events (tens? hundreds? honestly no
idea), which userspace has to handle with subsequent read/write calls.
It can kill all performance benefits of a uring.
In uepoll this is solved with another piece of shared memory, where
userspace atomically clears bits and kernel side sets bits. If kernel
observes that no bits were set before (i.e. userspace has already
consumed the previous event) - new index is added to a ring.
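A minimal single-threaded sketch of that scheme, to make the deduplication concrete. All names here (struct uitem_sim, struct uring_sim, sim_post(), sim_consume()) and the sizes are hypothetical and are not taken from the patches. It assumes a ring entry is only needed when the item had no pending bits, so repeated wakeups for an unconsumed item merge into the existing entry instead of producing duplicates.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define SIM_NITEMS  4   /* hypothetical max number of items */
#define SIM_RING_SZ 8   /* hypothetical ring size, >= SIM_NITEMS */

/* Shared per-item word: the "kernel" sets bits, "userspace" clears them. */
struct uitem_sim {
    _Atomic uint32_t ready;
};

struct uring_sim {
    struct uitem_sim items[SIM_NITEMS];
    uint32_t ring[SIM_RING_SZ];   /* indices into items[] */
    unsigned head, tail;          /* plain ints: this sketch is single-threaded */
};

/* "Kernel" side: publish event bits. Returns 1 if a ring entry was added. */
int sim_post(struct uring_sim *r, unsigned idx, uint32_t bits)
{
    uint32_t old = atomic_fetch_or(&r->items[idx].ready, bits);

    if (old != 0)
        return 0;   /* an entry is already pending, bits were merged into it */
    /* At most one pending entry per item, so the ring cannot overflow. */
    assert(r->tail - r->head < SIM_RING_SZ);
    r->ring[r->tail++ % SIM_RING_SZ] = idx;
    return 1;
}

/* "User" side: pop one entry and atomically take its bits. 0 if empty. */
int sim_consume(struct uring_sim *r, unsigned *idx, uint32_t *bits)
{
    if (r->head == r->tail)
        return 0;
    *idx = r->ring[r->head++ % SIM_RING_SZ];
    *bits = atomic_exchange(&r->items[*idx].ready, 0);
    return 1;
}
```

Posting 0x1 and then 0x4 for the same item before userspace runs yields a single ring entry carrying 0x5. A ring-only design would have queued two completions for the same fd.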
Those are good points.
Can we extend the io_uring API to support this behavior? It would also
be great if we could make the event path lockless. With a big number of
fds and frequent events this matters; please take a look, recently I
did some measurements: https://lkml.org/lkml/2018/12/12/305
Yeah, I'd be happy to entertain that idea, and lockless completions as
well. We already do that for polled IO, but consider any "normal"
completion to be IRQ driven and hence need locking.