Re: [PATCH v3 00/13] epoll: support pollable epoll from userspace

From: Jens Axboe
Date: Fri May 31 2019 - 10:52:41 EST


On 5/16/19 2:57 AM, Roman Penyaev wrote:
> Hi all,
>
> This is v3 which introduces pollable epoll from userspace.
>
> v3:
> - Measurements made, represented below.
>
> - Fix alignment for epoll_uitem structure on all 64-bit archs except
> x86-64. epoll_uitem should always be 16 bytes; a proper BUILD_BUG_ON
> is added. (Linus)
>
> - Explicitly check pollflags for 0 inside the work callback, and do
> nothing if it is 0.
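The 16-byte size requirement on epoll_uitem can be pinned at compile time in userspace too. A minimal sketch, assuming a layout of two 32-bit event masks followed by 64-bit user data (the struct and field names below are hypothetical, not taken from the patch), with C11 static_assert standing in for the kernel's BUILD_BUG_ON:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mirror of a 16-byte ring item: two 32-bit event masks
 * followed by 64-bit user data. Names and field order are assumptions. */
struct uitem_sketch {
	uint32_t ready_events;	/* events marked ready by the producer */
	uint32_t events;	/* events the user registered interest in */
	uint64_t data;		/* opaque user data, like epoll_event.data */
};

/* Userspace analogue of BUILD_BUG_ON(): compilation fails if the item
 * is not exactly 16 bytes or the 64-bit field is not at offset 8. */
static_assert(sizeof(struct uitem_sketch) == 16, "item must be 16 bytes");
static_assert(offsetof(struct uitem_sketch, data) == 8, "data at offset 8");

/* Byte offset of item i inside a ring of such items (hypothetical). */
static size_t uitem_offset(size_t i)
{
	return i * sizeof(struct uitem_sketch);
}
```

With the size fixed, item i always sits at byte offset 16 * i, so a userspace consumer on any architecture indexes the ring the same way the kernel fills it.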
>
> v2:
> - No reallocations: the max number of items (and thus the size of the
> user ring) is specified by the caller.
>
> - Interface is simplified: -ENOSPC is returned on an attempt to add a new
> epoll item once the number of items has reached the max, nothing more.
>
> - Allocated pages are accounted using user->locked_vm and limited to
> the RLIMIT_MEMLOCK value.
>
> - EPOLLONESHOT is handled.
>
> This series introduces pollable epoll from userspace, i.e. the user
> creates an epfd with a new EPOLL_USERPOLL flag, mmaps the epoll
> descriptor, gets the header and ring pointers, and then consumes ready
> events from the ring, avoiding the epoll_wait() call. When the ring is
> empty, the user has to call epoll_wait() in order to wait for new events.
> epoll_wait() returns -ESTALE if the user ring still has events in it (a
> kind of indication that the user has to consume events from the user
> ring first; I could not invent anything better than returning -ESTALE).
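The consume-then-fall-back pattern described above can be illustrated with a generic single-producer/single-consumer ring. This is a minimal sketch under stated assumptions: the head/tail header, the fixed capacity, and the helper names are all hypothetical and are not the patch's actual layout; in the proposed design the kernel plays the producer role.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define RING_ITEMS 8	/* hypothetical power-of-two capacity */

/* Hypothetical shared header plus ring; the real layout may differ. */
struct ring_sketch {
	_Atomic uint32_t head;		/* next slot the consumer reads */
	_Atomic uint32_t tail;		/* next slot the producer writes */
	uint64_t items[RING_ITEMS];	/* stand-in for per-item user data */
};

/* Producer side (the kernel, in the proposed design). Returns -1 if the
 * ring is full, loosely mirroring the fixed-capacity -ENOSPC behaviour. */
static int ring_push(struct ring_sketch *r, uint64_t data)
{
	uint32_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
	uint32_t head = atomic_load_explicit(&r->head, memory_order_acquire);

	if (tail - head == RING_ITEMS)
		return -1;
	r->items[tail % RING_ITEMS] = data;
	/* Publish the item before advancing tail. */
	atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
	return 0;
}

/* Consumer side: drain up to max events into out[]. A return of 0 means
 * the ring is empty and the caller would fall back to epoll_wait(). */
static int ring_consume(struct ring_sketch *r, uint64_t *out, int max)
{
	uint32_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
	uint32_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);
	int n = 0;

	while (head != tail && n < max)
		out[n++] = r->items[head++ % RING_ITEMS];
	/* Hand the consumed slots back to the producer. */
	atomic_store_explicit(&r->head, head, memory_order_release);
	return n;
}
```

A consumer loop would call ring_consume() until it returns 0 and only then block in epoll_wait(); the -ESTALE return described above guards against blocking while events are still sitting in the ring.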
>
> For user header and user ring allocation I used vmalloc_user(). I found
> that it is much easier to reuse remap_vmalloc_range_partial() instead of
> dealing with the page cache (like aio.c does). What is also nice is that
> the virtual address is properly aligned on SHMLBA, thus there should not be
> any d-cache aliasing problems on archs with vivt or vipt caches.

Why aren't we just adding support to io_uring for this instead? Then we
don't need yet another entirely new ring that's just a little
different from what we have.

I haven't looked into the details of your implementation, just curious
if there's anything that makes using io_uring a non-starter for this
purpose?

--
Jens Axboe