Re: [PATCH 1/1] eventfd new tag EFD_VPOLL: generate epoll events

From: Roman Penyaev
Date: Fri May 31 2019 - 07:52:24 EST


On 2019-05-31 12:45, Renzo Davoli wrote:
Hi Roman,

On Fri, May 31, 2019 at 11:34:08AM +0200, Roman Penyaev wrote:
On 2019-05-27 15:36, Renzo Davoli wrote:
> Unfortunately this approach cannot be applied to
> poll/select/ppoll/pselect/epoll.

If you have to override other system calls, what is the problem with
overriding the poll family? It will add, let's say, 50 extra lines of
complexity to your userspace code. All you need is to be woken up by *any*
event and check one mask variable in order to understand what you need to
do, read or write: basically exactly what you do in your eventfd
modification, but only in userspace.
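A minimal C sketch of the userspace-only mechanism described above, assuming a hypothetical user-space stack (struct ustack, ustack_signal() and ustack_take_events() are illustrative names, not an existing API): an ordinary eventfd serves purely as a doorbell that wakes epoll_wait(), while the real readiness lives in one atomic mask variable that the stack publishes before ringing.

```c
#include <stdatomic.h>
#include <stdint.h>
#include <sys/epoll.h>
#include <sys/eventfd.h>
#include <unistd.h>

/* Hypothetical user-space stack: the eventfd is only a doorbell,
 * the real readiness lives in 'ready'. */
struct ustack {
	int doorbell;           /* eventfd, registered for EPOLLIN */
	_Atomic uint32_t ready; /* EPOLLIN/EPOLLOUT bits set by the stack */
};

/* Stack side: publish the bits first, then ring the doorbell. */
static void ustack_signal(struct ustack *s, uint32_t bits)
{
	uint64_t one = 1;

	atomic_fetch_or(&s->ready, bits);
	if (write(s->doorbell, &one, sizeof(one)) < 0) {
		/* counter saturated or fd closed; the bits stay published */
	}
}

/* Poller side: drain the doorbell and take (and clear) the whole mask. */
static uint32_t ustack_take_events(struct ustack *s)
{
	uint64_t cnt;

	if (read(s->doorbell, &cnt, sizeof(cnt)) < 0) {
		/* EAGAIN: already drained by an earlier wake-up */
	}
	return atomic_exchange(&s->ready, 0);
}

/* Single-threaded demo: signal EPOLLIN, wake up, return the mask seen. */
static uint32_t demo_mask_wakeup(void)
{
	struct ustack s;
	struct epoll_event ev = { .events = EPOLLIN, .data.ptr = &s };
	int ep = epoll_create1(0);
	uint32_t mask = 0;

	s.doorbell = eventfd(0, EFD_NONBLOCK);
	atomic_init(&s.ready, 0);
	epoll_ctl(ep, EPOLL_CTL_ADD, s.doorbell, &ev);

	ustack_signal(&s, EPOLLIN);
	if (epoll_wait(ep, &ev, 1, 1000) == 1)
		mask = ustack_take_events(ev.data.ptr);

	close(ep);
	close(s.doorbell);
	return mask;
}
```

Publishing the bits before ringing the doorbell means the poller should find them once it wakes; a wake-up with nothing left to do simply yields an empty mask.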

This approach would not scale. If I want to use both a (user-space) network
stack and an (emulated) device (or more stacks and devices), which
(overridden) poll would I use?

The poll of the first stack is not able to deal with the third device.

Since each such stack has a set of read/write/etc. functions, you can always
extend your stack with another call which returns an event mask specifying
exactly what you have to do, e.g.:

	nfds = epoll_wait(epollfd, events, MAX_EVENTS, -1);
	for (n = 0; n < nfds; ++n) {
		struct sock *sock = events[n].data.ptr;
		/* Ask the stack what is really ready; use a separate variable
		 * so the 'events' array is not overwritten. */
		poll_t mask = sock->get_events(sock, &events[n]);

		if (mask & EPOLLIN)
			sock->read(sock);
		if (mask & EPOLLOUT)
			sock->write(sock);
	}


With such a virtual table you can mix all userspace stacks, and even normal
sockets, for which the 'get_events' function can be declared as:

	static poll_t kernel_sock_get_events(struct sock *sock, struct epoll_event *ev)
	{
		return ev->events;
	}
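The virtual-table idea can be sketched as a small self-contained C program. Everything here is hypothetical (struct endpoint, endpoint_ops, on_read/on_write, demo_mixed are illustrative names): a pipe stands in for a normal kernel socket, and an eventfd doorbell plus a recorded mask stands in for a user-space stack, both served by one epoll_wait() loop.

```c
#include <stdint.h>
#include <sys/epoll.h>
#include <sys/eventfd.h>
#include <unistd.h>

struct endpoint;

/* Hypothetical per-endpoint vtable; the names are illustrative only. */
struct endpoint_ops {
	uint32_t (*get_events)(struct endpoint *ep, const struct epoll_event *ev);
	void (*on_read)(struct endpoint *ep);
	void (*on_write)(struct endpoint *ep);
};

struct endpoint {
	const struct endpoint_ops *ops;
	int fd;            /* the fd registered with epoll */
	uint32_t pending;  /* user-space stack only: what is really ready */
	int reads, writes; /* counters so the demo can observe dispatch */
};

/* Kernel fd (pipe, socket): epoll's mask is already the truth. */
static uint32_t kernel_get_events(struct endpoint *ep, const struct epoll_event *ev)
{
	(void)ep;
	return ev->events;
}

/* User-space stack: drain the doorbell and hand back the recorded mask. */
static uint32_t ustack_get_events(struct endpoint *ep, const struct epoll_event *ev)
{
	uint64_t cnt;
	uint32_t mask = ep->pending;

	(void)ev;
	if (read(ep->fd, &cnt, sizeof(cnt)) < 0) {
		/* EAGAIN: doorbell already drained */
	}
	ep->pending = 0;
	return mask;
}

static void count_read(struct endpoint *ep) { ep->reads++; }
static void count_write(struct endpoint *ep) { ep->writes++; }

static const struct endpoint_ops kernel_ops = { kernel_get_events, count_read, count_write };
static const struct endpoint_ops ustack_ops = { ustack_get_events, count_read, count_write };

/* One dispatch loop for both kinds of endpoint. */
static void dispatch_once(int epfd)
{
	struct epoll_event events[8];
	int nfds = epoll_wait(epfd, events, 8, 1000);

	for (int n = 0; n < nfds; ++n) {
		struct endpoint *ep = events[n].data.ptr;
		uint32_t mask = ep->ops->get_events(ep, &events[n]);

		if (mask & EPOLLIN)
			ep->ops->on_read(ep);
		if (mask & EPOLLOUT)
			ep->ops->on_write(ep);
	}
}

/* Pipe with data to read + user-space stack that recorded "writable". */
static int demo_mixed(void)
{
	int p[2], ok;
	uint64_t one = 1;
	int epfd = epoll_create1(0);
	struct endpoint k = { &kernel_ops, -1, 0, 0, 0 };
	struct endpoint u = { &ustack_ops, eventfd(0, EFD_NONBLOCK), EPOLLOUT, 0, 0 };
	struct epoll_event ev = { .events = EPOLLIN };

	if (epfd < 0 || u.fd < 0 || pipe(p) < 0)
		return 0;
	k.fd = p[0];
	ev.data.ptr = &k;
	epoll_ctl(epfd, EPOLL_CTL_ADD, k.fd, &ev);
	ev.data.ptr = &u;
	epoll_ctl(epfd, EPOLL_CTL_ADD, u.fd, &ev);
	if (write(p[1], "x", 1) != 1 || write(u.fd, &one, sizeof(one)) != (ssize_t)sizeof(one))
		return 0;

	dispatch_once(epfd);
	ok = k.reads == 1 && k.writes == 0 && u.reads == 0 && u.writes == 1;
	close(p[0]);
	close(p[1]);
	close(u.fd);
	close(epfd);
	return ok;
}
```

demo_mixed() returns 1 when the pipe endpoint got exactly one read callback and the user-space endpoint exactly one write callback, even though epoll itself only reported EPOLLIN for the eventfd.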

Am I missing something?


> > Why can it not be less than 64?
> This is the implementation of 'write'. The 64 bits include the 'command'
> EFD_VPOLL_ADDEVENTS, EFD_VPOLL_DELEVENTS or EFD_VPOLL_MODEVENTS (in the most
> significant 32 bits) and the set of events (in the lowest 32 bits).
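To make the 64-bit layout concrete, a minimal C sketch follows. The command names and values are assumptions for illustration only: VPOLL_CMD_ADD/DEL/MOD are hypothetical stand-ins for EFD_VPOLL_ADDEVENTS, EFD_VPOLL_DELEVENTS and EFD_VPOLL_MODEVENTS, whose real values are not given here, and MOD is assumed to replace the whole mask.

```c
#include <stdint.h>
#include <sys/epoll.h>

/* Hypothetical command codes; the patch's real values may differ. */
enum { VPOLL_CMD_ADD = 1, VPOLL_CMD_DEL = 2, VPOLL_CMD_MOD = 3 };

/* Command in the most significant 32 bits, events in the lowest 32. */
static uint64_t vpoll_encode(uint32_t cmd, uint32_t events)
{
	return ((uint64_t)cmd << 32) | events;
}

static uint32_t vpoll_cmd(uint64_t word) { return (uint32_t)(word >> 32); }
static uint32_t vpoll_events(uint64_t word) { return (uint32_t)word; }

/* How a receiver might apply a decoded word to its current mask. */
static uint32_t vpoll_apply(uint32_t mask, uint64_t word)
{
	uint32_t ev = vpoll_events(word);

	switch (vpoll_cmd(word)) {
	case VPOLL_CMD_ADD: return mask | ev;
	case VPOLL_CMD_DEL: return mask & ~ev;
	case VPOLL_CMD_MOD: return ev;   /* assumed: replace the mask */
	default:            return mask; /* unknown command: no change */
	}
}
```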

Do you really need add/del/mod semantics? Userspace still has to keep the
mask somewhere, so you can have one simple command which does

	ctx->count = events;

in the kernel, so no masks and no games with bits are needed. That will
simplify the API.
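The single-command alternative can be modelled in user space alone, since EFD_VPOLL is only a proposed patch. This sketch assumes the kernel side just stores the whole mask (the ctx->count = events line above); struct vpoll_ctx, vpoll_set() and the user_add()/user_del() helpers are hypothetical names.

```c
#include <stdint.h>
#include <sys/epoll.h>

/* Model of the kernel side: the new mask simply overwrites the context. */
struct vpoll_ctx {
	uint64_t count;
};

static void vpoll_set(struct vpoll_ctx *ctx, uint32_t events)
{
	ctx->count = events; /* ctx->count = events; */
}

/* User space keeps its own copy of the mask, so "add" and "delete"
 * become local bit operations followed by one "set". */
struct vpoll_user {
	uint32_t mask;
	struct vpoll_ctx *ctx;
};

static void user_add(struct vpoll_user *u, uint32_t ev)
{
	u->mask |= ev;
	vpoll_set(u->ctx, u->mask);
}

static void user_del(struct vpoll_user *u, uint32_t ev)
{
	u->mask &= ~ev;
	vpoll_set(u->ctx, u->mask);
}

static uint64_t demo_set_only(void)
{
	struct vpoll_ctx ctx = { 0 };
	struct vpoll_user u = { 0, &ctx };

	user_add(&u, EPOLLIN);
	user_add(&u, EPOLLOUT);
	user_del(&u, EPOLLIN);
	return ctx.count; /* only EPOLLOUT remains */
}
```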

That is true, at the price of more complex code in user space.
Other system calls could also have been implemented as "set the value";
instead, they have ADD/DEL modification flags. I mean, for example,
sigprocmask (SIG_BLOCK, SIG_UNBLOCK, SIG_SETMASK), or even epoll_ctl.
While poll requires the program to keep the struct pollfd array stored
somewhere, epoll is more powerful and flexible, as different file
descriptors can be added and deleted by different modules/components.
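For comparison, the sigprocmask() modes mentioned above act as add, delete and set operations on the blocked-signal mask. This small C function exercises all three with SIGUSR1 and SIGUSR2 and restores the original mask afterwards; demo_sigprocmask_modes() is just an illustrative name.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <stddef.h>

/* SIG_BLOCK adds to the blocked set, SIG_UNBLOCK removes from it,
 * SIG_SETMASK replaces it. Returns 1 if every step behaved as expected. */
static int demo_sigprocmask_modes(void)
{
	sigset_t usr1, usr2, cur, saved;
	int ok = 1;

	sigemptyset(&usr1);
	sigaddset(&usr1, SIGUSR1);
	sigemptyset(&usr2);
	sigaddset(&usr2, SIGUSR2);

	sigprocmask(SIG_SETMASK, NULL, &saved); /* query: remember original */
	sigprocmask(SIG_SETMASK, &usr1, NULL);  /* set:    {USR1} */
	sigprocmask(SIG_BLOCK, &usr2, NULL);    /* add:    {USR1, USR2} */
	sigprocmask(SIG_SETMASK, NULL, &cur);
	ok &= sigismember(&cur, SIGUSR1) == 1 && sigismember(&cur, SIGUSR2) == 1;

	sigprocmask(SIG_UNBLOCK, &usr1, NULL);  /* delete: {USR2} */
	sigprocmask(SIG_SETMASK, NULL, &cur);
	ok &= sigismember(&cur, SIGUSR1) == 0 && sigismember(&cur, SIGUSR2) == 1;

	sigprocmask(SIG_SETMASK, &saved, NULL); /* restore */
	return ok;
}
```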

If I have two threads implementing the send and receive paths of a socket in a user-space

Eventually you will come up with such a lock anyway, to protect your TCP (or
whatever) state machine. Or do you have a real example where the read and
write paths can work completely independently?
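A minimal sketch of the situation under discussion, assuming a hypothetical user-space socket (struct usock, usock_update(), rx_path(), tx_path() are illustrative names) whose send and receive paths run in separate POSIX threads. Both paths read-modify-write the same readiness mask, so they serialize on one lock; once that lock exists, computing the full mask under it and issuing a single "set" is straightforward.

```c
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/epoll.h>

struct usock {
	pthread_mutex_t lock; /* also guards the protocol state machine */
	uint32_t mask;        /* readiness bits shared by both paths */
};

static void usock_update(struct usock *s, uint32_t set, uint32_t clear)
{
	pthread_mutex_lock(&s->lock);
	s->mask = (s->mask | set) & ~clear;
	/* a single "set the whole mask" write could be issued here,
	 * still under the lock, so it never publishes a stale mask */
	pthread_mutex_unlock(&s->lock);
}

static void *rx_path(void *arg)
{
	for (int i = 0; i < 10000; i++) {
		usock_update(arg, EPOLLIN, 0); /* data arrived */
		usock_update(arg, 0, EPOLLIN); /* data consumed */
	}
	usock_update(arg, EPOLLIN, 0);     /* leave data pending */
	return NULL;
}

static void *tx_path(void *arg)
{
	for (int i = 0; i < 10000; i++) {
		usock_update(arg, 0, EPOLLOUT); /* send buffer full */
		usock_update(arg, EPOLLOUT, 0); /* space available again */
	}
	return NULL;
}

static uint32_t demo_two_threads(void)
{
	struct usock s = { PTHREAD_MUTEX_INITIALIZER, 0 };
	pthread_t rx, tx;

	pthread_create(&rx, NULL, rx_path, &s);
	pthread_create(&tx, NULL, tx_path, &s);
	pthread_join(rx, NULL);
	pthread_join(tx, NULL);
	return s.mask;
}
```

Each path only touches its own bit and finishes with it set, so demo_two_threads() returns EPOLLIN | EPOLLOUT regardless of interleaving; without the mutex, concurrent updates of mask could lose bits.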

--
Roman