Re: Unifying epoll,aio,futexes etc. (What I really want from epoll)

From: Jamie Lokier (lk@tantalophile.demon.co.uk)
Date: Thu Oct 31 2002 - 18:02:15 EST


Davide, I think you are right. That's why I said epoll was _nearly_ perfect :)

Davide Libenzi wrote:
> Jamie, the fact that epoll supports a limited number of "objects" was an
> as-designed at that time. I see it quite easy to extend it to support
> other objects. Futexes are a matter of one line of code int :

Agreed - though I'd prefer if the overhead of creating a temporary fd
for each futex were eliminated, as well as the potentially large fd
table. (In a threaded app, it's reasonable to have many more futexes
than sockets, and they are created and destroyed rapidly too. No data
on how many of those futexes need to be registered, though).

In other words, add another op to sys_futex() called FUTEX_EPOLL which
directly registers the futex on an epoll interest list, and let epoll
report those events as futex events.

(I suspect that is quite easy too).

> Timer, as long as you access them through a file* interface ( like futexes )
> will become trivial too. Another line should be sufficent for dnotify :

Sorry (<humble/>), ignore timers. Somehow I picked up the idea that
epoll_wait() didn't have a timeout from some example or other, which
was very silly of me. I've read the patch properly now! Of course
epoll supports timers - a timeout is quite enough for user space.

> void __inode_dir_notify(struct inode *inode, unsigned long event)

Agreed. This is looking good :)

It's lucky that polling for readability on a directory is not useful
in any other way, though :)

The semantics for this are a bit confusing and inconsistent with
poll(). User gets POLL_RDNORM event which means something in the
directory has changed, not that the directory is now readable or that
poll() would return POLL_RDNORM. It really should be a different
flag, made for the purpose.

> This is the result of a quite quick analysis, but I do not expect it to be
> much more difficult than that.

Someone suggested hooking into ->poll() as a less obtrusive way to
implement epoll. You're probably right that it's quicker to hook
directly as you have done, although it would be great if epoll could
somehow "fall back" to using ->poll() in the cases where epoll isn't
directly supported by a file object.

I wrote quite a lot about futexes above. That's because good futex
support, and fallback to ->poll() would pretty much make epoll
universal. What do you think of these ideas?:

   1. Add FUTEX_EPOLL operation to futex.c, which registers a futex
      with an epoll interest list. This would cause FUTEX_WAKE
      calls on that address to generate epoll events. Some care is
      needed here to keep track of the exact number of events generated,
      because some rather subtle usages of futex depend on the
      return value from futex_wake being the _exact_ number of waiters
      that are woken. It would have to correspond to the exact number
      of events counted by userspace.

   2. Add a check to EP_CTL_ADD which checks whether a file supports
      epoll notifications natively. Perhaps a file_operations hook
      is in order here. If it does, great. If not, fall back to
      a generic mechanism that uses the file's ->poll() method. (I
      haven't thought through for sure how plausible this is).
      Magically, every kind of fd works, including special devices,
      and the things that are most performance critical (sockets,
      pipes, futexes) are tuned. Yum!

   3. Eliminate send_sigio() calls - pass all events to epoll, and let
      epoll dispatch signals where they have been registered. In
      combination with (2), this magically provides SIGIO support for
      all fd types as well.

   4. Merge the aio and epoll event reporting functions: io_getevents
      and epoll_wait are remarkably similar, and should really be one
      function. It would introduce binary incompatibility somewhere though.

There are a few cherries to go on top but I don't want to make this
email any longer. Those are the essentials :)

-- Jamie

ps. Falling back to ->poll also means you can make trees of
notifications, like Alan suggested :)

pps. I much prefer epoll's use of an fd for the interest list than
aio's aio_context_t.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/



This archive was generated by hypermail 2b29 : Thu Oct 31 2002 - 22:00:57 EST