Re: Linux's implementation of poll() not scalable?

From: Mitchell Blank Jr (
Date: Tue Oct 24 2000 - 09:09:23 EST

Linus Torvalds wrote:
> Here's a suggested "good" interface that would certainly be easy to
> implement, and very easy to use, with none of the scalability issues that
> many interfaces have.

I think everyone should take a timeout and look at Solaris 8's /dev/poll
interface. This discussion is reinventing the wheel, the lever, and the
inclined plane.

I think this is a lot cleaner than your approach:
  * it doesn't add extra syscalls

  * it handles multiple event queues, and does so without ugliness.
    all the per-queue state is held in the /dev/poll's "struct file"

  * in your method you have a per-process queue - but under what clone()
    conditions is it shared among siblings? Here the user has a choice -
    they can share the open("/dev/poll") file descriptor or not using
    any of the existing means. Granted, they also would probably want
    to arrange to share the fd's being polled in order to make this

  * No new fields in task_struct

A few simple improvements can be made to the Sun model, though:

  * The fact that the fd of /dev/poll can't be used for poll(2) or select(2)
    is ugly. Sure, you probably don't want an open instance of /dev/poll
    being added to another /dev/poll, but being able to call "select" on
    them would be really useful:
      1. Imagine a server that has to process connections from both
         high-priority and low-priority clients - and that requests from
         the high-priority ones always take precedence. With this
         interface you could easily have two open instances of /dev/poll
         and then call select(2) on them. This ability just falls
         out naturally from the interface.
      2. Some libraries are internally driven by select(2) loops (I think
         Xlib is like this, IIRC). If you have a lot of fd's you want to
         watch, this means you must take the hit of calling select(2) on
         all of them. If you could just pass in a fd for /dev/poll,
         problem solved.

  * I think the fact that you add events via write(2) but retrieve them
    via ioctl(2) is an ugly asymmetry. From what I can see, the entire
    motivation for using ioctl as opposed to read(2) is to allow the user
    to specify a timeout. If you could use poll(2)/select(2) on /dev/poll
    this issue would be moot (see above)

  * It would be nice if the interface were flexible enough to report
    items _other_ than "activity on fd" in the future. Maybe SYSV IPC?
    itimers? directory update notification? It seems that every time
    UNIX adds a mechanism of waiting for events, we spoil it by not
    making it flexible enough to wait on everything you might want.
    Let's not repeat the mistake with a new interface.

  * The "struct pollfd" and "struct dvpoll" should also include a 64-bit
    opaque quantity supplied by userland which is returned with each event
    on that fd. This would save the program from having to look up
    which connection context goes with each fd - the kernel could just
    give you the pointer to the structure. Not having this capability isn't
    a burden for programs dealing with a small number of fd's (since they
    can just have a lookup table) but if you potentially have 10000's of
    connections it may be undesirable to allocate an array for them all.
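
To make the two-queue priority idea (item 1 above) concrete, here is a
rough sketch of the dispatch step. It assumes the improvement proposed
above, i.e. that a /dev/poll descriptor can itself be handed to poll(2);
that is not true of the Sun interface as it stands. The names
next_queue, high_fd and low_fd are made up for illustration, and any
readable descriptor (a pipe, say) can stand in for an open /dev/poll
instance when trying it out.

```c
#include <assert.h>
#include <poll.h>
#include <unistd.h>

/* Which event queue should be serviced next. */
enum ready_queue { QUEUE_NONE = 0, QUEUE_HIGH = 1, QUEUE_LOW = 2 };

/*
 * Wait up to timeout_ms for either queue descriptor to become readable.
 * The high-priority queue is reported whenever it is ready, even if the
 * low-priority queue is ready as well.  Returns QUEUE_NONE on timeout
 * or on a poll(2) error.
 *
 * high_fd and low_fd are assumed to be two open instances of /dev/poll
 * (hypothetical: this relies on such descriptors being pollable).
 */
static enum ready_queue next_queue(int high_fd, int low_fd, int timeout_ms)
{
    struct pollfd pfd[2] = {
        { .fd = high_fd, .events = POLLIN },
        { .fd = low_fd,  .events = POLLIN },
    };

    assert(high_fd != low_fd);

    if (poll(pfd, 2, timeout_ms) <= 0)
        return QUEUE_NONE;
    if (pfd[0].revents & POLLIN)
        return QUEUE_HIGH;      /* always drain high priority first */
    if (pfd[1].revents & POLLIN)
        return QUEUE_LOW;
    return QUEUE_NONE;
}
```

A server loop would call next_queue() and then dequeue events only from
the instance it names, so low-priority clients get serviced only when
nothing is pending on the high-priority queue.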

The Linux patch for /dev/poll implements mmap'ing of the in-kernel poll
table... I don't think this is a good idea. First, the user just wants to
be able to add events and dequeue them - both linear operations. Second,
why should the kernel be forced to maintain a fixed-size linear list of
events we're looking for... this seems mindlessly inefficient. Why not
just pull a "struct pollfd" out of slab each time a new event is listened

My unresolved concerns:
  * Is this substantially better than the already existing rtsig+poll
    solution? Enough to make implementing it worthwhile?

  * How do we quickly do the "struct file" -> "struct pollfd" conversion
    each time an event happens? Especially if there are multiple /dev/poll
    instances open in the current process. Probably each "struct file"
    would need a pointer to the instance of /dev/poll which would have
    some b-tree variant (or maybe a hash table). The limitation would
    be that a single fd couldn't be polled for events by two different
    /dev/poll instances, even for different events. This is probably
    good for sanity's sake anyway.
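
For the lookup question above, the hash-table option might look roughly
like the following. This is a userland sketch only, not kernel code: it
keys on the fd number where the kernel would key on the "struct file",
malloc() stands in for a slab allocation, and every name here
(interest_table, interest_add, INTEREST_BUCKETS and so on) is invented
for the example.

```c
#include <assert.h>
#include <poll.h>
#include <stdlib.h>

#define INTEREST_BUCKETS 256    /* hypothetical size; must be a power of two */

/* One registered interest, allocated individually per watched fd. */
struct interest {
    struct pollfd pfd;
    struct interest *next;      /* chain within one hash bucket */
};

/* Per-instance state: what one open /dev/poll would hang off its file. */
struct interest_table {
    struct interest *bucket[INTEREST_BUCKETS];
};

static unsigned bucket_of(int fd)
{
    return (unsigned)fd & (INTEREST_BUCKETS - 1);
}

/* Return the entry for fd, or NULL if fd is not being watched. */
static struct interest *interest_find(struct interest_table *t, int fd)
{
    struct interest *i;

    for (i = t->bucket[bucket_of(fd)]; i != NULL; i = i->next)
        if (i->pfd.fd == fd)
            return i;
    return NULL;
}

/* Watch fd for events, replacing any earlier mask.  0 on success, -1 if out of memory. */
static int interest_add(struct interest_table *t, int fd, short events)
{
    struct interest *i;

    assert(fd >= 0);
    i = interest_find(t, fd);
    if (i == NULL) {
        i = malloc(sizeof *i);  /* stand-in for pulling an entry out of slab */
        if (i == NULL)
            return -1;
        i->pfd.fd = fd;
        i->pfd.revents = 0;
        i->next = t->bucket[bucket_of(fd)];
        t->bucket[bucket_of(fd)] = i;
    }
    i->pfd.events = events;
    return 0;
}

/* Stop watching fd.  0 on success, -1 if fd was not being watched. */
static int interest_remove(struct interest_table *t, int fd)
{
    struct interest **pp;

    for (pp = &t->bucket[bucket_of(fd)]; *pp != NULL; pp = &(*pp)->next) {
        if ((*pp)->pfd.fd == fd) {
            struct interest *dead = *pp;
            *pp = dead->next;
            free(dead);
            return 0;
        }
    }
    return -1;
}
```

Each entry is allocated on its own, so there is no fixed-size linear
table to maintain, and finding the entry for an event costs one short
bucket walk rather than a scan of every watched fd.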

