Re: Linux's implementation of poll() not scalable?

From: Linus Torvalds (torvalds@transmeta.com)
Date: Tue Oct 24 2000 - 00:39:36 EST


On Mon, 23 Oct 2000, Linus Torvalds wrote:
>
> > What is your favourite interface then ?
>
> I suspect a good interface that can easily be done efficiently would
> basically be something where the user _does_ do the equivalent of a
> read-only mmap() of poll entries - and explicit and controlled
> "add_entry()" and "remove_entry()" controls, so that the kernel can
> maintain the cache without playing tricks.

Actually, forget the mmap, it's not needed.

Here's a suggested "good" interface that would certainly be easy to
implement, and very easy to use, with none of the scalability issues that
many interfaces have.

First, let's see what is so nice about "select()" and "poll()". They do
have one _huge_ advantage, which is why you want to fall back on poll()
once the RT signal interface stops working. What is that?

Basically, RT signals or any kind of event queue has a major fundamental
queuing theory problem: if you have events happening really quickly, the
events pile up, and queuing theory tells you that as you start having
queuing problems, your latency increases, which in turn tends to mean
that later events are even more likely to queue up, and you end up in a
nasty meltdown scenario where your queues get longer and longer.

This is why RT signals suck so badly as a generic interface - clearly we
cannot keep sending RT signals forever, because we'd run out of memory
just keeping the signal queue information around.

Neither poll() nor select() has this problem: they don't get more
expensive as you have more and more events - their cost is a function of
the number of file descriptors, not the number of events per se. In fact,
both poll() and select() tend to perform _better_ when you have pending
events, as they are both amenable to optimizations when there is no need
for waiting, and scanning the arrays can use early-out semantics.

So sticky arrays of events are good, while queues are bad. Let's take that
as one of the fundamentals.

So why do people still like RT signals? They do have one advantage, which
is that you do NOT have that silly array traversal when there is nothing
to do. Basically, the RT signal approach is really good for exactly the
cases where select() and poll() suck: no need to traverse mostly empty and
non-changing arrays all the time.

It boils down to one very simple rule: dense arrays of sticky status
information are good. So let's design a good interface for a dense array.

Basically, the perfect interface for events would be

        struct event {
                unsigned long id; /* file descriptor ID the event is on */
                unsigned long event; /* bitmask of active events */
        };

        int get_events(struct event * event_array, int maxnr, struct timeval *tmout);

where you say "I want an array of pending events, and I have an array you
can fill with up to 'maxnr' events - and if you have no events for me,
please sleep until you get one, or until 'tmout'".

The above looks like a _really_ simple interface to me. Much simpler than
either select() or poll(). Agreed?

Now, we still need to inform the kernel of what kind of events we want, ie
the "binding" of events. The most straightforward way to do that is to
just do a simple "bind_event()" system call:

        int bind_event(int fd, struct event *event);

which basically says: I'm interested in the events in "event" on the file
descriptor "fd". The way to stop being interested in events is to just set
the event bitmask to zero.
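
So, just as a sketch, unbinding would look like this ("fd" here is
whatever descriptor you bound earlier):

        struct event ev;

        ev.id = 0;              /* ignored when unbinding */
        ev.event = 0;           /* empty bitmask: no longer interested */
        bind_event(fd, &ev);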

Now, the perfect interface would be the above. Nothing more. Nothing
fancy, nothing complicated. Only really simple stuff. Remember the old
rule: "keep it simple, stupid".

The really nice part of the above is that it's trivial to implement. It's
about 50 lines of code, plus some simple logic in various drivers etc. to
actually report the events. The way to do this simply is to limit it in
very clear ways, the most notable one being simply that there is only one
event queue per process (or rather, per "struct files_struct" - so
threads would automatically share the event queue). This keeps the
implementation simple, but it's also what keeps the interfaces simple: no
queue IDs to pass around etc.

Implementation is straightforward: the event queue basically consists of

 - a queue head in "struct files_struct", initially empty.

 - doing a "bind_event()" basically adds a fasync entry to the file
   structure, but rather than causing a signal, it just looks at whether
   the fasync entry is already linked into the event queue, and if it
   isn't (and the event is one of the ones in the event bitmask), it adds
   itself to the event queue.

 - get_events() just traverses the event queue and fills in the array,
   removing entries from the event queue as it goes. End of story. If the
   event queue is empty, it trivially sees that in a single line of code
   (+ timeout handling)

Advantage: everything is O(1), except for "get_events()" which is O(n)
where 'n' is the number of active events that it fills in.
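
To make that concrete, here's a rough userspace sketch of the queue
logic - the names and types are made up for illustration, and the real
thing would hang the entries off the fasync list and "struct
files_struct" (it uses the "struct event" from above):

        struct event_entry {
                unsigned long id;               /* user-supplied cookie */
                unsigned long mask;             /* events we were asked about */
                unsigned long pending;          /* events that have fired */
                struct event_entry *next;       /* event queue linkage */
                int queued;                     /* already on the queue? */
        };

        static struct event_entry *head;
        static struct event_entry **tail = &head;

        /* Driver side: an event fired on this entry. O(1). */
        static void post_event(struct event_entry *e, unsigned long events)
        {
                e->pending |= events & e->mask;
                if (e->pending && !e->queued) {
                        e->queued = 1;
                        e->next = NULL;
                        *tail = e;
                        tail = &e->next;
                }
        }

        /* get_events() side: drain the queue. O(events returned). */
        static int drain_events(struct event *buf, int maxnr)
        {
                int n = 0;

                while (head && n < maxnr) {
                        struct event_entry *e = head;
                        head = e->next;
                        if (!head)
                                tail = &head;
                        e->queued = 0;
                        buf[n].id = e->id;
                        buf[n].event = e->pending;
                        e->pending = 0;
                        n++;
                }
                return n;
        }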

Basically, I can't see that it can be done much simpler, and I don't see
that you can have much better performance.

Example "server":

#define MAXEV (32)

        struct event ev;

        .. create socket..
        listen(sock, 5);

        /*
         * start out with only the listening socket
         * as an "event"
         */
        ev.id = LISTEN_ID;
        ev.event = POLLIN;
        bind_event(sock, &ev);

        /* Loop forever, handling the events */
        for (;;) {
                static struct event ev_list[MAXEV];
                int events = get_events(ev_list, MAXEV, NULL);
                struct event *ev = ev_list;

                do {
                        unsigned long id = ev->id;
                        unsigned long mask = ev->event;
                        ev++;
                        /* Call the event handler */
                        handle[id](id, mask);
                } while (--events);
        }

The interesting part would be what the "handle[]" function pointer array
would do, and for the listen socket it would simply do something like
accepting every pending connection and then adding an event binding to it:

        void listen_handle(unsigned long id, unsigned long mask)
        {
                int fd;

                while ((fd = accept(listen_sock, NULL, NULL)) >= 0) {
                        struct event ev;

                        ev.id = RDEVENT;
                        ev.event = ...;
                        bind_event(fd, &ev);
                }
        }

You get the idea. Very simple, and it looks like it should perform quite
admirably. With none of the complexities of signal handling, and none of
the downsides of select() or poll(). Both of the new system calls would be
on the order of 20-30 lines of code (along with the infrastructure to
extend the fasync stuff to also be able to handle events).

Yes, I'm probably missing something. But this is the kind of thing I think
we should look at (and notice that you can _emulate_ this with poll(), so
you can use this kind of interface even on systems that wouldn't support
these kinds of events natively).
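
Just to show how trivial the emulation is, something like this would do
it on top of plain poll() - the fixed-size table is purely illustrative,
and again this uses the "struct event" from above:

        #include <poll.h>
        #include <sys/time.h>

        #define MAXFD (1024)

        static struct pollfd pfd[MAXFD];
        static unsigned long id[MAXFD];
        static int nfd;

        int bind_event(int fd, struct event *event)
        {
                int i;

                /* Re-bind, or unbind on an empty mask, if fd is known */
                for (i = 0; i < nfd; i++) {
                        if (pfd[i].fd != fd)
                                continue;
                        if (!event->event) {
                                nfd--;
                                pfd[i] = pfd[nfd];
                                id[i] = id[nfd];
                        } else {
                                pfd[i].events = event->event;
                                id[i] = event->id;
                        }
                        return 0;
                }
                if (nfd >= MAXFD)
                        return -1;
                pfd[nfd].fd = fd;
                pfd[nfd].events = event->event;
                id[nfd] = event->id;
                nfd++;
                return 0;
        }

        int get_events(struct event *event_array, int maxnr, struct timeval *tmout)
        {
                int i, n = 0;
                int ms = tmout ? tmout->tv_sec * 1000 + tmout->tv_usec / 1000 : -1;

                if (poll(pfd, nfd, ms) <= 0)
                        return 0;
                for (i = 0; i < nfd && n < maxnr; i++) {
                        if (!pfd[i].revents)
                                continue;
                        event_array[n].id = id[i];
                        event_array[n].event = pfd[i].revents;
                        n++;
                }
                return n;
        }

Of course, the emulation still pays poll()'s cost of scanning all the
descriptors - the native interface is what gets rid of that.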

                Linus
