Re: Linux's implementation of poll() not scalable?

From: Lincoln Dale (ltd@cisco.com)
Date: Mon Oct 23 2000 - 09:06:14 EST


At 10:39 PM 23/10/2000 -0700, Linus Torvalds wrote:
>First, let's see what is so nice about "select()" and "poll()". They do
>have one _huge_ advantage, which is why you want to fall back on poll()
>once the RT signal interface stops working. What is that?

RT methods are bad if they consume too many resources. SIGIO is a good
example of this - the current overhead of passing events to user-space
incurs both a spinlock and a 512-byte memory copy for each event. while
it removes the requirement to "walk lists", the signal semantics in the
kernel and the overhead of the per-event copies to userspace negate much
of its performance benefit.
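
for reference, this is roughly how an application drives the RT-signal
interface being discussed -- a minimal sketch, with error handling and the
poll() fallback for signal-queue overflow omitted; every siginfo_t dequeued
below is one of the per-event copies mentioned above:

  /* arm per-fd readiness notification via a queued RT signal (F_SETSIG) */
  #define _GNU_SOURCE
  #include <fcntl.h>
  #include <signal.h>
  #include <stdio.h>
  #include <unistd.h>

  static int arm_rtsig(int fd, int signo)
  {
      if (fcntl(fd, F_SETOWN, getpid()) < 0)      /* deliver signals to us */
          return -1;
      if (fcntl(fd, F_SETSIG, signo) < 0)         /* use a queued RT signal */
          return -1;
      return fcntl(fd, F_SETFL, fcntl(fd, F_GETFL) | O_ASYNC | O_NONBLOCK);
  }

  static void drain_events(int signo)
  {
      sigset_t set;
      siginfo_t si;

      sigemptyset(&set);
      sigaddset(&set, signo);
      sigprocmask(SIG_BLOCK, &set, NULL);         /* queue, don't interrupt */

      for (;;) {
          if (sigwaitinfo(&set, &si) < 0)         /* one siginfo per event */
              break;
          printf("fd %d ready, events 0x%lx\n", si.si_fd, (long)si.si_band);
      }
  }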

that isn't to say that all "event-driven" methods are bad. in the past
year, i've done many experiments in making SIGIO more efficient.

some of these experiments include --
  [1] 'aggregate' events. that is, if you've already registered a POLL_IN,
      there's no need to register another POLL_IN (a sketch of this idea
      follows the list).
      this was marginally successful, but ultimately still didn't scale.

  [2] create a new interface for event delivery.
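
to make [1] concrete, here's a minimal sketch of the coalescing idea -- the
structure and names are mine for illustration, not the actual patch:

  /* per-fd bookkeeping: only queue a notification for an fd/event pair
   * if one isn't already pending, so a burst of POLL_INs on a busy
   * socket costs a single queue entry */
  #include <stdbool.h>

  struct fd_state {
      unsigned int pending;               /* bitmask of events already queued */
  };

  /* returns true if the caller should actually enqueue a notification */
  static bool note_event(struct fd_state *st, unsigned int event_bit)
  {
      if (st->pending & event_bit)
          return false;                   /* POLL_IN already queued: coalesce */
      st->pending |= event_bit;
      return true;
  }

  /* cleared once user-space has consumed the event and re-armed the fd */
  static void consume_event(struct fd_state *st, unsigned int event_bit)
  {
      st->pending &= ~event_bit;
  }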

for [2], i settled on a 16-byte structure sufficient to pass all of the
relevant information:
         typedef struct zerocopy_buf {
                 int fd;                      /* socket the event refers to */
                 short int cmd;               /* event/command type */
         #define ZEROCOPY_VALID_BUFFER 0xe1e2
                 short int valid_buffer;      /* magic: buf points at valid data */
                 void *buf; /* skbuff */
         #ifdef __KERNEL__
                 volatile
         #endif
                         struct zerocopy_buf *next;   /* next event in the chain */
         } zerocopy_buf_t;

so, we get down to 16 bytes per-event. these are allocated in kernel memory.

coupled with this was an interface whereby user-space could view
kernel-space (via a read-only mmap).
in my case, this allowed user-space to read the above chain of
zerocopy_buf events with no kernel-to-user memory copies.
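
the user-space side of that mapping could look something like this -- the
device name and window size here are assumptions of mine, purely for
illustration:

  #include <fcntl.h>
  #include <stdio.h>
  #include <sys/mman.h>

  #define ZC_WINDOW_SIZE (4 * 1024 * 1024)        /* hypothetical window size */

  int main(void)
  {
      int devfd = open("/dev/zerocopy", O_RDWR);  /* hypothetical char device */
      if (devfd < 0) {
          perror("open");
          return 1;
      }

      /* PROT_READ only: user-space may look at, but never write into,
       * the kernel's event/buffer area */
      const char *win = mmap(NULL, ZC_WINDOW_SIZE, PROT_READ,
                             MAP_SHARED, devfd, 0);
      if (win == MAP_FAILED) {
          perror("mmap");
          return 1;
      }
      printf("kernel window mapped read-only at %p\n", (const void *)win);
      return 0;
  }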

an ioctl() on a character driver asks the kernel for the head of the
current chain of zerocopy_buf structures. a similar ioctl() passes a chain
of instructions to the kernel (adding/removing events from notification)
and does other housekeeping.
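
continuing the sketch above (and reusing the zerocopy_buf_t definition),
those two ioctl()s might look like the following; the request codes, the
command structure and the offset convention are inventions of mine for
illustration, not the real interface:

  #include <sys/ioctl.h>

  #define ZC_IOC_GET_HEAD _IOR('z', 1, unsigned long)  /* fetch head of event chain */
  #define ZC_IOC_SUBMIT   _IOW('z', 2, unsigned long)  /* hand a command chain to the kernel */

  struct zc_cmd {                    /* add/remove an fd from notification */
      int fd;
      int op;                        /* e.g. ZC_ADD / ZC_REMOVE */
      struct zc_cmd *next;
  };

  /* ask the kernel for the head of the current zerocopy_buf chain; here
   * the head is assumed to come back as an offset into the read-only
   * window mapped earlier */
  static const zerocopy_buf_t *get_event_chain(int devfd, const char *win)
  {
      unsigned long offset = 0;

      if (ioctl(devfd, ZC_IOC_GET_HEAD, &offset) < 0)
          return NULL;
      return (const zerocopy_buf_t *)(win + offset);
  }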

since user-space had read-only visibility into the kernel's address-space,
one could then pick up skbuffs in userspace without the overhead of copies.

... and so-on.
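
i haven't shown how kernel pointers in the chain (next, buf) become usable
from user-space; one plausible way, assuming the driver reports the kernel
virtual address the window corresponds to, is a simple base-offset
translation -- again, an assumption of mine rather than the actual code:

  /* translate a kernel virtual address seen in the event chain into an
   * address inside the read-only mmap window (hypothetical scheme) */
  static unsigned long zc_kernel_base;     /* kernel vaddr of the window start */
  static const char *zc_user_base;         /* user vaddr of the window start */

  static const void *kptr_to_user(const void *kptr)
  {
      return zc_user_base + ((unsigned long)kptr - zc_kernel_base);
  }

  /* walk the event chain entirely in user-space: no copies, no syscalls */
  static void process_events(const zerocopy_buf_t *head)
  {
      const zerocopy_buf_t *ev;

      for (ev = head; ev; ev = ev->next ? kptr_to_user(ev->next) : NULL) {
          if ((unsigned short)ev->valid_buffer == ZEROCOPY_VALID_BUFFER) {
              /* ev->buf is the skbuff holding the packet; translate it
               * the same way before reading the data in place */
              const void *skb = kptr_to_user(ev->buf);
              (void)skb;
          }
      }
  }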

the above is a bit of a simplification of what goes on. using flip-buffers
of queues (a rough sketch appears below), this can be used from multiple
processes and remain SMP-safe without spinlocks or semaphores in the "fast
path". by solving the "walk the list of fd's" and "incur the overhead of
memory copies" problems, and tying this in with network hardware capable of
scatter/gather DMA and IP and TCP checksum offload, i've more than doubled
the performance of an existing application which depended on poll()-type
behaviour.
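
the flip-buffer idea, very roughly -- this single-file sketch ignores the
memory-ordering and wakeup details a real SMP implementation needs, and only
illustrates why the fast path can avoid locks:

  /* two queues: the kernel fills whichever is "active" while user-space
   * drains the other; the only shared word is the active index */
  struct zc_queue {
      zerocopy_buf_t *head;
      zerocopy_buf_t *tail;
  };

  static struct zc_queue queues[2];
  static volatile int active;              /* queue currently being filled */

  /* producer side (kernel): append to the active queue */
  static void zc_enqueue(zerocopy_buf_t *ev)
  {
      struct zc_queue *q = &queues[active];

      ev->next = NULL;
      if (q->tail)
          q->tail->next = ev;
      else
          q->head = ev;
      q->tail = ev;
  }

  /* consumer side: flip the active index, then drain the now-inactive
   * queue at leisure -- the producer has moved on to the other one */
  static zerocopy_buf_t *zc_flip_and_drain(void)
  {
      int mine = active;
      zerocopy_buf_t *chain;

      active = !mine;                      /* the "flip" */
      chain = queues[mine].head;
      queues[mine].head = queues[mine].tail = NULL;
      return chain;
  }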

while i agree that it isn't necessarily a 'generic' interface, and won't
necessarily appeal to everyone as the cure-all, the techniques used have
removed two significant bottlenecks to high-performance network i/o on
tens-of-thousands of TCP sockets for an application we've been working on.

cheers,

lincoln.




This archive was generated by hypermail 2b29 : Tue Oct 31 2000 - 21:00:13 EST