Re: Timings for optimised poll(2)

Richard Gooch (rgooch@atnf.CSIRO.AU)
Tue, 26 Aug 1997 17:06:43 +1000


Rogier Wolff writes:
> Dean Gaudet wrote:
> >
> > On 25 Aug 1997, Michael O'Reilly wrote:
> >
> > > This is one of the reasons you use select(). It means you can scan a
> > > bit array very fast.
> >
> > and the kernel scans a normal array really slow underneath. Somewhere
> > someone pays the costs in the select and poll models.
> >
> > select() is O(n) where n is the max descriptor. poll() is O(m) where m is
>
> No:
>
> 1) m = O(n) in the cases that we are interested in.
> 2) The kernel builds a list of the selected-on fd's and indeed needs to
> poll the whole list to get the set of ready descriptors.
>
> > the number of descriptors. Both are linear in their inputs ... m <= n...
> > poll is generally more efficient at the kernel level since select is
> > implemented in terms of poll. Even in the cases where the outputs are
> > sparse and find_first_zero_bit becomes a win, you still had to pay for
> > O(n) operations inside select(). So timing things at the user level only
> > is totally bogus.
>
> My timing example shows that you can totally obliterate the "extra
> overhead" That Richard claims exists for using "select". He wrote some
> code that is 10000 times slower than mine, and then starts shouting
> that we need a different kernel-interface.

As Dean has already pointed out, if you want to check for activity on
a few fds which all have large numbers, then select(2) has to scan
through a large number of bits. So in this case poll(2) has less work
to do.

Example timings (Pentium 100):
1021 file descriptors (3-1023), check for activity on 1014-1023

Using a select(2) loop like this:
memcpy (&i_fds, &input_fds, sizeof i_fds);
memcpy (&o_fds, &output_fds, sizeof i_fds);
memcpy (&e_fds, &exception_fds, sizeof i_fds);
nready = select (max_fd + 1, &i_fds, &o_fds, &e_fds, &tv);
takes 362 microseconds.

Using a poll(2) loop like this:
nready = poll (pollfd_array + start_index, num_to_poll, 0);
takes 93 microseconds.

Now, if we change the test to do checks on descriptors 924-1023:
select(2) takes 1274 microseconds
poll(2) takes 1023 microseconds.

And now if we check descriptors 24-1023:
select(2) takes 3897 microseconds
poll(2) takes 4221 microseconds.

So we can see that poll(2) has a distinct advantage with modest
numbers of high-valued fds. It starts to loose when the number of fds
is increased, since it has to push more bits around than select(2).
The point of my poll2(2) proposal is that less bits are pushed, both
in kernel space as well as user space, compared with poll(2). poll(2)
maintains the advantage of poll(2) when watching modest numbers of
high-valued fds, but also reduces the cost when watching large numbers
of fds with only a few returning active status.

*However*, all this is trimming around the edges, because the basic
problem with the current implementation of polling inside the kernel
is that those indirect poll functions need to be called. If the poll
event mask is kept inside the file struct, then you can remove around
2 milliseconds from poll(2) or select(2) (for 1021 fds). And *this* is
what I want to see fixed first. I've run a number of tests on a
poll2(2) syscall that I'm developing, but in the end the improvements
are only minor in comparison to the biggie above. Another improvement
I've mentioned is that when you speficify a zero timeout, you
shouldn't manipulate the poll_table at all. Removing this saved
another 2 milliseconds.

With the two major improvements I've outlined above, we can check 1021
descriptors using poll(2) in 850 microseconds. *Then* we can argue
about select(2) vs. poll(2) vs. poll2(2). Until then, we're only
talking about a few percent here and there.

Regards,

Richard....