Re: [PATCH] /dev/epoll update ...

From: Christopher K. St. John (cks@distributopia.com)
Date: Wed Sep 19 2001 - 18:24:41 EST


Davide Libenzi wrote:
>
> 1) select()/poll();
> 2) recv()/send();
>
> vs :
>
> 1) if (recv()/send() == FAIL)
> 2) ioctl(EP_POLL);
>
> When there's no data/tx buffer full these will result in 2 syscalls while
> if data is available/tx buffer ok the first method will result in 2 syscalls
> while the second will never call the ioctl().
> It looks very linear to me, with select()/poll() you're asking for a state while
> with /dev/epoll you're asking for a state change.
>

 Ok, if we're just disagreeing about the best api,
then I can live with that. But it appears we're
talking at cross-purposes, so I want to try this one
more time. I'll lay my though processes out in detail,
and you can tell me at which step I'm going wrong:

 Normally, you'd spend most of your time sitting in
ioctl(EP_POLL) waiting for something to happen. So
that's one syscall.

 If you get an event that indicates you can accept()
a new connection, then you do an accept(). Assume it
succeeds. That's two syscalls. Then you register
interest in the fd with a write to /dev/poll, that's
three.

 With the current /dev/epoll, you must try to read()
the new socket before you go back to ioctl(EP_POLL),
just in case there is data available. You expect
there isn't, but you have to try. This is the step
I'm talking about. That's four.

 Assume data was not available, so you loop back
to ioctl(EP_POLL) and wait for an event. That's five
syscalls. The event comes in, you do another read()
on the socket, and probably get some data. That's
six syscalls to finally get your data.

 ioctl(kpfd, EP_POLL) 1 wait for events
 s = accept() 2 accept a new socket
 write(kpfd, s) 3 register interest
 n = read(s) 4 <-- annoying test-read
 ioctl(kpfd, EP_POLL) 5 wait for events
 n = read(s) 6 get some data

 You have a similiar problem with write's, but I'm
guessing it's safe to assume the first write will
always succeed, so it's awkward but not a big
problem.

 If /dev/epoll tested the initial state of the socket,
then there would be no need for the test read:

 ioctl(kpfd, EP_POLL) 1 wait for events
 s = accept() 2 accept a new socket
 write(kpfd, s) 3 register interest
 ioctl(kpfd, EP_POLL) 4 wait for events
 n = read(s) 5 get some data

 So, we've saved a syscall and, perhaps more importantly,
we don't have to keep a list of to-be-read-just-in-case
fd's sitting around. I wouldrather make this a "clean
api" argument than a performance argument, since it's
unclear that there is really any significant speed
difference in practice.

 Note that the number of unnecessary syscalls is much
greater than 20%, since on a heavily loaded server, you
could be doing 1000's of unecessary reads for every
ioctl(EP_POLL).

 On a fast local network you'd expect the test reads
to mostly return something, so it's no big deal. But
if you've got 10k very slow connections...

 There's a good summary of the problem in the Banga,
Mogul and Druschel[1] paper at:

  http://citeseer.nj.nec.com/banga99scalable.html

 Page 5, right hand column, third paragraph.

 By the way, thanks for the patch. I know I've been
complaining about it, but I wouldn't have bothered
unless I thought it was a good thing. I appreciate
your taking the time to write and release it.

-- 
Christopher St. John cks@distributopia.com
DistribuTopia http://www.distributopia.com
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



This archive was generated by hypermail 2b29 : Sun Sep 23 2001 - 21:00:34 EST