Re: epoll reporting events when it hasn't been asked to

From: Jamie Lokier
Date: Wed Apr 14 2004 - 16:49:34 EST


Dirk Morris wrote:
> From what I understand you're proposing to remove the fd from the set
> lazily instead of immediately.
> Which will save system calls in the cases were the HUP/ERR condition
> does not occur during the 'disabled' time.
>
> In my case, which you may choose to disregard, this condition is not
> irregular or in any way a special case.
> So the revision you have proposed is just an optimization.
> You could even use this same optimization with the disable feature
> (disable it lazily) and get even better performance with the same number
> of syscalls you proposed.

I don't think you would get any better performance even when HUP/ERR
conditions are commonplace.

A HUP condition means you cannot read & write from the fd any more, so
even though you may defer handling it in userspace, there's nothing to
be gained from disabling all epoll events after you receive the HUP:
instead of lazy disabling, you might as well delete the fd from epoll
as soon as you receive the HUP even though you don't want to handle it yet.
(Btw, the comment about HUP in net/ipv4/tcp.c:tcp_poll() is illuminating).

ERRs can occur many times while a socket is open so the algorithmic
efficiency is worth considering.

An ERR condition, at least on a socket, forces you to examine the
error before you can perform a further read or write. That's because
read and write operations will both check for pending error, so when
you know there's an error condition, you know that the next read or
write call is really a "tell me the error" call.

So, assuming you apply the lazy strategy, after you receive an ERR, in
principle you could decide that you want to do nothing with it until
the next IN or OUT which represents non-error data readiness, and then
you will examine the error code (by doing a read or write call) and
then read or write actual data. Then indeed being able to ignore just
ERR and still listen for IN and/or OUT would make a difference. But
you might as well just read the error condition using MSG_ERRQUEUE,
and spend your one system call that way instead - you can still defer
the processing of the error code.

There is a situation where that is algorithmically not as good as
being able to ignore just ERR: The example is when you are receiving a
malicious flood of ICMP packets which cause lots of error conditions
on a UDP socket (or other similar things), and you want to ignore all
of those while efficiently handling a lower rate of non-error data
transfer. That's a very unusual situation and it doesn't occur except
under attack circumstances with UDP (because real error ICMPs are a
response to something you transmitted yourself).

Other socket types or devices might give ERR a different meaning which
causes them to be common relative to read and write readiness. If so,
they probably shouldn't.

> I see no downside, except that it no longer conforms to the semantics of
> poll and select.
> Whether or not its worth it to deviate from this behavior over such a
> detail, I don't know. :)

I see two downsides. They're not performance downsides, just practical:

1. Currently, you can implement epoll in terms of poll(), if you
have an epoll-based program and want to create epoll emulation
functions for running on an old kernel. If epoll were extended
to permit ignoring HUP/ERR, that would no longer be possible.

2. Perhaps most programs will use a flexible library like libevent
or something made for the program. It's possible that library
will offer an API which sends the POLLIN/OUT/ERR/HUP bits to the
application, and lets the application programmer interpret those
bits in whatever way is appropriate. If it becomes easy to
ignore HUP when the library works with epoll, applications may
accidentally end up depending on that, and will unexpectedly fail
when they are run one day on an older system, or even another OS
where the same library works by calling poll() or select().

Summary: epoll is fine the way it is. Imho.

-- Jamie
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/