Re: epoll and multiple processes - eliminate unneeded process wake-ups

From: Madars Vitolins
Date: Wed Aug 05 2015 - 07:07:27 EST


Jason Baron @ 2015-08-04 18:02 wrote:
On 08/03/2015 07:48 PM, Eric Wong wrote:
Madars Vitolins <m@xxxxxxxxxxx> wrote:
Hi Folks,

I am developing a kind of open-systems application which uses
multiple processes/executables, each of which monitors some set
of resources (in this case POSIX queues) via the epoll interface. For
example, when 10 processes are in epoll_wait() on the same queue
and one message arrives, all 10 processes get woken up and all of
them try to read the message from the queue. One succeeds; the others
get an EAGAIN error. The problem is with those others, which generate
extra context switches, i.e. useless CPU usage. The more processes
there are, the greater the inefficiency.
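
For readers following along, the wake-up pattern described above can be sketched roughly as follows. This is only an illustration, not code from the application in question: the helper names (open_shared_queue, watch_queue, wait_and_receive) and the enum are made up here, and the sketch relies on the Linux-specific fact that a message queue descriptor is a file descriptor that epoll can monitor. Each worker process would open the same queue, build its own epoll set, and call wait_and_receive() in a loop; a LOST_RACE result is the wasted wake-up.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <mqueue.h>
#include <sys/epoll.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical result codes for one wake-up attempt. */
enum wake_result { NO_EVENT = -1, LOST_RACE = 0, GOT_MESSAGE = 1 };

/* Open (creating if needed) a non-blocking queue shared by all workers.
 * The name and limits are illustrative assumptions. */
mqd_t open_shared_queue(const char *name, long maxmsg, long msgsize)
{
    struct mq_attr attr = { .mq_maxmsg = maxmsg, .mq_msgsize = msgsize };
    return mq_open(name, O_CREAT | O_RDWR | O_NONBLOCK, 0600, &attr);
}

/* Create a per-process epoll set watching the queue for readability.
 * Linux-only: passing an mqd_t to epoll_ctl() is not portable. */
int watch_queue(mqd_t q)
{
    int epfd = epoll_create1(0);
    if (epfd < 0)
        return -1;
    struct epoll_event ev = { .events = EPOLLIN, .data = { .fd = (int)q } };
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, (int)q, &ev) < 0) {
        close(epfd);
        return -1;
    }
    return epfd;
}

/* Sleep in epoll_wait(), then try to take one message.
 * buflen must be at least the queue's mq_msgsize. */
int wait_and_receive(int epfd, mqd_t q, char *buf, size_t buflen, int timeout_ms)
{
    struct epoll_event ev;
    assert(buf != NULL);
    if (epoll_wait(epfd, &ev, 1, timeout_ms) <= 0)
        return NO_EVENT;          /* timeout or error: nothing to read */
    if (mq_receive(q, buf, buflen, NULL) >= 0)
        return GOT_MESSAGE;       /* this process won the race */
    if (errno == EAGAIN)
        return LOST_RACE;         /* woken up, but another process got it */
    return NO_EVENT;
}
```

On older glibc versions this needs to be linked with -lrt.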

I tried to use EPOLLONESHOT, but it did not help. It seems to be
suitable for multi-threaded applications, not for multi-process ones.

Correct. Most FDs are not shared across processes.

The ideal mechanism for this would be:
1. If multiple epoll sets in the kernel match the same event and one
or more processes are in epoll_wait(), then send the event to only
one waiter.
2. If none of the processes are waiting, then send the event to
all epoll sets (as is done currently). Then the first free process
will grab the event.

Jason Baron was working on this (search LKML archives for
EPOLLEXCLUSIVE, EPOLLROUNDROBIN, EPOLL_ROTATE)

However, I was unconvinced about modifying epoll.

Perhaps I may be more easily convinced about your mqueue case than his
case for listen sockets, though[*]


Yeah, so I implemented an 'EPOLL_ROTATE' mode, where you could have
multiple epoll fds (or epoll sets) attached to the same wakeup source,
and have the wakeups 'rotate' among the epoll sets. The wakeup
essentially walks the list of waiters, wakes up the first thread
that is actively in epoll_wait(), stops and moves the woken up
epoll set to the end of the list. So it attempts to balance
the wakeups among the epoll sets, I think in the way that you
were describing.
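
As a rough illustration of the walk-and-rotate behaviour described above, here is a small user-space model. It is not the kernel patch and does not use any real epoll API: the struct and function names are invented for this sketch, and it assumes a woken set goes straight back to waiting. What happens when nobody is waiting is left to the caller (the model just returns -1).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_SETS 16

/* Hypothetical model of one epoll set attached to a shared wakeup source. */
struct epoll_set_model {
    int id;
    bool in_epoll_wait;   /* true if a thread is blocked in epoll_wait() */
    unsigned wakeups;     /* how many times this set has been woken */
};

/* Hypothetical model of the waiter list on the wakeup source. */
struct wait_list_model {
    struct epoll_set_model *sets[MAX_SETS];
    size_t count;
};

/* Walk the list, wake the first set that is actively waiting, then move
 * that set to the tail so the next wakeup prefers a different set.
 * Returns the id of the woken set, or -1 if no set is waiting. */
int rotate_wakeup(struct wait_list_model *wl)
{
    assert(wl->count <= MAX_SETS);
    for (size_t i = 0; i < wl->count; i++) {
        struct epoll_set_model *s = wl->sets[i];
        if (!s->in_epoll_wait)
            continue;                       /* skip sets with no waiter */
        s->wakeups++;
        for (size_t j = i; j + 1 < wl->count; j++)
            wl->sets[j] = wl->sets[j + 1];  /* close the gap */
        wl->sets[wl->count - 1] = s;        /* woken set goes to the end */
        return s->id;
    }
    return -1;
}
```

With three sets that are all waiting, successive calls wake them in turn; a set that is busy (not in epoll_wait()) is skipped and keeps its place in the list.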

Here is the patchset:

https://lkml.org/lkml/2015/2/24/667

The test program shows how to use the API. Essentially, you
have to create a 'dummy' epoll fd with the 'EPOLL_ROTATE' flag,
which you then attach to your shared wakeup source and
then to your epoll sets. Please let me know if it's unclear.

Thanks,

-Jason

In my particular case I need to work with multiple processes/executables (not threads) listening on the same queues. This design lets a sysadmin manage those processes easily (start new ones for load balancing, or stop them without service interruption), and if any process dies for some reason (signal, core dump, etc.), the whole application is not killed; only one transaction is lost.

Recently I ran some tests and found that the kernel's epoll currently sends notifications to 4 processes (I think this comes from the EP_MAX_NESTS constant) waiting on the same resource; the other 6 from my example stay asleep. So it is not as bad as I thought before. It would be nice if EP_MAX_NESTS were configurable, but I guess 4 is fine too.

Jason, does your patch work for multi-process applications? How hard would it be to implement this for such a scenario?

Madars


Typical applications have few (probably only one) listen sockets or
POSIX mqueues; so I would rather use dedicated threads to issue
blocking syscalls (accept4 or mq_timedreceive).

Making blocking syscalls allows exclusive wakeups to avoid thundering
herds.

Do you think it would be realistic to implement this? What about
concurrency?
Can you please give me some hints on where in the code to start
implementing these changes?

For now, I suggest dedicating a thread in each process to do
mq_timedreceive/mq_receive, assuming you only have a small number
of queues in your system.
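
A minimal sketch of that dedicated-thread approach might look like the following. The struct, the function name and the message limit are assumptions made for illustration (a real receiver would loop until shutdown and hand each message to the application); only the mq_* and pthread calls are standard. The queue is opened without O_NONBLOCK, so the thread sleeps inside mq_receive() rather than in epoll_wait().

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <mqueue.h>
#include <pthread.h>
#include <stdlib.h>
#include <sys/types.h>

/* Hypothetical per-thread state for this sketch. */
struct receiver_state {
    const char *queue_name;  /* queue must already exist */
    int limit;               /* stop after this many messages (sketch only) */
    int received;            /* messages taken so far */
};

/* Thread body: block in mq_receive() until a message arrives. */
void *blocking_receiver(void *arg)
{
    struct receiver_state *st = arg;
    struct mq_attr attr;
    assert(st != NULL);

    mqd_t q = mq_open(st->queue_name, O_RDONLY);  /* blocking mode */
    if (q == (mqd_t)-1)
        return NULL;
    if (mq_getattr(q, &attr) < 0) {
        mq_close(q);
        return NULL;
    }
    /* The receive buffer must be at least mq_msgsize bytes. */
    char *buf = malloc((size_t)attr.mq_msgsize);
    if (buf == NULL) {
        mq_close(q);
        return NULL;
    }
    while (st->received < st->limit) {
        ssize_t len = mq_receive(q, buf, (size_t)attr.mq_msgsize, NULL);
        if (len < 0) {
            if (errno == EINTR)
                continue;    /* interrupted by a signal: retry */
            break;           /* any other error ends the sketch */
        }
        st->received++;      /* real code would process buf[0..len) here */
    }
    free(buf);
    mq_close(q);
    return NULL;
}
```

Each process would start one such thread per queue with pthread_create(). On older glibc versions this needs -lpthread and -lrt.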


[*] mq_timedreceive may copy a largish buffer which benefits from
staying on the same CPU as much as possible.
In contrast, accept4 only creates a client socket. With a C10K+
socket server (e.g. http/memcached/DB), a typical new client
socket spends a fair amount of time idle. Thus I don't believe
memory locality inside the kernel is much of a concern when there
are thousands of accepted client sockets.
