Re: epoll and multiple processes - eliminate unneeded process wake-ups

From: Jason Baron
Date: Mon Nov 30 2015 - 14:45:47 EST


Hi Madars,

On 11/28/2015 05:54 PM, Madars Vitolins wrote:
> Hi Jason,
>
> I recently ran tests with multiprocessing and epoll() on POSIX queues.
> You were right about "EP_MAX_NESTS": it is not related to how many
> processes are woken up when multiple processes are in epoll_wait() on
> one event source.
>
> With epoll, every process is added to the wait queue of every monitored
> event source. Thus when a message is sent to some queue (for example), all
> processes polling on it are woken up during the mq_timedsend() ->
> __do_notify() -> wake_up(&info->wait_q) kernel processing.
>
> So for one message to be processed by only one of the processes in
> epoll_wait(), the process must be added to the event source's wait queue
> with the exclusive flag set.
>
> I could create a kernel patch adding a new EPOLLEXCL flag, which would
> result in the following functionality:
>
> - fs/eventpoll.c
> ================================================================================
>
> /*
>  * This is the callback that is used to add our wait queue to the
>  * target file wakeup lists.
>  */
> static void ep_ptable_queue_proc(struct file *file, wait_queue_head_t *whead,
>                                  poll_table *pt)
> {
>         struct epitem *epi = ep_item_from_epqueue(pt);
>         struct eppoll_entry *pwq;
>
>         if (epi->nwait >= 0 && (pwq = kmem_cache_alloc(pwq_cache, GFP_KERNEL))) {
>                 init_waitqueue_func_entry(&pwq->wait, ep_poll_callback);
>                 pwq->whead = whead;
>                 pwq->base = epi;
>
>                 if (epi->event.events & EPOLLEXCL) { /* <<<< New functionality here!!! */
>                         add_wait_queue_exclusive(whead, &pwq->wait);
>                 } else {
>                         add_wait_queue(whead, &pwq->wait);
>                 }
>                 list_add_tail(&pwq->llink, &epi->pwqlist);
>                 epi->nwait++;
>         } else {
>                 /* We have to signal that an error occurred */
>                 epi->nwait = -1;
>         }
> }
> ================================================================================
>
>
> After testing with EPOLLEXCL set in my multiprocessing application
> framework (now it is open source: http://www.endurox.org/ :) ), the results
> were good: there were no extra wakeups, and thus more efficient processing.
>

Cool. If you have any performance numbers to share, that would help make
the case.

> Jason, do you think mainline would accept such a patch with the new flag?
> Or are there any concerns about this? Also, this means the new flag would
> need to be added to the GNU C Library (/usr/include/sys/epoll.h).
>

This has come up several times, so imo it would be a reasonable
addition - but I'm only speaking for myself.
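
For reference, userspace usage would look roughly like the sketch below.
The EPOLLEXCL name and the (1 << 28) value are just the ones proposed in
this thread - nothing is merged - so treat both as assumptions:

/* Hypothetical userspace usage of the proposed EPOLLEXCL flag.
 * The flag name/value follow the suggestion in this thread and are not
 * in any released kernel or glibc headers. "/myq" is illustrative. */
#include <sys/epoll.h>
#include <mqueue.h>
#include <fcntl.h>
#include <stdio.h>

#ifndef EPOLLEXCL
#define EPOLLEXCL (1u << 28)
#endif

int main(void)
{
        mqd_t mq = mq_open("/myq", O_RDONLY | O_NONBLOCK);
        int epfd = epoll_create1(0);
        struct epoll_event ev = {
                .events = EPOLLIN | EPOLLEXCL, /* ask to be an exclusive waiter */
                .data.fd = (int)mq,
        };

        if (epoll_ctl(epfd, EPOLL_CTL_ADD, mq, &ev) < 0)
                perror("epoll_ctl");

        /* Each of the N processes does this; with EPOLLEXCL only one of
         * the blocked processes should be woken per message. */
        struct epoll_event out;
        int n = epoll_wait(epfd, &out, 1, -1);
        printf("epoll_wait returned %d\n", n);
        return 0;
}

With ten such processes you would expect roughly one wakeup per message
instead of ten.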

In terms of implementation, it might make sense to return 0 from
ep_poll_callback() in case ep->wq is empty. That way we continue to
search for an active waiter and service wakeups in a more timely manner
if some threads are busy. We probably also don't want to allow the flag
for nested ep descriptors.
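
Roughly what that could look like, assuming the epoll entry was added to
the source's wait queue with add_wait_queue_exclusive() so that
__wake_up_common() stops at the first wake function that returns non-zero.
This is only a sketch, not a real patch - locking and the existing
ready-list handling are elided:

/*
 * Sketch only: ep_poll_callback() as the wait queue wake function.
 * If nobody is blocked in epoll_wait() on this epoll set, return 0 so
 * the wakeup "falls through" to the next exclusive waiter attached to
 * the same event source.
 */
static int ep_poll_callback(wait_queue_t *wait, unsigned mode, int sync, void *key)
{
        struct epitem *epi = ep_item_from_wait(wait);
        struct eventpoll *ep = epi->ep;
        int ewake = 0;

        /* ... existing logic: queue epi onto ep->rdllist, etc. ... */

        if (waitqueue_active(&ep->wq)) {
                wake_up_locked(&ep->wq); /* someone is blocked in epoll_wait() */
                ewake = 1;               /* consume the exclusive wakeup */
        }

        return ewake;
}

Disallowing the flag for nested ep descriptors would presumably just be an
extra -EINVAL check in epoll_ctl() when the target fd is itself an epoll fd.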

Thanks,

-Jason

> Or maybe somebody else who is familiar with the kernel's epoll
> functionality can comment on this?
>
> Regarding the flag's bitmask, it seems (1<<28) should be used for
> EPOLLEXCL, since the type of epoll_event.events is a 32-bit integer and
> the last bit, 1<<31, is already used by EPOLLET (in
> include/uapi/linux/eventpoll.h).
>
> Thanks a lot in advance,
> Madars
>
>
> Jason Baron @ 2015-08-05 15:32 wrote:
>> On 08/05/2015 07:06 AM, Madars Vitolins wrote:
>>> Jason Baron @ 2015-08-04 18:02 wrote:
>>>> On 08/03/2015 07:48 PM, Eric Wong wrote:
>>>>> Madars Vitolins <m@xxxxxxxxxxx> wrote:
>>>>>> Hi Folks,
>>>>>>
>>>>>> I am developing a kind of open systems application, which uses
>>>>>> multiple processes/executables where each of them monitors some set
>>>>>> of resources (in this case POSIX queues) via the epoll interface. For
>>>>>> example, when 10 processes are in epoll_wait() on the same queue and
>>>>>> one message arrives, all 10 processes get woken up and all of them
>>>>>> try to read the message from the queue. One succeeds, the others get
>>>>>> an EAGAIN error. The problem is with those others, which generate
>>>>>> extra context switches - useless CPU usage. With more processes the
>>>>>> inefficiency gets higher.
>>>>>>
>>>>>> I tried to use EPOLLONESHOT, but it did not help. It seems suitable
>>>>>> for multi-threaded applications and not for multi-process applications.
>>>>>
>>>>> Correct. Most FDs are not shared across processes.
>>>>>
>>>>>> The ideal mechanism for this would be:
>>>>>> 1. If multiple epoll sets in the kernel match the same event and one
>>>>>> or more processes are in epoll_wait(), then send the event to only
>>>>>> one waiter.
>>>>>> 2. If none of the processes are waiting, then send the event to
>>>>>> all epoll sets (as it is currently). Then the first free process
>>>>>> will grab the event.
>>>>>
>>>>> Jason Baron was working on this (search LKML archives for
>>>>> EPOLLEXCLUSIVE, EPOLLROUNDROBIN, EPOLL_ROTATE)
>>>>>
>>>>> However, I was unconvinced about modifying epoll.
>>>>>
>>>>> Perhaps I may be more easily convinced about your mqueue case than his
>>>>> case for listen sockets, though[*]
>>>>>
>>>>
>>>> Yeah, so I implemented an 'EPOLL_ROTATE' mode, where you could have
>>>> multiple epoll fds (or epoll sets) attached to the same wakeup source,
>>>> and have the wakeups 'rotate' among the epoll sets. The wakeup
>>>> essentially walks the list of waiters, wakes up the first thread
>>>> that is actively in epoll_wait(), stops and moves the woken up
>>>> epoll set to the end of the list. So it attempts to balance
>>>> the wakeups among the epoll sets, I think in the way that you
>>>> were describing.
>>>>
>>>> Here is the patchset:
>>>>
>>>> https://lkml.org/lkml/2015/2/24/667
>>>>
>>>> The test program shows how to use the API. Essentially, you
>>>> have to create a 'dummy' epoll fd with the 'EPOLL_ROTATE' flag,
>>>> which you then attach to your shared wakeup source and
>>>> then to your epoll sets. Please let me know if it's unclear.
>>>>
>>>> Thanks,
>>>>
>>>> -Jason
>>>
>>> In my particular case I need to work with multiple
>>> processes/executables running (not threads) and listening on the same
>>> queues (this concept allows the sysadmin to easily manage those processes
>>> (start new ones for balancing or stop them without service
>>> interruption), and if any process dies for some reason (signal, core,
>>> etc.), the whole application does not get killed, but only one
>>> transaction is lost).
>>>
>>> Recently I did tests and found out that the kernel's epoll currently
>>> sends notifications to 4 processes (I think it is the EP_MAX_NESTS
>>> constant) waiting on the same resource (the other 6 from my example
>>> stay asleep). So it is not as bad as I thought before.
>>> It would be nice if EP_MAX_NESTS were configurable, but I guess 4
>>> is fine too.
>>>
>>
>> hmmm...EP_MAX_NESTS is about the level of 'nesting' of epoll sets, IE
>> you can do ep1->ep2->ep3->ep4-> <wakeup src fd>, but you
>> can't add in 'ep5'. The 'epN' above represent epoll file
>> descriptors that are attached together via EPOLL_CTL_ADD.
>>
>> The nesting does not affect how wakeups are done. All epoll fds
>> that are attached to the event source fd are going to get wakeups.
>>
>>
>>> Jason, does your patch work for a multi-process application? How hard
>>> would it be to implement this for such a scenario?
>>
>> I don't think it would be too hard, but it requires:
>>
>> 1) adding the patches
>> 2) re-compiling, running new kernel
>> 3) modifying your app to use the new API.
>>
>> Thanks,
>>
>> -Jason
>>
>>
>>>
>>> Madars
>>>
>>>>
>>>>> Typical applications have few (probably only one) listen sockets or
>>>>> POSIX mqueues, so I would rather use dedicated threads to issue
>>>>> blocking syscalls (accept4 or mq_timedreceive).
>>>>>
>>>>> Making blocking syscalls allows exclusive wakeups to avoid thundering
>>>>> herds.
>>>>>
>>>>>> What do you think, would it be realistic to implement this? How about
>>>>>> concurrency?
>>>>>> Can you please give me some hints on which points in the code to start
>>>>>> from to implement these changes?
>>>>>
>>>>> For now, I suggest dedicating a thread in each process to do
>>>>> mq_timedreceive/mq_receive, assuming you only have a small number
>>>>> of queues in your system.
>>>>>
>>>>>
>>>>> [*] mq_timedreceive may copy a largish buffer, which benefits from
>>>>> staying on the same CPU as much as possible.
>>>>> In contrast, accept4 only creates a client socket. With a C10K+
>>>>> socket server (e.g. http/memcached/DB), a typical new client
>>>>> socket spends a fair amount of time idle. Thus I don't believe
>>>>> memory locality inside the kernel is much of a concern when there are
>>>>> thousands of accepted client sockets.
>>>>>
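
As an illustration of the dedicated-blocking-thread approach Eric describes
above, a per-process receiver could look roughly like the sketch below (the
queue name and the hand-off step are illustrative assumptions, not code from
this thread):

/* One thread per process blocks in mq_receive(); the kernel wakes only
 * one blocked receiver per message, so there is no thundering herd. */
#include <fcntl.h>
#include <mqueue.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

static void *mq_worker(void *arg)
{
        const char *qname = arg;                /* e.g. "/myq" */
        mqd_t mq = mq_open(qname, O_RDONLY);
        if (mq == (mqd_t)-1) {
                perror("mq_open");
                return NULL;
        }

        struct mq_attr attr;
        mq_getattr(mq, &attr);
        char *buf = malloc(attr.mq_msgsize);

        for (;;) {
                ssize_t n = mq_receive(mq, buf, attr.mq_msgsize, NULL);
                if (n < 0) {
                        perror("mq_receive");
                        break;
                }
                /* hand the message off to the rest of the process here */
        }

        free(buf);
        mq_close(mq);
        return NULL;
}

int main(void)
{
        pthread_t tid;

        pthread_create(&tid, NULL, mq_worker, (void *)"/myq");
        pthread_join(tid, NULL);
        return 0;
}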