Re: PROBLEM: epoll_wait does not obey edge triggering semantics for hierarchically constructed epoll sets

From: Jason Baron
Date: Thu Jan 18 2018 - 12:42:14 EST




On 01/17/2018 04:29 PM, Nick Murphy wrote:
> Thanks.
>
> Yeah, I didn't track it down, but I suspect this behavior has always
> been there. I do think it's ultimately incorrect behavior (i.e., a
> violation of edge triggering semantics as I note in the initial
> report). The implications of this is that we'd like the option to
> construct hierarchical epoll sets and use them in generic code that
> uses epoll_wait and expects edge triggering, but we can't (because
> epoll fd's behave differently from other fd's). I'm a little
> surprised others haven't come across this.
>
> I'm not really a kernel developer nor am I particularly familiar with
> this code, so I'm unclear how ugly a fix would be...I can imagine
> there may be locking issues with recursively traversing child epoll
> fd's?...
>
> Nick

Hi,

The current behavior for the 'outer' epoll fd could be described as
being level triggered regardless if its attached to the 'inner' epoll fd
as edge or level triggered. Afaict they act the same.

The typical usage pattern for nested epoll would seem to be to
epoll_wait() on the 'outer' epoll fd, and when that triggers to then
call epoll_wait() on the 'inner' epoll fd that triggered. Once those
'inner' events have been processed, then you return to wait on the
'outer' epoll fd. At that point the 'outer' epoll fd will no longer show
events and thus you have something resembling edge triggering for the
'outer' epoll fd in the current code.

If that's not sufficient for what you are trying to do, you could try
the patch I attached in the previous mail, and see if it matches what
you are expecting....

Thanks,

-Jason


>
> On Wed, Jan 17, 2018 at 9:21 AM, Jason Baron <jbaron@xxxxxxxxxx> wrote:
>>
>>
>> On 01/12/2018 07:06 PM, Nick Murphy wrote:
>>> [1.] One line summary of the problem:
>>> epoll_wait does not obey edge triggering semantics for file
>>> descriptors which are themselves epoll file descriptors (i.e., epoll
>>> fd's added to an epoll set with the EPOLLET flag)
>>>
>>> [2.] Full description of the problem/report:
>>> When executing the following sequence:
>>> 1) create and add an event fd (for example) to an inner epoll set
>>> 2) add the inner epoll fd to an outer epoll set (with EPOLLET flag set)
>>> 3) write to (increase the value of) the event fd
>>> 4) epoll_wait on outer fd
>>> 5) epoll_wait on outer fd again
>>>
>>> Edge triggering semantics imply that the epoll_wait in step 5 should
>>> block (nothing has changed). It does not. It returns immediately.
>>>
>>> If epoll_wait is called on the inner fd between steps 4 and 5, the
>>> epoll_wait in step 5 will then block as expected.
>>>
>>> Does not seem to matter if the event is added to the inner epoll set
>>> with EPOLLET set or not.
>>>
>>> [3.] Keywords (i.e., modules, networking, kernel): epoll, epoll_wait,
>>> edge triggering
>>>
>>> [4.] Kernel version (from /proc/version): 4.4.0-103-generic (gcc version 4.8.4)
>>>
>>> [6.] A small shell script or example program which triggers the
>>> problem (if possible)
>>>
>>
>> Interesting - it seems that epoll can excessively queue wakeup events
>> when not desired. Here's a small patch which cures this case, if you
>> want to try it out. The semantics around nested edge trigger, though do
>> seem unexpected. For example, in the test case you presented. If one
>> does epoll_wait() on the outer first and then the inner both
>> epoll_wait() will return, however if one does the inner first and then
>> the outer, only the inner will return an event. This has to do with how
>> epoll implements its polling, it seems odd as well and trickier to fix.
>> Afaict its always acted like this.
>>
>> Thanks,
>>
>> -Jason
>>
>> diff --git a/fs/eventpoll.c b/fs/eventpoll.c
>> index afd548e..6bd1f46 100644
>> --- a/fs/eventpoll.c
>> +++ b/fs/eventpoll.c
>> @@ -713,6 +713,7 @@ static int ep_scan_ready_list(struct eventpoll *ep,
>> if (!ep_is_linked(&epi->rdllink)) {
>> list_add_tail(&epi->rdllink, &ep->rdllist);
>> ep_pm_stay_awake(epi);
>> + pwake = 1;
>> }
>> }
>> /*
>> @@ -728,7 +729,8 @@ static int ep_scan_ready_list(struct eventpoll *ep,
>> list_splice(&txlist, &ep->rdllist);
>> __pm_relax(ep->ws);
>>
>> - if (!list_empty(&ep->rdllist)) {
>> + if (unlikely(pwake)) {
>> + pwake = 0;
>> /*
>> * Wake up (if active) both the eventpoll wait list and
>> * the ->poll() wait list (delayed after we release the
>> lock).
>> @@ -744,7 +746,7 @@ static int ep_scan_ready_list(struct eventpoll *ep,
>> mutex_unlock(&ep->mtx);
>>
>> /* We have to call this outside the lock */
>> - if (pwake)
>> + if (unlikely(pwake))
>> ep_poll_safewake(&ep->poll_wait);
>>
>> return error;
>>
>>
>>> #include <fcntl.h>
>>> #include <stdio.h>
>>> #include <sys/epoll.h>
>>> #include <sys/eventfd.h>
>>> #include <unistd.h>
>>>
>>> int main(int argc, char** argv) {
>>> struct epoll_event ev, events[1];
>>> int inner_ep, outer_ep, sem, flags, ret;
>>> long long val = 1;
>>>
>>> if ((sem = eventfd(0, 0)) < 0) {
>>> fprintf(stderr, "eventfd failed");
>>> return -1;
>>> }
>>>
>>> if ((inner_ep = epoll_create(1)) < 0) {
>>> fprintf(stderr, "inner epoll_create failed");
>>> return -1;
>>> }
>>>
>>> // Set inner to be non-blocking (probably irrelevant, but...)
>>> if ((flags = fcntl(inner_ep, F_GETFL, 0)) < 0) {
>>> fprintf(stderr, "fcntl get failed");
>>> return -1;
>>> }
>>> flags |= O_NONBLOCK;
>>> if (fcntl(inner_ep, F_SETFL, flags) < 0) {
>>> fprintf(stderr, "fcntl set failed");
>>> return -1;
>>> }
>>>
>>> // Add the event to the inner epoll instance.
>>> ev.events = EPOLLIN | EPOLLET;
>>> ev.data.fd = sem;
>>> if (epoll_ctl(inner_ep, EPOLL_CTL_ADD, sem, &ev) < 0) {
>>> fprintf(stderr, "inner add failed");
>>> return -1;
>>> }
>>>
>>> if ((outer_ep = epoll_create(1)) < 0) {
>>> fprintf(stderr, "outer epoll_create failed");
>>> return -1;
>>> }
>>>
>>> // Add the inner epoll instance to the outer. Note the EPOLLET!
>>> ev.events = EPOLLIN | EPOLLET;
>>> ev.data.fd = inner_ep;
>>> if (epoll_ctl(outer_ep, EPOLL_CTL_ADD, inner_ep, &ev) < 0) {
>>> fprintf(stderr, "outer add failed");
>>> return -1;
>>> }
>>>
>>> // Write to the event.
>>> if (write(sem, &val, sizeof(val)) != sizeof(val)) {
>>> fprintf(stderr, "write failed");
>>> return -1;
>>> }
>>>
>>> // This should return immediately.
>>> printf("First wait.\n");
>>> if ((ret = epoll_wait(outer_ep, events, 1, 10000)) < 0) {
>>> fprintf(stderr, "epoll_wait failed");
>>> return -1;
>>> }
>>> printf("...returned %d.\n", ret);
>>>
>>> // This should _not_ return immediately if edge triggered!
>>> printf("Second wait.");
>>> if ((ret = epoll_wait(outer_ep, events, 1, 10000)) < 0) {
>>> fprintf(stderr, "epoll_wait failed");
>>> return -1;
>>> }
>>> printf("...returned %d.\n", ret);
>>>
>>> printf("Done.\n");
>>>
>>> return 0;
>>> }
>>>
>>> [7.] Environment
>>> (Should not be sensitive to hardware or kernel version, I don't
>>> think...a basic flaw with epoll_wait logic)
>>>