Re: [PATCH RESEND] fs/epoll: fix the edge-triggered mode for nested epoll

From: Jason Baron
Date: Thu Sep 05 2019 - 13:48:55 EST




On 9/5/19 1:27 PM, Roman Penyaev wrote:
> On 2019-09-05 11:56, Heiher wrote:
>> Hi,
>>
>> On Thu, Sep 5, 2019 at 10:53 AM Heiher <r@xxxxxx> wrote:
>>>
>>> Hi,
>>>
>>> I created an epoll wakeup test project and listed some possible
>>> cases. Are there any other corner cases that need to be added?
>>>
>>> https://github.com/heiher/epoll-wakeup/blob/master/README.md
>>>
>>> On Wed, Sep 4, 2019 at 10:02 PM Heiher <r@xxxxxx> wrote:
>>> >
>>> > Hi,
>>> >
>>> > On Wed, Sep 4, 2019 at 8:02 PM Jason Baron <jbaron@xxxxxxxxxx> wrote:
>>> > >
>>> > >
>>> > >
>>> > > On 9/4/19 5:57 AM, Roman Penyaev wrote:
>>> > > > On 2019-09-03 23:08, Jason Baron wrote:
>>> > > >> On 9/2/19 11:36 AM, Roman Penyaev wrote:
>>> > > >>> Hi,
>>> > > >>>
>>> > > >>> This is indeed a bug. (Quick side note: could you please
>>> > > >>> remove efd[1] from your test, because it is not related to
>>> > > >>> reproducing the current bug.)
>>> > > >>>
>>> > > >>> Your patch lacks a good description of what exactly you've
>>> > > >>> fixed. Let me reason out loud, and please correct me if I'm
>>> > > >>> wrong; my understanding of epoll internals has become a bit
>>> > > >>> rusty: when epoll fds are nested, an attempt to harvest
>>> > > >>> events (an ep_scan_ready_list() call) produces a second
>>> > > >>> (repeated) event from an internal fd up to an external fd:
>>> > > >>>
>>> > > >>>      epoll_wait(efd[0], ...):
>>> > > >>>        ep_send_events():
>>> > > >>>          ep_scan_ready_list(depth=0):
>>> > > >>>            ep_send_events_proc():
>>> > > >>>              ep_item_poll():
>>> > > >>>                ep_scan_ready_list(depth=1):
>>> > > >>>                  ep_poll_safewake():
>>> > > >>>                    ep_poll_callback()
>>> > > >>>                      list_add_tail(&epi, &epi->rdllist);
>>> > > >>>                      ^^^^^^
>>> > > >>>                      repeated event
>>> > > >>>
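As a reference point, the nesting above corresponds to a userspace
setup roughly like the sketch below. This is not the actual
reproducer; the fd names and the use of an eventfd as the event
source are made up for illustration:

/* rough sketch of the nested-epoll arrangement discussed above;
 * error handling omitted, names are illustrative only
 */
#include <sys/epoll.h>
#include <sys/eventfd.h>

int main(void)
{
        struct epoll_event e;
        int efd[2], evfd;

        efd[0] = epoll_create1(0);       /* outer epoll fd */
        efd[1] = epoll_create1(0);       /* inner epoll fd */
        evfd = eventfd(0, EFD_NONBLOCK); /* stand-in event source */

        /* the inner epoll watches the event source, edge-triggered */
        e.events = EPOLLIN | EPOLLET;
        e.data.fd = evfd;
        epoll_ctl(efd[1], EPOLL_CTL_ADD, evfd, &e);

        /* the outer epoll watches the inner epoll fd */
        e.events = EPOLLIN;
        e.data.fd = efd[1];
        epoll_ctl(efd[0], EPOLL_CTL_ADD, efd[1], &e);

        /* a write now propagates evfd -> efd[1] -> efd[0], and
         * harvesting on efd[0] walks the trace shown above
         */
        eventfd_write(evfd, 1);
        epoll_wait(efd[0], &e, 1, 0);
        return 0;
}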
>>> > > >>>
>>> > > >>> In your patch you forbid the wakeup for the cases where
>>> > > >>> depth != 0, i.e. for all nested cases. That seems clear.
>>> > > >>> But what if we go further and remove the whole chunk, which
>>> > > >>> seems excessive:
>>> > > >>>
>>> > > >>> @@ -885,26 +886,11 @@ static __poll_t ep_scan_ready_list(struct eventpoll *ep,
>>> > > >>>
>>> > > >>> -
>>> > > >>> -        if (!list_empty(&ep->rdllist)) {
>>> > > >>> -                /*
>>> > > >>> -                 * Wake up (if active) both the eventpoll
>>> > > >>> -                 * wait list and the ->poll() wait list
>>> > > >>> -                 * (delayed after we release the lock).
>>> > > >>> -                 */
>>> > > >>> -                if (waitqueue_active(&ep->wq))
>>> > > >>> -                        wake_up(&ep->wq);
>>> > > >>> -                if (waitqueue_active(&ep->poll_wait))
>>> > > >>> -                        pwake++;
>>> > > >>> -        }
>>> > > >>>          write_unlock_irq(&ep->lock);
>>> > > >>>
>>> > > >>>          if (!ep_locked)
>>> > > >>>                  mutex_unlock(&ep->mtx);
>>> > > >>>
>>> > > >>> -        /* We have to call this outside the lock */
>>> > > >>> -        if (pwake)
>>> > > >>> -                ep_poll_safewake(&ep->poll_wait);
>>> > > >>>
>>> > > >>>
>>> > > >>> My reasoning is this: by the time we've reached the point
>>> > > >>> of scanning events for readiness, all wakeups from
>>> > > >>> ep_poll_callback() have already fired, and new events have
>>> > > >>> already been accounted for in the ready list
>>> > > >>> (ep_poll_callback() calls the same ep_poll_safewake()).
>>> > > >>> Here, frankly, I'm not 100% sure and am probably missing
>>> > > >>> some corner cases.
>>> > > >>>
>>> > > >>> Thoughts?
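For reference, the wakeup tail of ep_poll_callback() that Roman is
pointing at looks roughly like this (paraphrased from memory, not an
exact copy of the source):

        /* in ep_poll_callback(), after queueing the event on
         * ep->rdllist: wake local waiters, note nested ones
         */
        if (waitqueue_active(&ep->wq))
                wake_up(&ep->wq);
        if (waitqueue_active(&ep->poll_wait))
                pwake++;
        ...
        /* outside the lock: propagate to nested epoll fds */
        if (pwake)
                ep_poll_safewake(&ep->poll_wait);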
>>> > > >>
>>> > > >> So the 'wake_up(&ep->wq);' part, I think, is about waking
>>> > > >> up other threads that may be waiting in epoll_wait(). For
>>> > > >> example, there may be multiple threads doing epoll_wait()
>>> > > >> on the same epoll fd, and the logic above seems to say:
>>> > > >> thread 1 may have processed, say, N events and is now going
>>> > > >> to go off and work on those, so let's wake up thread 2 now
>>> > > >> to handle the next chunk.
>>> > > >
>>> > > > Not quite. The thread which calls ep_scan_ready_list()
>>> > > > processes all the events, and while processing them removes
>>> > > > them one by one from the ready list. But if the event mask
>>> > > > is !0 and the event belongs to a level-triggered descriptor
>>> > > > (say, the default mode), it tails the event back onto the
>>> > > > list (because we are in level mode, so the event should stay
>>> > > > there). So at the end of this traversal loop the ready list
>>> > > > is likely not empty, and if so, wakeup is called again for
>>> > > > nested epoll fds. But those nested epoll fds should have
>>> > > > already gotten all the notifications from the main event
>>> > > > callback, ep_poll_callback(), regardless of which thread
>>> > > > traverses the events.
>>> > > >
>>> > > > I suppose this logic has existed for decades, since Davide
>>> > > > (the author) was reshuffling the code here and there.
>>> > > >
>>> > > > But I do not feel confident stating that this extra wakeup
>>> > > > is bogus; I just have a gut feeling that it looks excessive.
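For context, the "tails the event back" step is this bit of
ep_send_events_proc(), roughly (paraphrased from memory, trimmed):

        if (!(epi->event.events & EPOLLET)) {
                /* level-triggered: put the item back on the ready
                 * list so the next scan sees it again
                 */
                list_add_tail(&epi->rdllink, &ep->rdllist);
                ep_pm_stay_awake(epi);
        }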
>>> > >
>>> > > Note that I was talking about the wakeup done on ep->wq, not
>>> > > ep->poll_wait. The path that I'm concerned about is: let's say
>>> > > there are N events queued on the ready list. A thread that was
>>> > > woken up in epoll_wait() may decide to process only, say, N/2
>>> > > of them. Then it will call wakeup on ep->wq, and this will
>>> > > wake up another thread to process the remaining N/2. Without
>>> > > the wakeup, the remaining events aren't going to be processed
>>> > > until the original thread finishes with its N/2 and gets back
>>> > > to epoll_wait(). So I'm not sure how important that path is,
>>> > > but I wanted to at least note that the change here would
>>> > > impact that behavior.
>>> > >
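To make that path concrete: a made-up userspace pattern that relies
on this wakeup could look like the sketch below (BATCH, epfd, and
worker() are invented names, not from the test suite):

#include <pthread.h>
#include <sys/epoll.h>

#define BATCH 4   /* each waiter harvests at most 4 events */

static int epfd;  /* shared epoll fd, set up elsewhere */

static void *worker(void *arg)
{
        struct epoll_event evs[BATCH];
        int i, n;

        for (;;) {
                /* harvest one batch; if more events remain queued,
                 * the in-kernel wake_up(&ep->wq) lets another
                 * worker pick them up while this one is busy
                 */
                n = epoll_wait(epfd, evs, BATCH, -1);
                for (i = 0; i < n; i++)
                        ; /* ... handle evs[i] ... */
        }
        return NULL;
}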
>>> > > Thanks,
>>> > >
>>> > > -Jason
>>> > >
>>> > >
>>> > > >
>>> > > >> So I think removing all that even for the depth 0 case is
>>> > > >> going to change some behavior here. So perhaps it should be
>>> > > >> removed for all depths except 0? And if so, it may be
>>> > > >> better to make 2 patches here to separate these changes.
>>> > > >>
>>> > > >> For the nested wakeups, I agree that the extra wakeups seem
>>> > > >> unnecessary, and it may make sense to remove them for all
>>> > > >> depths. I don't think the nested epoll semantics are
>>> > > >> particularly well spelled out, and afaict nested epoll has
>>> > > >> behaved this way for quite some time. And the current
>>> > > >> behavior is not bad in the way that a missing wakeup or
>>> > > >> false negative would be.
>>> > > >
>>> > > > That's 100% true! For edge mode the extra wakeup is not a
>>> > > > bug; not optimal for userspace, yes, but it can't lead to
>>> > > > any lost wakeups.
>>> > > >
>>> > > > --
>>> > > > Roman
>>> > > >
>>> >
>>> > I tried to remove the whole chunk of code that Roman mentioned,
>>> > and it seems that there are no obvious problems with the two
>>> > test programs below:
>>
>> I recall this message: test cases 9, 25, and 26 of epoll-wakeup
>> (on GitHub) fail when the whole chunk is removed.
>>
>> With the original patch applied, all tests pass.
>
>
> These are failing on my bare 5.2.0-rc2
>
> TEST  bin/epoll31      FAIL
> TEST  bin/epoll46      FAIL
> TEST  bin/epoll50      FAIL
> TEST  bin/epoll32      FAIL
> TEST  bin/epoll19      FAIL
> TEST  bin/epoll27      FAIL
> TEST  bin/epoll42      FAIL
> TEST  bin/epoll34      FAIL
> TEST  bin/epoll48      FAIL
> TEST  bin/epoll40      FAIL
> TEST  bin/epoll20      FAIL
> TEST  bin/epoll28      FAIL
> TEST  bin/epoll38      FAIL
> TEST  bin/epoll52      FAIL
> TEST  bin/epoll24      FAIL
> TEST  bin/epoll23      FAIL
>
>
> These are failing if your patch is applied:
> (my 5.2.0-rc2 is old? broken?)
>
> TEST  bin/epoll46      FAIL
> TEST  bin/epoll42      FAIL
> TEST  bin/epoll34      FAIL
> TEST  bin/epoll48      FAIL
> TEST  bin/epoll40      FAIL
> TEST  bin/epoll44      FAIL
> TEST  bin/epoll38      FAIL
>
> These are failing if ep_poll_safewake(&ep->poll_wait) is not
> called, but wake_up(&ep->wq) is still invoked:
>
> TEST  bin/epoll46      FAIL
> TEST  bin/epoll42      FAIL
> TEST  bin/epoll34      FAIL
> TEST  bin/epoll40      FAIL
> TEST  bin/epoll44      FAIL
> TEST  bin/epoll38      FAIL
>
> So at least 48 has been "fixed".
>
> These are failing if the whole chunk is removed; as you said,
> 9, 25, and 26 are among those that do not pass:
>
> TEST  bin/epoll26      FAIL
> TEST  bin/epoll42      FAIL
> TEST  bin/epoll34      FAIL
> TEST  bin/epoll9       FAIL
> TEST  bin/epoll48      FAIL
> TEST  bin/epoll40      FAIL
> TEST  bin/epoll25      FAIL
> TEST  bin/epoll44      FAIL
> TEST  bin/epoll38      FAIL
>
> This could be a good test suite; perhaps it could be added to
> kselftests?
>
> --
> Roman
>


Indeed, I just tried the same test suite and I am seeing similar
failures - it looks like it's a bit timing-dependent, and all of the
failures are caused by a similar issue. For example, take epoll34:

     t0      t1
(ew)  |       | (ew)
     e0       |
 (lt)  \     /
        \   /
         e1
          | (et)
         s0

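In code, that topology is set up along these lines (a sketch, not
the test's literal source; an eventfd stands in for s0):

#include <sys/epoll.h>
#include <sys/eventfd.h>

static int e0, e1, s0;

static void setup_topology(void)
{
        struct epoll_event e;

        e0 = epoll_create1(0);
        e1 = epoll_create1(0);
        s0 = eventfd(0, EFD_NONBLOCK);

        e.events = EPOLLIN | EPOLLET;   /* e1 watches s0, 'et' */
        e.data.fd = s0;
        epoll_ctl(e1, EPOLL_CTL_ADD, s0, &e);

        e.events = EPOLLIN;             /* e0 watches e1, 'lt' */
        e.data.fd = e1;
        epoll_ctl(e0, EPOLL_CTL_ADD, e1, &e);

        /* t0 then does epoll_wait() on e0 while thread t1 does
         * epoll_wait() on e1 (the 'ew' edges in the diagram)
         */
}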

The test is trying to assert that an epoll_wait() on e1 and an
epoll_wait() on e0 both return 1 event for EPOLLIN. However, the
epoll_wait() on e1 is done in a thread, and it can happen before or
after the epoll_wait() against e0. If the epoll_wait() on e1 happens
first, then because e1 watches s0 as 'et', it consumes the event, so
there is no longer an event reported at e0. I think those are
reasonable semantics here. However, if the wait on e0 happens before
the wait on e1, the test will pass, as both waits will see the event.
Thus, a patch like this will 'fix' this testcase:

--- a/src/epoll34.c
+++ b/src/epoll34.c
@@ -59,15 +59,15 @@ int main(int argc, char *argv[])
         if (epoll_ctl(efd[0], EPOLL_CTL_ADD, efd[1], &e) < 0)
                 goto out;

-        if (pthread_create(&tw, NULL, thread_handler, NULL) < 0)
-                goto out;
-
         if (pthread_create(&te, NULL, emit_handler, NULL) < 0)
                 goto out;

         if (epoll_wait(efd[0], &e, 1, 500) == 1)
                 count++;

+        if (pthread_create(&tw, NULL, thread_handler, NULL) < 0)
+                goto out;
+
         if (pthread_join(tw, NULL) < 0)
                 goto out;

I found all the other failures to be of similar origin. I suspect
Heiher didn't see these failures because of different thread timings.

I also found that all the testcases pass if we leave the
wake_up(&ep->wq) call in place for the depth 0 case (and remove the
pwake part).
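In other words, something along these lines in ep_scan_ready_list()
(a sketch against 5.2-era code, untested):

        if (!list_empty(&ep->rdllist)) {
                /* keep the wakeup of local epoll_wait() waiters for
                 * the top-level call only; the nested
                 * ep_poll_safewake() (the pwake part) is dropped
                 */
                if (!depth && waitqueue_active(&ep->wq))
                        wake_up(&ep->wq);
        }
        write_unlock_irq(&ep->lock);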

Thanks,

-Jason