Re: Weird issue with epoll and kernel >= 5.0

From: Davidlohr Bueso
Date: Tue Mar 31 2020 - 14:12:56 EST


On Sat, 28 Mar 2020, Randy Dunlap wrote:

On 3/28/20 11:10 AM, Omar Kilani wrote:
Hi there,

I've observed an issue with epoll and kernels 5.0 and above when a
system is generating a lot of epoll events.

I see this issue with nginx and jvm / netty based apps (using the
jvm's native epoll support as well as netty's own optimized epoll
support) but *not* with haproxy (?).

I'm not really sure what the actual problem is (nginx complains about
epoll_wait with a generic error), but it doesn't happen on 4.19.x and
lower.

I thought it was a netty problem at first and opened this ticket:

https://github.com/netty/netty/issues/8999

But then saw the same issue in nginx.

I haven't debugged a kernel issue in something like 20 years so I'm
not really sure where to start myself.

I'd be more than happy to provide my test case that has a very quick
repro to anyone who needs it.

Hi,
Please do.

Also happy to provide a VM/machine with enough CPUs to trigger it
easily (it seems to happen quicker with more CPUs present) to test
with.

Yeah, more than a VM, an actual reproducer would be very welcome here.



There have been around 10 changes in fs/eventpoll.c since v5.0 was
released in March 2019, so it would be helpful if you could test
the latest mainline kernel to see if the problem is still present.

Hm, it looks like you have identified this commit:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?h=v5.1-rc5&id=c5a282e9635e9c7382821565083db5d260085e3e
as the/a problem.

Has this been bisected down to this commit? As you mention, there are more
commits in there that depend on each other, so I'd like
to be certain this is actually the broken change.

Thanks,
Davidlohr