Weird issue with epoll and kernel >= 5.0
From: Omar Kilani
Date: Sat Mar 28 2020 - 14:12:28 EST
Hi there,
I've observed an issue with epoll and kernels 5.0 and above when a
system is generating a lot of epoll events.
I see this issue with nginx and JVM / Netty based apps (using the
JVM's native epoll support as well as Netty's own optimized epoll
support) but *not* with haproxy (?).
I'm not really sure what the actual problem is (nginx complains about
epoll_wait with a generic error), but it doesn't happen on 4.19.x and
lower.
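For anyone unfamiliar with how this kind of failure shows up in user space, here is a minimal illustrative sketch. It is not my actual test case and is not taken from nginx or Netty. It assumes Linux and uses Python's standard select.epoll, which wraps the epoll_wait syscall. The helper names wait_once and demo are made up for illustration.

```python
import errno
import os
import select


def wait_once(ep, timeout=1.0):
    """Wait once on an epoll object.

    Returns (events, None) on success, or ([], errno_name) if the
    underlying wait fails with an OSError.
    """
    try:
        return ep.poll(timeout), None
    except OSError as exc:
        # Map the numeric errno to its symbolic name (e.g. "EBADF").
        return [], errno.errorcode.get(exc.errno, str(exc.errno))


def demo():
    """Register the read end of a pipe, make it readable, wait once."""
    r, w = os.pipe()
    ep = select.epoll()
    try:
        ep.register(r, select.EPOLLIN)
        os.write(w, b"x")  # one byte makes the read end readable
        events, err = wait_once(ep)
    finally:
        ep.close()
        os.close(r)
        os.close(w)
    return events, err


if __name__ == "__main__":
    print(demo())
```

On a healthy kernel the wait should return one readable event and no error. A loop like this only illustrates where an error from epoll_wait would be reported; it does not generate enough events to trigger the problem.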
At first I thought it was a Netty problem and opened this ticket:
https://github.com/netty/netty/issues/8999
But then I saw the same issue in nginx.
I haven't debugged a kernel issue in something like 20 years so I'm
not really sure where to start myself.
I'd be more than happy to provide my test case that has a very quick
repro to anyone who needs it.
I'm also happy to provide a VM/machine with enough CPUs to trigger it
easily (it seems to happen more quickly with more CPUs present) to test
with.
Thanks!
Regards,
Omar