[PATCH] eventfd: Enlarge recursion limit to allow vhost to work
From: He Zhe
Date: Fri Jun 18 2021 - 04:47:38 EST
Commit b5e683d5cab8 ("eventfd: track eventfd_signal() recursion depth")
introduces a percpu counter that tracks the percpu recursion depth and
warns if it is greater than zero, to avoid potential deadlock and stack
overflow.
However, different eventfds may sometimes be used in parallel. Specifically,
when heavy network load goes through kvm and vhost, working as shown below,
it triggers the following call trace.
-  100.00%
   - 66.51%
        ret_from_fork
        kthread
      - vhost_worker
         - 33.47% handle_tx_kick
              handle_tx
              handle_tx_copy
              vhost_tx_batch.isra.0
              vhost_add_used_and_signal_n
              eventfd_signal
         - 33.05% handle_rx_net
              handle_rx
              vhost_add_used_and_signal_n
              eventfd_signal
   - 33.49%
        ioctl
        entry_SYSCALL_64_after_hwframe
        do_syscall_64
        __x64_sys_ioctl
        ksys_ioctl
        do_vfs_ioctl
        kvm_vcpu_ioctl
        kvm_arch_vcpu_ioctl_run
        vmx_handle_exit
        handle_ept_misconfig
        kvm_io_bus_write
        __kvm_io_bus_write
        eventfd_signal
001: WARNING: CPU: 1 PID: 1503 at fs/eventfd.c:73 eventfd_signal+0x85/0xa0
---- snip ----
001: Call Trace:
001: vhost_signal+0x15e/0x1b0 [vhost]
001: vhost_add_used_and_signal_n+0x2b/0x40 [vhost]
001: handle_rx+0xb9/0x900 [vhost_net]
001: handle_rx_net+0x15/0x20 [vhost_net]
001: vhost_worker+0xbe/0x120 [vhost]
001: kthread+0x106/0x140
001: ? log_used.part.0+0x20/0x20 [vhost]
001: ? kthread_park+0x90/0x90
001: ret_from_fork+0x35/0x40
001: ---[ end trace 0000000000000003 ]---
This patch enlarges the limit to 1, which is the maximum recursion depth we
have found so far.
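To illustrate the effect of the limit, here is a minimal userspace sketch
of the depth check. It is not kernel code: wake_count stands in for the
percpu eventfd_wake_count, and model_signal(), accepted_depth() and
deepest_accepted are hypothetical names used only for this illustration.
A limit of 0 models the check before this patch, and a limit of 1 models
EFD_WAKE_COUNT_MAX.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace stand-in for the percpu eventfd_wake_count. */
static int wake_count;

/* Deepest nesting level at which a signal was accepted (illustration only). */
static int deepest_accepted;

/*
 * Model of the guarded signal path: refuse the call when the current depth
 * exceeds `limit`; otherwise bump the counter, issue up to `nest` further
 * nested signals (standing in for a wakeup callback that signals another
 * eventfd), and drop the counter again. Returns true if the call was accepted.
 */
static bool model_signal(int limit, int nest)
{
	if (wake_count > limit)
		return false;	/* the kernel would WARN_ON_ONCE() and return 0 */

	wake_count++;
	if (wake_count > deepest_accepted)
		deepest_accepted = wake_count;
	if (nest > 0)
		model_signal(limit, nest - 1);
	assert(wake_count > 0);
	wake_count--;
	return true;
}

/* Start a chain of nest + 1 nested signals; report how many were accepted. */
static int accepted_depth(int limit, int nest)
{
	wake_count = 0;
	deepest_accepted = 0;
	model_signal(limit, nest);
	return deepest_accepted;
}
```

In this model, a limit of 0 rejects any signal issued while another one is
in progress, while a limit of 1 accepts exactly one nested signal and
rejects anything deeper.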
Credit for the modification to eventfd_signal_count() goes to
Xie Yongji <xieyongji@xxxxxxxxxxxxx>
Signed-off-by: He Zhe <zhe.he@xxxxxxxxxxxxx>
---
fs/eventfd.c | 3 ++-
include/linux/eventfd.h | 5 ++++-
2 files changed, 6 insertions(+), 2 deletions(-)
diff --git a/fs/eventfd.c b/fs/eventfd.c
index e265b6dd4f34..add6af91cacf 100644
--- a/fs/eventfd.c
+++ b/fs/eventfd.c
@@ -71,7 +71,8 @@ __u64 eventfd_signal(struct eventfd_ctx *ctx, __u64 n)
 	 * it returns true, the eventfd_signal() call should be deferred to a
 	 * safe context.
 	 */
-	if (WARN_ON_ONCE(this_cpu_read(eventfd_wake_count)))
+	if (WARN_ON_ONCE(this_cpu_read(eventfd_wake_count) >
+	    EFD_WAKE_COUNT_MAX))
 		return 0;
 	spin_lock_irqsave(&ctx->wqh.lock, flags);
diff --git a/include/linux/eventfd.h b/include/linux/eventfd.h
index fa0a524baed0..74be152ebe87 100644
--- a/include/linux/eventfd.h
+++ b/include/linux/eventfd.h
@@ -29,6 +29,9 @@
#define EFD_SHARED_FCNTL_FLAGS (O_CLOEXEC | O_NONBLOCK)
#define EFD_FLAGS_SET (EFD_SHARED_FCNTL_FLAGS | EFD_SEMAPHORE)
+/* This is the maximum recursion depth we find so far */
+#define EFD_WAKE_COUNT_MAX 1
+
struct eventfd_ctx;
struct file;
@@ -47,7 +50,7 @@ DECLARE_PER_CPU(int, eventfd_wake_count);
 static inline bool eventfd_signal_count(void)
 {
-	return this_cpu_read(eventfd_wake_count);
+	return this_cpu_read(eventfd_wake_count) > EFD_WAKE_COUNT_MAX;
 }
 #else /* CONFIG_EVENTFD */
--
2.17.1