[RFC PATCH v5 0/2] eventfd: add configurable maximum counter value for flow control
From: wen . yang
Date: Wed Apr 08 2026 - 13:27:52 EST
From: Wen Yang <wen.yang@xxxxxxxxx>
eventfd's counter is bounded only by ULLONG_MAX (~1.8x10^19). In
non-semaphore mode a fast producer can write continuously while a slow
consumer falls behind: the producer never stalls, the counter grows
without limit, both sides burn CPU at 100%, and consumer lag is
invisible. There is no mechanism to apply back-pressure.

Add EFD_IOC_SET_MAXIMUM and EFD_IOC_GET_MAXIMUM ioctl commands that
set and query a configurable overflow threshold. A write(2) that would
push the counter to or beyond maximum blocks, or fails with EAGAIN on
O_NONBLOCK fds. The kernel-internal eventfd_signal() path may still
raise the counter to maximum (reported as EPOLLERR), preserving the
original overflow semantics. The default is ULLONG_MAX, preserving
backward compatibility.
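
The admission rule for write(2) can be modelled in user space. The
sketch below is not the patch's code: struct efd_model and
efd_model_write() are hypothetical names, and the comparison is an
assumption derived from "a write(2) that would push the counter to or
beyond maximum blocks".

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical user-space model of an eventfd counter; not kernel code. */
struct efd_model {
	uint64_t count;   /* current counter value */
	uint64_t maximum; /* overflow threshold; UINT64_MAX models the default */
};

/*
 * Returns true and adds value if the write is admitted. Returns false if
 * count + value would reach or exceed maximum, i.e. the case where a
 * blocking write(2) would sleep and an O_NONBLOCK write would get EAGAIN.
 * The check uses a subtraction so that count + value can never wrap.
 */
static bool efd_model_write(struct efd_model *m, uint64_t value)
{
	if (m->count >= m->maximum || value >= m->maximum - m->count)
		return false;
	m->count += value;
	return true;
}
```

Under this model, with maximum=10 a writer can raise the counter to 9
but not to 10, so a reader never observes a backlog of 10 or more from
the write(2) path.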

This follows the back-pressure pattern already established in the
kernel: pipe(2) writers block when the buffer is full, and the capacity
is tunable via fcntl(F_SETPIPE_SZ); mq_send(3) blocks when the queue
depth reaches mq_maxmsg. EFD_IOC_SET_MAXIMUM applies the same pattern
to eventfd.
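
For comparison, the existing pipe(2) behaviour can be observed with a
few lines of user-space C. This is only an illustrative sketch:
fill_pipe_until_full() is a hypothetical helper, and the F_SETPIPE_SZ
fallback definition is there only in case the libc headers hide it
without _GNU_SOURCE.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

#ifndef F_SETPIPE_SZ
#define F_SETPIPE_SZ 1031 /* Linux value: F_LINUX_SPECIFIC_BASE + 7 */
#endif

/*
 * Fill a non-blocking pipe until write(2) reports EAGAIN and return the
 * number of bytes accepted, or -1 on any other error. requested_size is
 * passed to F_SETPIPE_SZ; the kernel may round it up.
 */
static long fill_pipe_until_full(int requested_size)
{
	int fds[2];
	char chunk[4096];
	long total = 0;

	if (pipe(fds) < 0)
		return -1;
	/* Best effort: on failure the pipe keeps its default capacity. */
	(void)fcntl(fds[1], F_SETPIPE_SZ, requested_size);
	if (fcntl(fds[1], F_SETFL, O_NONBLOCK) < 0) {
		close(fds[0]);
		close(fds[1]);
		return -1;
	}
	memset(chunk, 0, sizeof(chunk));
	for (;;) {
		ssize_t n = write(fds[1], chunk, sizeof(chunk));

		if (n < 0) {
			/* EAGAIN is the back-pressure signal: the pipe is full. */
			if (errno != EAGAIN)
				total = -1;
			break;
		}
		total += n;
	}
	close(fds[0]);
	close(fds[1]);
	return total;
}
```

The writer is stopped once the configured capacity is reached, which is
the behaviour the new maximum is meant to give eventfd writers.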

Measured on a 4-core x86_64 machine with writer and reader pinned to
separate CPUs; the reader sleeps 1 ms between reads to simulate
processing time:

Bench 1 - burst/CPU (5 s, blocking write)

  maximum     wcpu_ms  rcpu_ms  EAGAIN   writes  reads
  ----------------------------------------------------
  ULLONG_MAX     5002      132       0  6517388   4506
  10              133      150       0    40456   4496

(O_NONBLOCK+spin bypasses flow control; use O_NONBLOCK+poll(POLLOUT)
to avoid wasting CPU on EAGAIN retries while still multiplexing fds)
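
A minimal sketch of the O_NONBLOCK+poll(POLLOUT) writer follows.
efd_write_polled() is a hypothetical helper, not part of the series; it
uses only existing eventfd, write(2) and poll(2) behaviour and assumes
the fd was created with EFD_NONBLOCK.

```c
#include <errno.h>
#include <poll.h>
#include <stdint.h>
#include <sys/eventfd.h>
#include <unistd.h>

/*
 * Add value to a non-blocking eventfd, sleeping in poll(2) instead of
 * spinning when the counter has no room. Returns 0 on success, -1 on
 * error or timeout (errno is ETIMEDOUT on timeout).
 */
static int efd_write_polled(int efd, uint64_t value, int timeout_ms)
{
	struct pollfd pfd = { .fd = efd, .events = POLLOUT };

	for (;;) {
		ssize_t n = write(efd, &value, sizeof(value));

		if (n == (ssize_t)sizeof(value))
			return 0;
		if (n < 0 && errno != EAGAIN)
			return -1;

		/* No room in the counter: wait for the reader to drain it. */
		int r = poll(&pfd, 1, timeout_ms);

		if (r == 0) {
			errno = ETIMEDOUT;
			return -1;
		}
		if (r < 0 && errno != EINTR)
			return -1;
	}
}
```

On current kernels POLLOUT only indicates room for a write of at least
1, so a larger value can still see EAGAIN after poll(2) returns; whether
the same holds relative to the new maximum is an assumption here.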

Bench 2 - latency tail (EFD_SEMAPHORE, 10 K/s writer, ~8 K/s reader,
5000 events)

  maximum     p99_us  p999_us  max_us
  -----------------------------------
  ULLONG_MAX  141218   142477  142588
  10            1719     2378    2381

Bench 3 - coalescing (non-EFD_SEMAPHORE, 10000 writes, 125 us/read
reader; each read drains the full counter)

  maximum     writes  reads  avg_batch
  ------------------------------------
  ULLONG_MAX   10000     79      126.6
  10           10000   1121        8.9

With maximum=10: writer CPU in the burst test drops >97% (5002 ms ->
133 ms); p999 latency drops ~60x (142 ms -> 2.4 ms); and the coalescing
batch stays bounded (average 8.9 vs 126.6 writes per read), so the
consumer always knows the backlog is small.
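
The summary figures can be rechecked from the tables. The helpers below
are hypothetical and only restate the arithmetic.

```c
/* Percentage reduction going from before to after. */
static double percent_drop(double before, double after)
{
	return 100.0 * (before - after) / before;
}

/* How many times smaller after is than before. */
static double reduction_factor(double before, double after)
{
	return before / after;
}

/* Average number of writes coalesced into one read. */
static double avg_batch(unsigned long writes, unsigned long reads)
{
	return (double)writes / (double)reads;
}
```

For example, percent_drop(5002, 133) covers the Bench 1 writer CPU,
reduction_factor(142477, 2378) the Bench 2 p999 latency, and
avg_batch(10000, 79) and avg_batch(10000, 1121) the Bench 3 batch sizes.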

Notes:
- Magic 'J': 'E' conflicts with linux/input.h and xen/evtchn.h; 'J' was
  unregistered and is now added to ioctl-number.rst.
- Command numbers 0/1: explicit distinct numbers are clearer than
  relying solely on direction bits to disambiguate SET from GET.
- .compat_ioctl = compat_ptr_ioctl handles 32-bit user pointers.
- Writers are woken on SET_MAXIMUM: a raised limit takes effect
  immediately without waiting for the next read(2).
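
The ioctl encoding discussed in the notes might look as follows from
user space. Only the magic 'J', the numbers 0 and 1, and the command
names come from this letter; the direction bits, the SET=0/GET=1
assignment, the uint64_t argument type, and the two wrapper functions
are assumptions for illustration. The authoritative definitions are in
include/uapi/linux/eventfd.h of patch 1.

```c
#include <stdint.h>
#include <linux/ioctl.h>
#include <sys/ioctl.h>

/* ASSUMED encodings, see the caveats above; not copied from the patch. */
#ifndef EFD_IOC_SET_MAXIMUM
#define EFD_IOC_SET_MAXIMUM _IOW('J', 0, uint64_t)
#define EFD_IOC_GET_MAXIMUM _IOR('J', 1, uint64_t)
#endif

/*
 * Hypothetical wrapper: set the counter maximum. Returns 0, or -1 with
 * errno set; a kernel without this series is expected to fail with ENOTTY.
 */
static int efd_set_maximum(int efd, uint64_t maximum)
{
	return ioctl(efd, EFD_IOC_SET_MAXIMUM, &maximum);
}

/* Hypothetical wrapper: read the current maximum into *maximum. */
static int efd_get_maximum(int efd, uint64_t *maximum)
{
	return ioctl(efd, EFD_IOC_GET_MAXIMUM, maximum);
}
```

With these assumed definitions the two commands differ in both the
number field and the direction bits. If the argument is a fixed-width
64-bit value, its layout is the same for 32-bit and 64-bit callers, so
only the user pointer itself needs compat translation.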
Changes since v4
(https://lore.kernel.org/all/20250310051832.5658-1-wen.yang@xxxxxxxxx/)
- Use ioctl magic 'J' instead of 'E' (conflict with input.h/xen).
- Add .compat_ioctl = compat_ptr_ioctl.
- Expose eventfd-maximum in /proc/self/fdinfo.
- Return -ENOTTY for unrecognised ioctl commands (was -ENOENT).
- Remove the unnecessary !argp guard in eventfd_ioctl().
- Register magic 'J' in Documentation/userspace-api/ioctl/ioctl-number.rst.
- Add kselftest correctness tests.

Wen Yang (2):
  eventfd: add configurable per-fd counter maximum for flow control
  selftests/eventfd: add EFD_IOC_{SET,GET}_MAXIMUM tests

 .../userspace-api/ioctl/ioctl-number.rst      |   1 +
 fs/eventfd.c                                  |  74 +++++-
 include/uapi/linux/eventfd.h                  |   6 +
 .../filesystems/eventfd/eventfd_test.c        | 238 +++++++++++++++++-
 4 files changed, 306 insertions(+), 13 deletions(-)

--
2.25.1