Re: [PATCH net-next v6 0/4] Per epoll context busy poll support

From: Stanislav Fomichev
Date: Tue Feb 06 2024 - 13:52:36 EST


On 02/05, Joe Damato wrote:
> Greetings:
>
> Welcome to v6.
>
> TL;DR This builds on commit bf3b9f6372c4 ("epoll: Add busy poll support to
> epoll with socket fds.") by allowing user applications to enable
> epoll-based busy polling, set a busy poll packet budget, and enable or
> disable prefer busy poll on a per-epoll-context basis.
>
> This makes epoll-based busy polling much more usable for user
> applications than the current system-wide sysctl and hardcoded budget.
>
> To allow for this, two ioctls have been added to epoll contexts for
> getting and setting a new struct, struct epoll_params.
>
> ioctl was chosen vs a new syscall after reviewing a suggestion by Willem
> de Bruijn [1]. I am open to using a new syscall instead of an ioctl, but it
> seemed that:
> - Busy poll affects all existing epoll_wait and epoll_pwait variants in
> the same way, so new versions of many syscalls might be needed. It
> seems much simpler for users to keep the epoll_wait/epoll_pwait variant
> their app already uses and add a call to ioctl to enable or disable
> busy poll as needed. This also probably means less work to get an
> existing epoll app using busy poll.
>
> - The previously added epoll_pwait2 helped to bring epoll closer to
> existing syscalls (like pselect and ppoll), whereas reflecting this
> busy poll change as a new syscall would not have the same effect.
>
> Note: patch 1/4 as of v4 uses an or (||) instead of an xor. I thought about
> it some more and realized that if the user enables both the per-epoll
> context setting and the system-wide sysctl, then busy poll should be
> enabled and not disabled. Using xor doesn't seem to make much sense after
> thinking this through.
>
> Longer explanation:
>
> Presently epoll has support for a very useful form of busy poll based on
> the incoming NAPI ID (see also: SO_INCOMING_NAPI_ID [2]).
>
> This form of busy poll allows epoll_wait to drive NAPI packet processing,
> which enables a few interesting user application designs that can
> reduce latency and also potentially improve L2/L3 cache hit rates by
> deferring NAPI until userland has finished its work.
>
> The documentation available on this is, IMHO, a bit confusing, so please
> allow me to explain how one might use it:
>
> 1. Ensure each application thread has its own epoll instance mapping
> 1-to-1 with NIC RX queues. An n-tuple filter would likely be used to
> direct connections with specific destination ports to these queues.
>
> 2. Optionally: Set up IRQ coalescing for the NIC RX queues where busy
> polling will occur. This can help prevent the userland app from being
> preempted by a hard IRQ while it is running. Note this means that
> userland must take care to call epoll_wait regularly and not spend too
> long in userland, since it now drives NAPI via epoll_wait.
>
> 3. Optionally: Consider using napi_defer_hard_irqs and gro_flush_timeout to
> further restrict IRQ generation from the NIC. These settings are
> system-wide, so their impact must be carefully weighed against the running
> applications.
>
> 4. Ensure that all incoming connections added to an epoll instance
> have the same NAPI ID. This can be done with a BPF filter when
> SO_REUSEPORT is used, or with getsockopt + SO_INCOMING_NAPI_ID when a
> single accept thread dispatches incoming connections to threads (see
> the sketch after step 5 below).
>
> 5. Lastly, busy poll must be enabled via a sysctl
> (/proc/sys/net/core/busy_poll).
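>
> As a rough sketch of the check in step 4 (illustrative only: the
> helper name is hypothetical and error handling is minimal), the NAPI
> ID of an accepted connection can be read like so:
>
>   #include <sys/socket.h>
>
>   #ifndef SO_INCOMING_NAPI_ID
>   #define SO_INCOMING_NAPI_ID 56 /* value from asm-generic/socket.h */
>   #endif
>
>   /* Return the NAPI ID of the RX queue that delivered traffic for fd,
>    * or 0 if the socket has no NAPI ID yet. An accept thread can use
>    * this to hand the fd to the thread whose epoll instance maps to
>    * that queue.
>    */
>   static unsigned int incoming_napi_id(int fd)
>   {
>           unsigned int napi_id = 0;
>           socklen_t len = sizeof(napi_id);
>
>           if (getsockopt(fd, SOL_SOCKET, SO_INCOMING_NAPI_ID,
>                          &napi_id, &len) == -1)
>                   return 0;
>
>           return napi_id;
>   }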
>
> Please see Eric Dumazet's paper about busy polling [3] and a recent
> academic paper about measured performance improvements of busy polling [4]
> (albeit with a modification that is not currently present in the kernel)
> for additional context.
>
> The unfortunate part about step 5 above is that it enables busy poll
> system-wide, which affects all user applications on the system,
> including epoll-based network applications which were not intended to
> be used this way, or applications where increased CPU usage for lower
> latency network processing is unnecessary or not desirable.
>
> If the user wants to run one low latency epoll-based server application
> with epoll-based busy poll, but would like to run the rest of the
> applications on the system (which may also use epoll) without busy poll,
> this system-wide sysctl presents a significant problem.
>
> This change preserves the system-wide sysctl, but adds a mechanism (via
> ioctl) to enable or disable busy poll for epoll contexts as needed by
> individual applications, making epoll-based busy poll more usable.
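>
> For illustration, a minimal usage sketch (this assumes the updated
> <linux/eventpoll.h> uapi header from patch 4/4, which provides struct
> epoll_params and the EPIOCSPARAMS/EPIOCGPARAMS ioctls; the values
> below are arbitrary):
>
>   #include <string.h>
>   #include <sys/ioctl.h>
>   #include <linux/eventpoll.h>
>
>   static int enable_busy_poll(int epoll_fd)
>   {
>           struct epoll_params params;
>
>           /* Zero the struct so the __pad bytes are 0; the kernel
>            * verifies that __pad is zeroed.
>            */
>           memset(&params, 0, sizeof(params));
>
>           params.busy_poll_usecs = 64;  /* busy poll for up to 64 usecs;
>                                          * values > U32_MAX get -EINVAL */
>           params.busy_poll_budget = 8;  /* > NAPI_POLL_WEIGHT requires
>                                          * CAP_NET_ADMIN */
>           params.prefer_busy_poll = 1;  /* 0 or 1 */
>
>           return ioctl(epoll_fd, EPIOCSPARAMS, &params);
>   }
>
> A corresponding EPIOCGPARAMS call reads the current settings back into
> the same struct.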
>
> Note that this change includes an or (as of v4) instead of an xor. If the
> user has enabled both the system-wide sysctl and the per-epoll-context
> busy poll settings, then epoll should probably busy poll (vs being
> disabled).
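>
> Concretely, the check in patch 1/4 amounts to something like the
> following sketch (the eventpoll field name is illustrative;
> net_busy_loop_on() is the existing helper that reflects the sysctl):
>
>   static bool ep_busy_loop_on(struct eventpoll *ep)
>   {
>           /* busy poll if either the per-context setting or the
>            * system-wide sysctl is enabled
>            */
>           return !!ep->busy_poll_usecs || net_busy_loop_on();
>   }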
>
> Thanks,
> Joe
>
> v5 -> v6:
> - patch 1/4: no functional change, but commit message corrected to explain
> that an or (||) is being used instead of an xor.
>
> - patch 3/4 is a new patch which adds support for a per-epoll-context
> prefer busy poll setting.
>
> - patch 4/4 updated to allow getting/setting the per-epoll-context prefer
> busy poll setting; this setting is limited to either 0 or 1.
>
> v4 -> v5:
> - patch 3/3 updated to use memchr_inv to ensure that __pad is zero for
> the EPIOCSPARAMS ioctl. Recommended by Greg K-H [5], Dave Chinner [6],
> and Jiri Slaby [7].
>
> v3 -> v4:
> - patch 1/3 was updated to include an important functional change:
> ep_busy_loop_on was updated to use or (||) instead of xor (^). After
> thinking about it a bit more, I thought xor didn't make much sense.
> Enabling both the per-epoll context and the system-wide sysctl should
> probably enable busy poll, not disable it. So, or (||) makes more
> sense, I think.
>
> - patch 3/3 was updated:
> - to change the epoll_params fields to be __u64, __u16, and __u8 and
> to pad the struct to a multiple of 64bits. Suggested by Greg K-H [8]
> and Arnd Bergmann [9].
> - remove an unused pr_fmt, left over from the previous revision.
> - ioctl now returns -EINVAL when epoll_params.busy_poll_usecs >
> U32_MAX.
>
> v2 -> v3:
> - cover letter updated to mention why ioctl seems (to me) like a better
> choice vs a new syscall.
>
> - patch 3/4 was modified in 3 ways:
> - when an unknown ioctl is received, -ENOIOCTLCMD is returned instead
> of -EINVAL, as the ioctl documentation requires.
> - epoll_params.busy_poll_budget can only be set to a value larger than
> NAPI_POLL_WEIGHT if the caller is privileged (CAP_NET_ADMIN).
> Otherwise, -EPERM is returned.
> - busy poll specific ioctl code moved out to its own function. On
> kernels without busy poll support, -EOPNOTSUPP is returned. This also
> makes the kernel build robot happier without littering the code with
> more #ifdefs.
>
> - dropped patch 4/4 after Eric Dumazet's review of it when it was sent
> independently to the list [10].
>
> v1 -> v2:
> - cover letter updated to make a mention of napi_defer_hard_irqs and
> gro_flush_timeout as an added step 3 and to cite both Eric Dumazet's
> busy polling paper and a paper from University of Waterloo for
> additional context. Specifically calling out the xor in patch 1/4
> incase it is missed by reviewers.
>
> - Patch 2/4 has its commit message updated, but no functional changes.
> Commit message now describes that allowing for a settable budget helps
> to improve throughput and is more consistent with other busy poll
> mechanisms that allow a settable budget via SO_BUSY_POLL_BUDGET.
>
> - Patch 3/4 was modified to check if the epoll_params.busy_poll_budget
> exceeds NAPI_POLL_WEIGHT. The larger value is allowed, but an error is
> printed. This was done for consistency with netif_napi_add_weight,
> which does the same.
>
> - Patch 3/4 the struct epoll_params was updated to fix the type of the
> data field; it was uint8_t and was changed to u8.
>
> - Patch 4/4 added to check if SO_BUSY_POLL_BUDGET exceeds
> NAPI_POLL_WEIGHT. The larger value is allowed, but an error is
> printed. This was done for consistency with netif_napi_add_weight,
> which does the same.
>
> [1]: https://lore.kernel.org/lkml/65b1cb7f73a6a_250560294bd@xxxxxxxxxxxxxxxxxxxxxx.notmuch/
> [2]: https://lore.kernel.org/lkml/20170324170836.15226.87178.stgit@localhost.localdomain/
> [3]: https://netdevconf.info/2.1/papers/BusyPollingNextGen.pdf
> [4]: https://dl.acm.org/doi/pdf/10.1145/3626780
> [5]: https://lore.kernel.org/lkml/2024013001-prison-strum-899d@gregkh/
> [6]: https://lore.kernel.org/lkml/Zbm3AXgcwL9D6TNM@xxxxxxxxxxxxxxxxxxx/
> [7]: https://lore.kernel.org/lkml/efee9789-4f05-4202-9a95-21d88f6307b0@xxxxxxxxxx/
> [8]: https://lore.kernel.org/lkml/2024012551-anyone-demeaning-867b@gregkh/
> [9]: https://lore.kernel.org/lkml/57b62135-2159-493d-a6bb-47d5be55154a@xxxxxxxxxxxxxxxx/
> [10]: https://lore.kernel.org/lkml/CANn89i+uXsdSVFiQT9fDfGw+h_5QOcuHwPdWi9J=5U6oLXkQTA@xxxxxxxxxxxxxx/
>
> Joe Damato (4):
> eventpoll: support busy poll per epoll instance
> eventpoll: Add per-epoll busy poll packet budget
> eventpoll: Add per-epoll prefer busy poll option
> eventpoll: Add epoll ioctl for epoll_params
>
> .../userspace-api/ioctl/ioctl-number.rst | 1 +
> fs/eventpoll.c | 136 +++++++++++++++++-
> include/uapi/linux/eventpoll.h | 13 ++
> 3 files changed, 144 insertions(+), 6 deletions(-)

Coincidentally, we were looking into the same area and your patches are
super useful :-) Thank you for plumbing in prefer_busy_poll.

Acked-by: Stanislav Fomichev <sdf@xxxxxxxxxx>