Re: [net-next 0/3] Per epoll context busy poll support

From: Eric Dumazet
Date: Wed Jan 24 2024 - 09:51:49 EST


On Wed, Jan 24, 2024 at 3:20 PM Joe Damato <jdamato@xxxxxxxxxx> wrote:
>
> On Wed, Jan 24, 2024 at 09:20:09AM +0100, Eric Dumazet wrote:
> > On Wed, Jan 24, 2024 at 3:54 AM Joe Damato <jdamato@xxxxxxxxxx> wrote:
> > >
> > > Greetings:
> > >
> > > TL;DR This builds on commit bf3b9f6372c4 ("epoll: Add busy poll support to
> > > epoll with socket fds.") by allowing user applications to enable
> > > epoll-based busy polling and set a busy poll packet budget on a per epoll
> > > context basis.
> > >
> > > To allow for this, two ioctls have been added for epoll contexts for
> > > getting and setting a new struct, struct epoll_params.
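> > >
> > > As a rough illustration only (the ioctl request names and the exact
> > > layout of struct epoll_params below are assumptions for this sketch,
> > > not necessarily the final UAPI), usage from an application could look
> > > like this:
> > >
> > > #include <stdint.h>
> > > #include <sys/epoll.h>
> > > #include <sys/ioctl.h>
> > >
> > > /* Hypothetical UAPI; assumed to be provided by the patched headers,
> > >  * including the EPIOCGPARAMS/EPIOCSPARAMS request numbers. */
> > > struct epoll_params {
> > >         uint32_t busy_poll_usecs;  /* 0 disables busy poll */
> > >         uint16_t busy_poll_budget; /* packets per poll attempt */
> > >         uint8_t  __pad[2];
> > > };
> > >
> > > int main(void)
> > > {
> > >         int epfd = epoll_create1(0);
> > >         struct epoll_params params = { 0 };
> > >
> > >         ioctl(epfd, EPIOCGPARAMS, &params); /* read current */
> > >         params.busy_poll_usecs = 200;       /* poll up to 200 usec */
> > >         params.busy_poll_budget = 8;        /* per-poll packet budget */
> > >         ioctl(epfd, EPIOCSPARAMS, &params); /* enable busy poll */
> > >         return 0;
> > > }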
> > >
> > > This makes epoll-based busy polling much more usable for user
> > > applications than the current system-wide sysctl and hardcoded budget.
> > >
> > > Longer explanation:
> > >
> > > Presently epoll has support for a very useful form of busy poll based on
> > > the incoming NAPI ID (see also: SO_INCOMING_NAPI_ID [1]).
> > >
> > > This form of busy poll allows epoll_wait to drive NAPI packet processing,
> > > which enables a few interesting user application designs that can
> > > reduce latency and also potentially improve L2/L3 cache hit rates by
> > > deferring NAPI until userland has finished its work.
> > >
> > > The documentation available on this is, IMHO, a bit confusing, so please
> > > allow me to explain how one might use this:
> > >
> > > 1. Ensure each application thread has its own epoll instance mapping
> > > 1-to-1 with NIC RX queues. An n-tuple filter would likely be used to
> > > direct connections with specific dest ports to these queues.
> > >
> > > 2. Optionally: Set up IRQ coalescing for the NIC RX queues where busy
> > > polling will occur. This can help prevent the userland app from being
> > > preempted by a hard IRQ while it is running. Note this means that
> > > userland must take care to call epoll_wait regularly and not spend too
> > > long in userland, since it now drives NAPI via epoll_wait.
> > >
> > > 3. Ensure that all incoming connections added to an epoll instance
> > > have the same NAPI ID. This can be done with a BPF filter when
> > > SO_REUSEPORT is used, or with getsockopt + SO_INCOMING_NAPI_ID when a
> > > single accept thread dispatches incoming connections to worker threads
> > > (see the sketch just after this list).
> > >
> > > 4. Lastly, busy poll must be enabled via a sysctl
> > > (/proc/sys/net/core/busy_poll).
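> > >
> > > To make step 3 concrete, here is a minimal sketch of the getsockopt +
> > > SO_INCOMING_NAPI_ID check an accept thread might perform (dispatch
> > > logic and most error handling omitted):
> > >
> > > #include <stdio.h>
> > > #include <sys/socket.h>
> > >
> > > #ifndef SO_INCOMING_NAPI_ID
> > > #define SO_INCOMING_NAPI_ID 56 /* include/uapi/asm-generic/socket.h */
> > > #endif
> > >
> > > /* After accept(), ask the kernel which NAPI instance last delivered
> > >  * packets for this socket; the accept thread can then hand the fd to
> > >  * the worker thread whose epoll instance handles that NAPI ID. */
> > > static unsigned int get_napi_id(int fd)
> > > {
> > >         unsigned int napi_id = 0;
> > >         socklen_t len = sizeof(napi_id);
> > >
> > >         if (getsockopt(fd, SOL_SOCKET, SO_INCOMING_NAPI_ID,
> > >                        &napi_id, &len))
> > >                 perror("getsockopt(SO_INCOMING_NAPI_ID)");
> > >
> > >         return napi_id; /* 0: no NAPI ID recorded yet */
> > > }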
> > >
> > > The unfortunate part about step 4 above is that it enables busy poll
> > > system-wide, which affects all user applications on the system,
> > > including epoll-based network applications which were not intended to
> > > be used this way, or applications where trading increased CPU usage for
> > > lower latency network processing is unnecessary or not desirable.
> > >
> > > If the user wants to run one low latency epoll-based server application
> > > with epoll-based busy poll, but would like to run the rest of the
> > > applications on the system (which may also use epoll) without busy poll,
> > > this system-wide sysctl presents a significant problem.
> > >
> > > This change preserves the system-wide sysctl, but adds a mechanism (via
> > > ioctl) to enable or disable busy poll for epoll contexts as needed by
> > > individual applications, making epoll-based busy poll more usable.
> > >
> >
> > I think this description missed the napi_defer_hard_irqs and
> > gro_flush_timeout settings?
>
> I'm not sure if those settings are strictly related to the change I am
> proposing which makes epoll-based busy poll something that can be
> enabled/disabled on a per-epoll context basis and allows the budget to be
> set as well, but maybe I am missing something? Sorry for my
> misunderstanding if so.
>
> IMHO: a single system-wide busy poll setting is difficult to use
> properly, and it is unfortunate that the packet budget is hardcoded. It
> would be extremely useful to be able to set both of these on a per-epoll
> basis, and I think my suggested change helps to solve this.
>
> Please let me know.
>
> Re the two settings you noted:
>
> I didn't mention those in the interest of brevity, but yes they can be used
> instead of or in addition to what I've described above.
>
> While those settings are very useful, IMHO, they have their own issues
> because they are system-wide as well. If they were settable per-NAPI, that
> would make it much easier to use them because they could be enabled for the
> NAPIs which are being busy-polled by applications that support busy-poll.
>
> Imagine you have 3 types of apps running side-by-side:
> - A low latency epoll-based busy poll app,
> - An app where latency doesn't matter as much, and
> - A latency sensitive legacy app which does not yet support epoll-based
> busy poll.
>
> In the first two cases, the settings you mention would either be helpful
> or make no difference, but in the third case the system-wide impact might
> be undesirable because having IRQs fire might be important to keep latency
> down.
>
> If your comment was more that my cover letter should have mentioned these,
> I can include that in a future cover letter or suggest some kernel
> documentation which will discuss all of these features and how they relate
> to each other.
>
> >
> > I would think that if an application really wants to make sure its
> > thread is the only one
> > eventually calling napi->poll(), we must make sure NIC interrupts stay masked.
> >
> > Current implementations of busy poll always release NAPI_STATE_SCHED bit when
> > returning to user space.
> >
> > It seems you want to make sure the application and only the
> > application calls the napi->poll()
> > at chosen times.
> >
> > Some kind of contract is needed, and the presence of the hrtimer
> > (currently only driven from dev->gro_flush_timeout)
> > would allow us to do that correctly.
> >
> > Whenever we 'trust' user space to perform the napi->poll() shortly, we
> > also want to arm the hrtimer so we can eventually detect that the
> > application took too long and restart the other mechanisms (NIC IRQ based).
>
> There is another change [1] I've been looking at from a research paper [2]
> which does something similar to what you've described above -- it keeps
> IRQs suppressed during busy polling. The paper suggests a performance
> improvement is measured when using a mechanism like this to keep IRQs off.
> Please see the paper for more details.
>
> I haven't had a chance to reach out to the authors or to tweak this patch
> to attempt an RFC / submission for it, but it seems fairly promising in my
> initial synthetic tests.
>
> When I tested their patch, as you might expect, no IRQs were generated at
> all for the NAPIs that were being busy polled, but the rest of the
> NAPIs and queues were generating IRQs as expected.
>
> Regardless of the above patch: I think my proposed change is helpful and
> the IRQ suppression bit can be handled in a separate change in the future.
> What do you think?
>
> > Note that we added the kthread based napi polling, and we are working
> > to add a busy polling feature to these kthreads,
> > allowing us to completely mask NIC interrupts and further reduce latencies.
>
> I am aware of kthread based NAPI polling, yes, but I was not aware that
> busy polling was being considered as a feature for them, thanks for the
> heads-up.
>
> > Thank you
>
> Thanks for your comments - I appreciate your time and attention.
>
> Could you let me know if your comments are meant as a NACK or similar?

Patch #2 needs the 'why' part, and why would we allow user space to
ask to poll up to 65535 packets...
There is a reason we have a warning in place when a driver attempts to
set a budget bigger than 64.
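
For reference, the check I have in mind is (roughly) the one in
netif_napi_add_weight() in net/core/dev.c, where NAPI_POLL_WEIGHT is 64:

	if (weight > NAPI_POLL_WEIGHT)
		netdev_err_once(dev, "%s() called with weight %d\n",
				__func__, weight);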

You cited recent papers; I wrote this one specific to Linux busy
polling: https://netdevconf.info/2.1/papers/BusyPollingNextGen.pdf

Busy polling had been in the pipe when Wei sent her patches and follow-ups.

cb038357937ee4f589aab2469ec3896dce90f317 ("net: fix race between napi kthread mode and busy poll")
5fdd2f0e5c64846bf3066689b73fc3b8dddd1c74 ("net: add sysfs attribute to control napi threaded mode")
29863d41bb6e1d969c62fdb15b0961806942960e ("net: implement threaded-able napi poll loop support")

I am saying that I am currently working on the kthread busy polling
implementation, after fixing two bugs in SCTP and UDP (making me wonder
if busy polling is really used these days).

I am also considering unifying napi_threaded_poll() and napi_busy_loop(),
and seeing your patches come in makes this work more challenging.