Re: [PATCH] Add provision to busyloop for events in ep_poll.

From: Martin Karsten
Date: Wed Sep 04 2024 - 08:47:35 EST


On 2024-09-04 01:52, Naman Gulati wrote:
Thanks all for the comments and apologies for the delay in replying.
Stan and Joe, I’ve addressed some of the common concerns below.

On Thu, Aug 29, 2024 at 3:40 AM Joe Damato <jdamato@xxxxxxxxxx> wrote:

On Wed, Aug 28, 2024 at 06:10:11PM +0000, Naman Gulati wrote:
NAPI busypolling in ep_busy_loop loops on napi_poll and checks for new
epoll events after every napi_poll call. Checking just for new epoll
events in a tight loop in kernel context, without polling NAPI, delivers
latency gains to applications that are not interested in NAPI
busypolling with epoll.
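For readers skimming the thread, the control flow being discussed looks
roughly like this (pseudocode, heavily simplified from fs/eventpoll.c;
not the actual kernel code):

```
/* existing behavior: drive the NIC queue, then re-check for events */
while (!busy_poll_timed_out()) {
        napi_poll(napi_id, budget);     /* process up to `budget` packets */
        if (ep_events_available(ep))
                break;
}

/* proposed option: skip napi_poll and spin on the event check alone */
while (!busy_poll_timed_out()) {
        if (ep_events_available(ep))
                break;
}
```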

This patch adds an option to loop just for new events inside
ep_busy_loop, guarded by the EPIOCSPARAMS ioctl that controls epoll napi
busypolling.
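For context, the EPIOCSPARAMS ioctl referred to here was added in Linux
6.9; a userspace sketch of enabling it on an epoll fd might look like the
following (the parameter values and the helper name are illustrative, not
taken from the patch):

```c
/*
 * Sketch: enable epoll NAPI busy polling via EPIOCSPARAMS (Linux >= 6.9).
 * The usecs/budget values below are illustrative only.
 */
#include <errno.h>
#include <stdint.h>
#include <string.h>
#include <sys/epoll.h>
#include <sys/ioctl.h>

#ifndef EPIOCSPARAMS
/* Mirrors include/uapi/linux/eventpoll.h for pre-6.9 userspace headers. */
struct epoll_params {
	uint32_t busy_poll_usecs;  /* how long to busy poll, in usecs */
	uint16_t busy_poll_budget; /* max packets per napi_poll() call */
	uint8_t  prefer_busy_poll; /* boolean */
	uint8_t  __pad;            /* must be zero */
};
#define EPOLL_IOC_TYPE 0x8A
#define EPIOCSPARAMS _IOW(EPOLL_IOC_TYPE, 0x01, struct epoll_params)
#endif

/* Returns 0 on success, -1 with errno set otherwise. */
static int enable_epoll_busy_poll(int epfd, uint32_t usecs, uint16_t budget)
{
	struct epoll_params params;

	memset(&params, 0, sizeof(params));
	params.busy_poll_usecs = usecs;
	params.busy_poll_budget = budget;
	params.prefer_busy_poll = 1;
	return ioctl(epfd, EPIOCSPARAMS, &params);
}
```

The ioctl fails with ENOTTY on kernels that predate it, so callers
typically need a fallback path.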

This makes an API change, so I think that linux-api@xxxxxxxxxxxxxxx
needs to be CC'd ?

A comparison with neper tcp_rr shows that busylooping for events in
epoll_wait boosted throughput by ~3-7% and reduced median latency by
~10%.

To demonstrate the latency and throughput improvements, a comparison was
made of neper tcp_rr running with:
1. (baseline) No busylooping

Is there NAPI-based steering to threads via SO_INCOMING_NAPI_ID in
this case? More details, please, on locality. If there is no
NAPI-based flow steering in this case, perhaps the improvements you
are seeing are a result of both syscall overhead avoidance and data
locality?
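For context, NAPI-based steering here usually means reading each accepted
connection's NAPI ID and handing the fd to the worker thread that owns
that ID, so each epoll set maps to a single RX queue. A minimal sketch of
the lookup (the helper name is mine, not from the patch):

```c
/* Sketch: query the NAPI (RX queue) ID that delivered a socket's packets.
 * Requires Linux >= 4.12 built with CONFIG_NET_RX_BUSY_POLL. */
#include <errno.h>
#include <sys/socket.h>

#ifndef SO_INCOMING_NAPI_ID
#define SO_INCOMING_NAPI_ID 56 /* from asm-generic/socket.h */
#endif

/* Returns the NAPI ID (0 if none yet, e.g. loopback), or -1 on error. */
static int get_incoming_napi_id(int fd)
{
	int napi_id = 0;
	socklen_t len = sizeof(napi_id);

	if (getsockopt(fd, SOL_SOCKET, SO_INCOMING_NAPI_ID, &napi_id, &len) < 0)
		return -1;
	return napi_id;
}
```

An epoll-per-thread server would accept(), call this, and dispatch the fd
to the thread keyed by the returned ID, so that all sockets in one epoll
set share one NAPI queue.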


The benchmarks were run with no NAPI steering.

Regarding syscall overhead, I reproduced the above experiment with
mitigations=off and found similar results, which points to the gains
coming from more than just avoided syscall overhead.

I suppose the natural follow-up questions are:

1) Where do the gains come from? and

2) Would they materialize with a realistic application?

System calls have some overhead even with mitigations=off. In fact, I understand that on modern CPUs security mitigations are not that expensive to begin with. In a micro-benchmark that does nothing but bounce packets back and forth, this overhead might look more significant than it would in a realistic application.

It seems your change does not eliminate any processing from each packet's path, but instead eliminates processing in between packet arrivals. This might yield a small latency improvement, which might turn into a small throughput improvement in these micro-benchmarks, but one that could quickly evaporate once an application has actual work to do in between packet arrivals.

It would be good to know a little more about your experiments. You refer to 5 threads, but does that mean 5 cores were busy on both client and server during the experiment? Which of the client and server is the bottleneck? In your baseline experiment, are all 5 server cores busy? How many RX queues are in play, and how is interrupt routing configured?

Thanks,
Martin