RE: epoll_wait() performance

From: David Laight
Date: Wed Nov 27 2019 - 05:40:41 EST


From: Marek Majkowski
> Sent: 27 November 2019 09:51
> On Fri, Nov 22, 2019 at 12:18 PM David Laight <David.Laight@xxxxxxxxxx> wrote:
> > I'm trying to optimise some code that reads UDP messages (RTP and RTCP) from a lot of sockets.
> > The 'normal' data pattern is that there is no data on half the sockets (RTCP) and
> > one message every 20ms on the others (RTP).
> > However there can be more than one message on each socket, and they all need to be read.
> > Since the code processing the data runs every 10ms, the message receiving code
> > also runs every 10ms (a massive gain when using poll()).
>
> How many sockets we are talking about? More like 500 or 500k? We had very
> bad experience with UDP connected sockets, so if you are using UDP connected
> sockets, the RX path is super slow, mostly consumed by udp_lib_lookup()
> https://elixir.bootlin.com/linux/v5.4/source/net/ipv4/udp.c#L445

For my tests I have 900, but that is nothing like the limit for the application.
The test system is over 50% idle and running at its minimal clock speed.
The sockets are all unconnected; I believe the remote application is allowed
to change the source IP mid-flow!

...
> > While using recvmmsg() to read multiple messages might seem a good idea, it is much
> > slower than recv() when there is only one message (even recvmsg() is a lot slower).
> > (I'm not sure why the code paths are so slow, I suspect it is all the copy_from_user()
> > and faffing with the user iov[].)
> >
> > So using poll() we repoll the fd after calling recv() to find if there is a second message.
> > However the second poll has a significant performance cost (but less than using recvmmsg()).
>
> That sounds wrong. Single recvmmsg(), even when receiving only a
> single message, should be faster than two syscalls - recv() and
> poll().

My suspicion is that the two extra copy_from_user() calls needed for each recvmsg() are a
significant overhead, most likely due to the crappy code that tries to stop
the kernel buffer being overrun.
I need to run the tests on a system with a 'home built' kernel to see how much
difference this makes (by seeing how much slower duplicating the copy makes it).
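
Roughly, each recvmmsg() call has to set up something like the following
(VLEN, buffer sizes and names are only illustrative, not our actual code) -
a struct mmsghdr and iovec per message that the kernel then copies in from
user space:

#define _GNU_SOURCE
#include <sys/socket.h>

#define VLEN 8

static int recv_burst(int fd, char bufs[VLEN][2048])
{
        struct mmsghdr msgs[VLEN] = { 0 };
        struct iovec iov[VLEN];
        int i;

        for (i = 0; i < VLEN; i++) {
                iov[i].iov_base = bufs[i];
                iov[i].iov_len = sizeof(bufs[i]);
                msgs[i].msg_hdr.msg_iov = &iov[i];
                msgs[i].msg_hdr.msg_iovlen = 1;
        }

        /* Even when only one datagram is queued, all of this still has
         * to be set up and copied into the kernel on every call. */
        return recvmmsg(fd, msgs, VLEN, MSG_DONTWAIT, NULL);
}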

The system call cost of poll() gets amortised over a reasonable number of sockets.
So doing poll() on a socket with no data is a lot faster than the setup for recvmsg(),
even allowing for looking up the fd.
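
The pattern we use is essentially this (names and buffer sizes are just for
illustration):

#include <poll.h>
#include <sys/types.h>
#include <sys/socket.h>

static void drain_socket(int fd)
{
        char buf[2048];

        for (;;) {
                ssize_t len = recv(fd, buf, sizeof(buf), MSG_DONTWAIT);

                if (len < 0)
                        break;          /* EAGAIN, or a real error */

                /* process_packet(buf, len); */

                /* Repoll just this fd to see if a second message arrived;
                 * the common case is that there isn't one. */
                struct pollfd pfd = { .fd = fd, .events = POLLIN };
                if (poll(&pfd, 1, 0) <= 0 || !(pfd.revents & POLLIN))
                        break;
        }
}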

This could be fixed by an extra flag to recvmmsg() to indicate that you only really
expect one message and to call the poll() function before each subsequent receive.

There is also the 'reschedule' that Eric added to the loop in recvmmsg().
I don't know how much that actually costs.
In this case the process is likely to be running at an RT priority and pinned to a cpu.
In some cases the cpu is also reserved (at boot time) so that 'random' other code can't use it.

We really do want to receive all these UDP packets in a timely manner,
although very low latency isn't itself an issue.
The data is telephony audio with (typically) one packet every 20ms.
The code only looks for packets every 10ms - that helps no end since, in principle,
only a single poll()/epoll_wait() call (on all the sockets) is needed every 10ms.
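
So each 10ms tick is roughly this (drain_socket() being the sketch above,
everything else illustrative):

#include <poll.h>

static void run_tick(struct pollfd *fds, int nfds)
{
        int i;

        /* One poll() over all the sockets per tick; zero timeout
         * because the caller provides the 10ms cadence. */
        if (poll(fds, nfds, 0) <= 0)
                return;

        for (i = 0; i < nfds; i++) {
                if (fds[i].revents & POLLIN)
                        drain_socket(fds[i].fd);
        }
}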

> > If we use epoll() in level triggered mode a second epoll_wait() call (after the recv()) will
> > indicate that there is more data.
> >
> > For poll() it doesn't make much difference how many fd are supplied to each system call.
> > The overall performance is much the same for 32, 64 or 500 (all the sockets).
> >
> > For epoll_wait() that isn't true.
> > Supplying a buffer that is shorter than the list of 'ready' fds gives a massive penalty.
> > With a buffer long enough for all the events epoll() is somewhat faster than poll().
> > But with a 64 entry buffer it is much slower.
> > I've looked at the code and can't see why splicing the unread events back is expensive.
>
> Again, this is surprising.

Yep, but repeatedly measurable.
If no one else has seen this I'll have to try to instrument it in the kernel somehow.
I'm pretty sure it isn't a userspace issue.
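
For reference, the two cases being compared are simply these (level-triggered
epoll, sizes illustrative):

#include <sys/epoll.h>

#define NSOCKETS 900

/* Buffer big enough for every possible ready fd: nothing has to be
 * spliced back onto the ready list inside the kernel. */
static int wait_all(int epfd, struct epoll_event evs[NSOCKETS])
{
        return epoll_wait(epfd, evs, NSOCKETS, 0);
}

/* Short 64-entry buffer: any leftover ready events get put back,
 * and this is where the (so far unexplained) penalty shows up. */
static int wait_some(int epfd, struct epoll_event evs[64])
{
        return epoll_wait(epfd, evs, 64, 0);
}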

> > I'd like to be able to change the code so that multiple threads are reading from the epoll fd.
> > This would mean I'd have to run it in edge mode and each thread reading a smallish
> > block of events.
> > Any suggestions on how to efficiently read the 'unusual' additional messages from
> > the sockets?
>
> Random ideas:
> 1. Perhaps reducing the number of sockets could help - with iptables or TPROXY.
> TPROXY has some performance impact though, so be careful.

We'd then have to use recvmsg() - which is measurably slower than recv().

> 2. I played with io_submit for syscall batching, but in my experiments I wasn't
> able to show performance boost:
> https://blog.cloudflare.com/io_submit-the-epoll-alternative-youve-never-heard-about/
> Perhaps the newer io_uring with networking support could help:
> https://twitter.com/axboe/status/1195047335182524416

You need an OS that actually does async IO - like RSX11/M or Windows.
Just deferring the request to a kernel thread can mean you get stuck
behind other processes doing blocking reads.

> 3. SO_BUSYPOLL drastically reduces latency, but I've only used it with
> a single socket..

We need to minimise the cpu cost more than the absolute latency.

> 4. If you want to get number of outstanding packets, there is SIOCINQ
> and SO_MEMINFO.

That's another system call.
poll() can tell us whether there is any data on a lot of sockets more quickly.
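
SIOCINQ is just an ioctl per socket - i.e. exactly the extra system call
I'd rather avoid (for UDP it reports the size of the next pending datagram):

#include <sys/ioctl.h>
#include <linux/sockios.h>

static int pending_bytes(int fd)
{
        int n = 0;

        if (ioctl(fd, SIOCINQ, &n) < 0)
                return -1;
        return n;       /* size of the next queued datagram, 0 if none */
}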

David
