Re: epoll_wait() performance

From: Eric Dumazet
Date: Wed Nov 27 2019 - 12:46:50 EST




On 11/27/19 9:30 AM, David Laight wrote:
> From: Paolo Abeni
>> Sent: 27 November 2019 16:27
> ...
>> @David: If I read your message correctly, the pkt rate you are dealing
>> with is quite low... are we talking about tput or latency? I guess
>> latency could be measurably higher with recvmmsg() than with other
>> syscalls. How do you measure the relative performance of recvmmsg()
>> and recv() ? with micro-benchmark/rdtsc()? Am I right that you are
>> usually getting a single packet per recvmmsg() call?
>
> The packet rate per socket is low, typically one packet every 20ms.
> This is RTP, so telephony audio.
> However we have a lot of audio channels and hence a lot of sockets.
> So there can be 1000s of sockets we need to receive the data from.
> The test system I'm using has 16 E1 TDM links each of which can handle
> 31 audio channels.
> Forwarding all these to/from RTP (one of the things it might do) is 496
> audio channels - so 496 RTP sockets and 496 RTCP ones.
> Although the test I'm doing is pure RTP and doesn't use TDM.
>
> What I'm measuring is the total time taken to receive all the packets
> (on all the sockets) that are available to be read every 10ms.
> So poll + recv + add_to_queue.
> (The data processing is done by other threads.)
> I use the time difference (actually CLOCK_MONOTONIC - from rdtsc)
> to generate a 64 entry (self scaling) histogram of the elapsed times.
> Then look for the histogram's peak value.
> (I need to work on the max value, but that is a different (more important!) problem.)
> Depending on the poll/recv method used this takes 1.5 to 2ms
> in each 10ms period.
> (It is faster if I run the cpu at full speed, but it usually idles along
> at 800MHz.)
>
> If I use recvmmsg() I only expect to see one packet because there
> is (almost always) only one packet on each socket every 20ms.
> However there might be more than one, and if there is they
> all need to be read (well at least 2 of them) in that block of receives.
>
> The outbound traffic goes out through a small number of raw sockets.
> Annoyingly we have to work out the local IPv4 address that will be used
> for each destination in order to calculate the UDP checksum.
> (I've a pending patch to speed up the x86 checksum code on a lot of
> cpus.)
>
> David
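
For readers following the measurement method quoted above, here is a minimal sketch of a self-scaling histogram. The rescaling rule (merge adjacent buckets and double the bucket width whenever a sample falls outside the current range) is an assumption; the quoted message does not say how the scaling works. The class name `ScalingHistogram` and its parameters are hypothetical.

```python
class ScalingHistogram:
    """Fixed number of buckets; the bucket width doubles when a sample overflows.

    Assumption: 'self scaling' means merging bucket pairs on overflow.
    The number of buckets must be even for the merge to work.
    """

    def __init__(self, buckets=64, width_ns=1000):
        if buckets % 2:
            raise ValueError("buckets must be even")
        self.counts = [0] * buckets
        self.width_ns = width_ns

    def _rescale(self):
        n = len(self.counts)
        # Merge each adjacent pair into one bucket, then pad with empty buckets.
        merged = [self.counts[i] + self.counts[i + 1] for i in range(0, n, 2)]
        self.counts = merged + [0] * (n - len(merged))
        self.width_ns *= 2

    def add(self, elapsed_ns):
        if elapsed_ns < 0:
            return  # ignore bogus samples
        while elapsed_ns >= self.width_ns * len(self.counts):
            self._rescale()
        self.counts[elapsed_ns // self.width_ns] += 1

    def peak(self):
        """Return the (low, high) bounds in ns of the fullest bucket."""
        i = max(range(len(self.counts)), key=self.counts.__getitem__)
        return i * self.width_ns, (i + 1) * self.width_ns
```

Each poll + recv + add_to_queue pass would be timed with a monotonic clock (for example `time.monotonic_ns()` before and after) and the difference passed to `add()`.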

A QUIC server handles hundreds of thousands of 'UDP flows', all using only one UDP socket
per cpu.

This is really the only way to scale, and does not need kernel changes to efficiently
organize millions of UDP sockets (a huge memory footprint even if we get the way
we manage them right).

Given that UDP has no state, there is really no point trying to have one UDP
socket per flow, and having to deal with epoll()/poll() overhead.
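
As a rough illustration of the single-socket approach, the sketch below receives datagrams for many flows on one UDP socket and sorts them by sender address in user space. It is only a sketch under assumptions: the helper names `make_udp_socket` and `drain` are hypothetical, keying a flow on the peer (address, port) pair is one possible choice (an RTP application might key on SSRC instead), and SO_REUSEPORT is shown as one way to let several sockets (say, one per cpu) share a port on Linux.

```python
import socket


def make_udp_socket(host="127.0.0.1", port=0):
    """Create one non-blocking UDP socket that serves many flows."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    # Assumption: SO_REUSEPORT (where available) lets one socket per cpu
    # bind the same port, with the kernel spreading packets across them.
    if hasattr(socket, "SO_REUSEPORT"):
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
    sock.bind((host, port))
    sock.setblocking(False)
    return sock


def drain(sock, flows, max_packets=64):
    """Read up to max_packets queued datagrams and file each under its flow.

    flows maps a peer (address, port) tuple to the list of payloads seen.
    Returns the number of datagrams read.
    """
    count = 0
    while count < max_packets:
        try:
            data, peer = sock.recvfrom(2048)
        except BlockingIOError:
            break  # receive queue is empty
        flows.setdefault(peer, []).append(data)
        count += 1
    return count
```

With this layout the application waits on a single file descriptor (or one per cpu) instead of polling thousands of sockets, and the per-flow lookup is a plain hash-table access in user space.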