Re: [PATCH] softirq: let ksoftirqd do its job

From: Jesper Dangaard Brouer
Date: Thu Sep 01 2016 - 06:39:09 EST



On Wed, 31 Aug 2016 16:29:56 -0700 Rick Jones <rick.jones2@xxxxxxx> wrote:
> On 08/31/2016 04:11 PM, Eric Dumazet wrote:
> > On Wed, 2016-08-31 at 15:47 -0700, Rick Jones wrote:
> >> With regard to drops, are both of you sure you're using the same socket
> >> buffer sizes?
> >
> > Does it really matter ?
>
> At least at points in the past I have seen different drop counts at the
> SO_RCVBUF based on using (sometimes much) larger sizes. The hypothesis
> I was operating under at the time was that this dealt with those
> situations where the netserver was held-off from running for "a little
> while" from time to time. It didn't change things for a sustained
> overload situation though.

Yes, Rick, your hypothesis corresponds to my measurements. The
userspace program is held-off from running for "a little while" from
time to time. I've measured this with perf sched record/latency. It
is sort of a natural scheduler characteristic.
The userspace UDP socket program consumes/needs more cycles to perform
its job than the kernel's ksoftirqd does. Thus, the UDP-prog uses up its
sched time-slice, and periodically ksoftirqd gets scheduled multiple
times, because the UDP-prog doesn't have any credits any longer.
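
You can see this effect directly as involuntary context switches on
the UDP receiver. A quick sketch using pidstat from sysstat (here
"udp_sink" is just a placeholder for whatever your receiver program
is called):

# nvcswch/s = non-voluntary context switches, i.e. the UDP-prog was
# preempted because it used up its time-slice
pidstat -w -p $(pgrep udp_sink) 1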

WARNING: Do not increase the socket queue size to paper over this issue;
it is the WRONG solution and will give horrible latency issues.

With the above warning in place, I can tell you: yes, you are also
right that increasing the socket buffer size can be used to
mitigate/hide the packet drops. You can even increase the socket
buffer size so much that the drop problem "goes away". The queue
simply needs to be deep enough to absorb the worst/maximum time the
UDP-prog was scheduled out. The hidden effect that makes this work
(without contradicting queue theory) is that it also slows down /
costs more cycles for ksoftirqd/NAPI, as it costs more to enqueue a
packet than to drop it on a full queue.

You can measure the sched "Maximum delay" using:
sudo perf sched record -C 0 sleep 10
sudo perf sched latency

On my setup I measured a "Maximum delay" of approx 9 ms. Given an
incoming packet rate of 2.4Mpps (of which 880Kpps reach the UDP-prog),
and knowing the network stack accounts skb->truesize (approx 2048
bytes on this driver), I can calculate that I need approx 45MBytes of
buffer ((2.4*10^6)*(9/1000)*2048 = 44.2MBytes).
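
For convenience, the same back-of-the-envelope calculation as a small
shell snippet (the rate, max sched delay and truesize values are the
ones from my setup above, so plug in your own numbers):

pps=2400000     # incoming packet rate (packets/sec)
delay_ms=9      # perf sched "Maximum delay" (ms)
truesize=2048   # skb->truesize on this driver (bytes)
awk -v pps=$pps -v d=$delay_ms -v ts=$truesize \
 'BEGIN { printf "need approx %.1f MBytes rcvbuf\n", pps*(d/1000)*ts/10^6 }'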

The PPS measurement comes from:

$ nstat > /dev/null && sleep 1 && nstat
#kernel
IpInReceives 2335926 0.0
IpInDelivers 2335925 0.0
UdpInDatagrams 880086 0.0
UdpInErrors 1455850 0.0
UdpRcvbufErrors 1455850 0.0
IpExtInOctets 107453056 0.0
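
If you want the delivered vs dropped rate directly, a rough one-liner
over the same nstat counters (a sketch, assuming the 1 second interval):

nstat > /dev/null && sleep 1 && nstat | \
 awk '/^UdpInDatagrams/ {ok=$2} /^UdpRcvbufErrors/ {drop=$2} \
 END { if (ok+drop) printf "delivered %d pps, dropped %d pps (%.1f%%)\n", \
 ok, drop, 100*drop/(ok+drop) }'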

Changing the queue size to 50MBytes:
sysctl -w net/core/rmem_max=$((50*1024*1024)) ;\
sysctl -w net.core.rmem_default=$((50*1024*1024))
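
Remember to restart the UDP-prog afterwards, as the socket buffer
default is picked up at socket creation time. A quick sanity check of
the new defaults:

sysctl net.core.rmem_max net.core.rmem_default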

The new result looks "nice", with no drops and 1.42Mpps delivered to
the UDP-prog, but in reality it is not nice for latency...

$ nstat > /dev/null && sleep 1 && nstat
#kernel
IpInReceives 1425013 0.0
IpInDelivers 1425017 0.0
UdpInDatagrams 1432139 0.0
IpExtInOctets 65539328 0.0
IpExtInNoECTPkts 1424771 0.0

Tracking of queue size, max, min and average:

while (true); do netstat -uan | grep '0.0.0.0:9'; sleep 0.3; done |
awk 'BEGIN {max=0;min=0xffffffff;sum=0;n=0} \
{if ($2 > max) max=$2;
if ($2 < min) min=$2;
n++; sum+=$2;
printf "%s Recv-Q: %d max: %d min: %d ave: %.3f\n",$1,$2,max,min,sum/n;}';
Result:
udp Recv-Q: 23624832 max: 47058176 min: 4352 ave: 25092687.698

I see a max queue of 47MBytes, and worse, an average standing queue
of 25MBytes, which is really bad for the latency seen by the
application. Having this much outstanding memory is also bad for CPU
cache effectiveness, and it stresses the memory allocator.
I'm actually using this huge queue "misconfig" to stress the page
allocator and my page_pool implementation into worst-case situations ;-)
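
To put a rough number on the latency cost: take the average standing
queue from above, count it in skb->truesize units (assuming Recv-Q is
truesize-accounted) and divide by the delivered rate. A sketch with
the numbers from this run:

awk 'BEGIN { q=25092688; ts=2048; pps=1420000;
 printf "approx %.1f ms standing-queue latency\n", (q/ts)/pps*1000 }'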

--
Best regards,
Jesper Dangaard Brouer
MSc.CS, Principal Kernel Engineer at Red Hat
Author of http://www.iptv-analyzer.org
LinkedIn: http://www.linkedin.com/in/brouer