Re: [PATCH net-next] tuntap: introduce tx skb ring

From: Jason Wang
Date: Wed May 18 2016 - 06:42:21 EST




On 2016/05/18 17:55, Michael S. Tsirkin wrote:
On Wed, May 18, 2016 at 11:21:29AM +0200, Jesper Dangaard Brouer wrote:
On Wed, 18 May 2016 11:21:59 +0300
"Michael S. Tsirkin" <mst@xxxxxxxxxx> wrote:

On Wed, May 18, 2016 at 10:16:31AM +0200, Jesper Dangaard Brouer wrote:
On Tue, 17 May 2016 09:38:37 +0800 Jason Wang <jasowang@xxxxxxxxxx> wrote:
And if tx_queue_len is not a power of 2,
we probably need a modulus to calculate the capacity.
Is that really that important for speed?
Not sure, I can test.
In my experience, yes, adding a modulus does affect performance.
How about a simple

	if (unlikely(++idx >= size))
		idx = 0;
So you are exchanging an AND-operation with a mask for a
branch operation. If the branch predictor is good enough for the given CPU
and code-size/use-case, then it could be just as fast.
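For reference, the two wrap styles being compared look roughly like this
(an illustrative sketch with made-up names, not code from the patch):

	/* power-of-two ring: wrap the index with an AND mask */
	idx = (idx + 1) & (ring_size - 1);

	/* arbitrary ring size: wrap with a (usually well-predicted) branch */
	if (unlikely(++idx >= ring_size))
		idx = 0;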

I've actually played with a lot of different approaches:
https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/include/linux/alf_queue_helpers.h

I cannot remember the exact results. I do remember that micro-benchmarking
showed good results with the advanced "unroll" approach, but IPv4
forwarding, where I know the I-cache is getting evicted, showed the best
results with the simpler implementations.
This is all assuming you can somehow batch operations.
We can do this for transmit sometimes (when linux
is the source of the packets) but not always.
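As a rough sketch of what batching buys here (one index update amortized
over the whole burst), something like the helper below; the struct layout
and names are made up for illustration, and it deliberately omits all
locking and memory barriers:

	struct ring {
		unsigned int head;	/* producer index, free-running */
		unsigned int tail;	/* consumer index, free-running */
		unsigned int mask;	/* ring size - 1, size is a power of two */
		void	    *elems[];
	};

	/* Enqueue up to n pointers; returns how many were actually stored. */
	static unsigned int ring_enqueue_bulk(struct ring *r,
					      void **objs, unsigned int n)
	{
		unsigned int space = r->mask + 1 - (r->head - r->tail);
		unsigned int i;

		if (n > space)
			n = space;

		for (i = 0; i < n; i++)
			r->elems[(r->head + i) & r->mask] = objs[i];

		r->head += n;	/* one index update for the whole batch */
		return n;
	}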

Right, this sounds like a good solution.
Good idea.
I'm not that sure - it's clearly wasting memory.
Rounding up to a power of two. In this case I don't think the memory
waste is too high, as we are talking about elements of at most 16 bytes.
It almost doubles it.
E.g. a queue size of 10000 (rather common) will become 16K entries, wasting 6K.

It depends on the user; e.g. the default tx_queue_len is around 1000 for real cards. If we really care about the waste, we can add a threshold and fall back to a normal linked list during resizing.
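Just to put numbers on the rounding waste (in-kernel code would use
roundup_pow_of_two() from <linux/log2.h>; the stand-alone sketch below,
assuming 16-byte entries, only exists to make the arithmetic explicit):

	#include <stdio.h>

	/* Round v up to the next power of two (for v >= 1). */
	static unsigned long roundup_pow2(unsigned long v)
	{
		unsigned long p = 1;

		while (p < v)
			p <<= 1;
		return p;
	}

	int main(void)
	{
		unsigned long lens[] = { 1000, 10000 };	/* queue lengths from the thread */
		size_t i;

		for (i = 0; i < sizeof(lens) / sizeof(lens[0]); i++) {
			unsigned long ring = roundup_pow2(lens[i]);

			printf("tx_queue_len %5lu -> ring %5lu, waste %4lu entries (%lu bytes)\n",
			       lens[i], ring, ring - lens[i],
			       (ring - lens[i]) * 16UL);
		}
		return 0;
	}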


I am concerned about memory in another way. We need to keep these
arrays/rings small, due to data cache usage. A 4096-entry ring is bad
because e.g. 16*4096 = 65536 bytes, while a typical L1 cache is 32K-64K. As
this is a circular buffer, we walk over all of this memory all the time,
thus evicting the L1 cache.
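To put that footprint in perspective (the 4096 case is the one above; 256
and 1024 are only added for comparison, and 32 KB is an assumed typical
L1d size):

	#include <stdio.h>

	int main(void)
	{
		const unsigned int entry_size = 16;		/* bytes per ring slot */
		const unsigned int l1d = 32 * 1024;		/* assumed L1 data cache */
		unsigned int entries[] = { 256, 1024, 4096 };
		size_t i;

		for (i = 0; i < sizeof(entries) / sizeof(entries[0]); i++)
			printf("%4u entries -> %6u bytes (%3u%% of a 32 KB L1d)\n",
			       entries[i], entries[i] * entry_size,
			       entries[i] * entry_size * 100 / l1d);
		return 0;
	}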
Depends on the usage, I guess.
The entries pointed to are much bigger, and you are
going to access them - is this really an issue?
If yes, this shouldn't be that hard to fix ...

--
Best regards,
Jesper Dangaard Brouer
MSc.CS, Principal Kernel Engineer at Red Hat
Author of http://www.iptv-analyzer.org
LinkedIn: http://www.linkedin.com/in/brouer