Re: [PATCH net-next] virtio-net: switch to use build_skb() for small buffer

From: Jason Wang
Date: Wed Feb 22 2017 - 05:36:25 EST

On 2017å02æ22æ 11:42, Michael S. Tsirkin wrote:
On Wed, Feb 22, 2017 at 11:17:50AM +0800, Jason Wang wrote:

On 2017å02æ22æ 11:06, Michael S. Tsirkin wrote:
On Wed, Feb 22, 2017 at 10:58:08AM +0800, Jason Wang wrote:
On 2017å02æ21æ 22:37, Michael S. Tsirkin wrote:
On Tue, Feb 21, 2017 at 04:46:28PM +0800, Jason Wang wrote:
This patch switch to use build_skb() for small buffer which can have
better performance for both TCP and XDP (since we can work at page
before skb creation). It also remove lots of XDP codes since both
mergeable and small buffer use page frag during refill now.

Before | After
XDP_DROP(xdp1) 64B : 11.1Mpps | 14.4Mpps

Tested with xdp1/xdp2/xdp_ip_tx_tunnel and netperf.

Signed-off-by: Jason Wang<jasowang@xxxxxxxxxx>
I had a similar patch for mergeable too, though it's trickier there
as host has a lot of flexibility in sizing buffers.
Looks like a good intermediate step to me.
Yes, I think it's more tricky for the case of mergeable buffer:

1) we need reserve NET_SKB_PAD + NET_IP_ALIGN for each buffer, this will
break rx frag coalescing
2) need tailroom for skb_shinfo, so it won't work for all size of packet

Have you seen my prototype?
Not yet, please give me a pointer.

It works with qemu in practice,
just needs to cover a bunch of corner cases.
Sounds good, if you wish I can help to finalize it.


Acked-by: Michael S. Tsirkin<mst@xxxxxxxxxx>

Great! Here it is I think it's on top of 4.9 or so. Issues to address:
- when ring is very small this might cause us not to
be able to fit even a single packet in the whole queue.
Limit logic to when ring is big enough?
- If e.g. header is split across many buffers, this won't work
correctly. Detect and copy?

Could be more issues that I forgot. It might be a good idea to
first finish my buffer ctx hack. Will simplify the logic.

commit 77c177091a7c8473e3f0ed148afa2b4765b8b0f7
Author: Michael S. Tsirkin <mst@xxxxxxxxxx>
Date: Sun Feb 21 20:22:31 2016 +0200

virtio_net: switch to build_skb for mrg_rxbuf
For small packets data copy was observed to
take up about 15% CPU time. Switch to build_skb
and avoid the copy when using mergeable rx buffers.
As a bonus, medium-size skbs that fit in a page will be
completely linear.
Of course, we now need to lower the lower bound on packet size,
to make sure a sane number of skbs fits in rx socket buffer.
By how much? I don't know yet.
It might also be useful to prefetch the packet buffer since
net stack will likely use it soon.

TODO: it appears that Linux won't handle correctly the case of first
buffer being very small (or consisting exclusively of virtio header).
This is already the case for current code, so don't bother
testing yet.
TODO: might be unfair to the last packet in a fragment as we include
remaining space if any in its truesize.
Signed-off-by: Michael S. Tsirkin <mst@xxxxxxxxxx>

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index b425fa1..a6b996f 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -38,6 +38,8 @@ module_param(gso, bool, 0444);


- if (unlikely(!skb))
- return;
- hdr = skb_vnet_hdr(skb);
stats->rx_bytes += skb->len;
@@ -581,11 +624,14 @@ static int add_recvbuf_big(struct virtnet_info *vi, struct receive_queue *rq,
static unsigned int get_mergeable_buf_len(struct ewma_pkt_len *avg_pkt_len)
- const size_t hdr_len = sizeof(struct virtio_net_hdr_mrg_rxbuf);
+ unsigned int hdr;
unsigned int len;
- len = hdr_len + clamp_t(unsigned int, ewma_pkt_len_read(avg_pkt_len),
+ hdr = ALIGN(VNET_SKB_PAD + SKB_DATA_ALIGN(sizeof(struct skb_shared_info)),

So we need to reserve sizeof(struct skb_shared_info) for each buffer. If we have several 4K buffers for a single large packet, about 8% truesize were wasted?

+ len = hdr + clamp_t(unsigned int, ewma_pkt_len_read(avg_pkt_len),
@@ -601,8 +647,11 @@ static int add_recvbuf_mergeable(struct receive_queue *rq, gfp_t gfp)
if (unlikely(!skb_page_frag_refill(len, alloc_frag, gfp)))
return -ENOMEM;
- buf = (char *)page_address(alloc_frag->page) + alloc_frag->offset;
- ctx = mergeable_buf_to_ctx(buf, len);
+ buf = (char *)page_address(alloc_frag->page) + alloc_frag->offset +

Rx frag coalescing won't work any more and we will end up with more fraglist for large packets.


+ //ctx = mergeable_buf_to_ctx(buf - VNET_SKB_OFF, len);
alloc_frag->offset += len;
hole = alloc_frag->size - alloc_frag->offset;
@@ -615,8 +664,10 @@ static int add_recvbuf_mergeable(struct receive_queue *rq, gfp_t gfp)
len += hole;
alloc_frag->offset += hole;
+ ctx = mergeable_buf_to_ctx(buf - VNET_SKB_OFF, len);
- sg_init_one(rq->sg, buf, len);
+ sg_init_one(rq->sg, buf,
+ len - VNET_SKB_OFF - SKB_DATA_ALIGN(sizeof(struct skb_shared_info)));
err = virtqueue_add_inbuf(rq->vq, rq->sg, 1, (void *)ctx, gfp);
if (err < 0)