Re: [PATCH net-next v3 2/2] vsock/virtio: avoid queuing packets when work queue is empty

From: Stefano Garzarella
Date: Fri Jul 12 2024 - 10:58:37 EST


On Thu, Jul 11, 2024 at 04:58:47PM GMT, Luigi Leonardi via B4 Relay wrote:
From: Luigi Leonardi <luigi.leonardi@xxxxxxxxxxx>

Introduce an optimization in virtio_transport_send_pkt:
when the work queue (send_pkt_queue) is empty the packet is

Note: send_pkt_queue is just a queue of sk_buffs; it is not really a work queue.

put directly in the virtqueue increasing the throughput.

Why?

I'd write something like this, but feel free to change it:

When the driver needs to send new packets to the device, it always
queues the new sk_buffs into an intermediate queue (send_pkt_queue)
and schedules a worker (send_pkt_work) to then queue them into the
virtqueue exposed to the device.

This increases the chance of batching, but also introduces a lot of
latency into the communication. So we can optimize this path by
adding a fast path to be taken when there is no element in the
intermediate queue, there is space available in the virtqueue,
and no other process that is sending packets (tx_lock held).
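For illustration only, the three conditions above can be sketched in user space as follows. This is not the kernel code: struct sketch_tx and the sketch_* functions are hypothetical names, a pthread mutex stands in for tx_lock, and plain counters stand in for send_pkt_queue and the virtqueue.

```c
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical user-space stand-ins; none of these names exist in the kernel. */
struct sketch_tx {
	pthread_mutex_t tx_lock; /* plays the role of vsock->tx_lock */
	size_t pending;          /* packets waiting in the intermediate queue */
	size_t ring_free;        /* free slots in the simulated virtqueue */
	size_t ring_sent;        /* packets placed directly on the ring */
};

void sketch_tx_init(struct sketch_tx *tx, size_t ring_slots)
{
	pthread_mutex_init(&tx->tx_lock, NULL);
	tx->pending = 0;
	tx->ring_free = ring_slots;
	tx->ring_sent = 0;
}

/* Returns 0 if the packet was placed on the ring, -EBUSY if another
 * sender holds the lock, -ENOSPC if the ring has no free slot.
 */
int sketch_send_fast_path(struct sketch_tx *tx)
{
	int ret = 0;

	/* Never block here: give up immediately if the lock is taken. */
	if (pthread_mutex_trylock(&tx->tx_lock) != 0)
		return -EBUSY;

	if (tx->ring_free == 0) {
		ret = -ENOSPC;
	} else {
		tx->ring_free--;
		tx->ring_sent++;
	}

	pthread_mutex_unlock(&tx->tx_lock);

	return ret;
}

/* Returns true if the fast path was taken, false if the packet was queued. */
bool sketch_send(struct sketch_tx *tx)
{
	/* Unlocked peek at the queue; acceptable only because this sketch
	 * is single-threaded. Ordering is preserved by never bypassing
	 * packets that are already waiting.
	 */
	if (tx->pending == 0 && sketch_send_fast_path(tx) == 0)
		return true;

	/* Slow path: a worker would drain this queue later. */
	tx->pending++;

	return false;
}
```

With a one-slot ring, the first sketch_send() call takes the fast path, the second falls back to the queue because the ring is full, and every later call is queued as long as the intermediate queue is not empty.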



In the following benchmark (pingpong mode) the host sends

"fio benchmark"

a payload to the guest and waits for the same payload back.

All vCPUs pinned individually to pCPUs.
vhost process pinned to a pCPU
fio process pinned both inside the host and the guest system.

Host CPU: Intel i7-10700KF CPU @ 3.80GHz
Tool: Fio version 3.37-56
Env: Phys host + L1 Guest
Runtime-per-test: 50s
Mode: pingpong (h-g-h)
Test runs: 50
Type: SOCK_STREAM

Before: Linux 6.9.7

Payload 512B:

          1st perc.   overall   99th perc.
Before    370         810.15    8656         ns
After     374         780.29    8741         ns

Payload 4K:

          1st perc.   overall   99th perc.
Before    460         1720.23   42752        ns
After     460         1520.84   36096        ns

The performance improvement is related to this optimization:
I used eBPF to check that each packet was sent directly to the
virtqueue.

Throughput: iperf-vsock

I would reorganize the description a bit because it's a little confusing. For example, like this:

The following benchmarks were run to check improvements in latency and throughput. The test bed is a host with an Intel i7-10700KF CPU @ 3.80GHz and an L1 guest running on QEMU/KVM.

- Latency
Tool: ...

- Throughput
Tool: ...

The size represents the buffer length (-l) to read/write
P represents the number of parallel streams

P=1
          4K      64K     128K
Before    6.87    29.3    29.5    Gb/s
After     10.5    39.4    39.9    Gb/s

P=2
          4K      64K     128K
Before    10.5    32.8    33.2    Gb/s
After     17.8    47.7    48.5    Gb/s

P=4
          4K      64K     128K
Before    12.7    33.6    34.2    Gb/s
After     16.9    48.1    50.5    Gb/s

Wow, great! I'm a little surprised that the latency is not much affected while the throughput benefits so much from this kind of optimization.

Maybe we can check the latency with smaller payloads like 64 bytes or even smaller.


Co-developed-by: Marco Pinna <marco.pinn95@xxxxxxxxx>
Signed-off-by: Marco Pinna <marco.pinn95@xxxxxxxxx>
Signed-off-by: Luigi Leonardi <luigi.leonardi@xxxxxxxxxxx>
---
net/vmw_vsock/virtio_transport.c | 38 ++++++++++++++++++++++++++++++++++----
1 file changed, 34 insertions(+), 4 deletions(-)

diff --git a/net/vmw_vsock/virtio_transport.c b/net/vmw_vsock/virtio_transport.c
index c4205c22f40b..d75727fdc35f 100644
--- a/net/vmw_vsock/virtio_transport.c
+++ b/net/vmw_vsock/virtio_transport.c
@@ -208,6 +208,29 @@ virtio_transport_send_pkt_work(struct work_struct *work)
queue_work(virtio_vsock_workqueue, &vsock->rx_work);
}

+/* Caller need to hold RCU for vsock.
+ * Returns 0 if the packet is successfully put on the vq.
+ */
+static int virtio_transport_send_skb_fast_path(struct virtio_vsock *vsock, struct sk_buff *skb)
+{
+ struct virtqueue *vq = vsock->vqs[VSOCK_VQ_TX];
+ int ret;
+
+ /* Inside RCU, can't sleep! */
+ ret = mutex_trylock(&vsock->tx_lock);
+ if (unlikely(ret == 0))
+ return -EBUSY;
+
+ ret = virtio_transport_send_skb(skb, vq, vsock);
+
+ mutex_unlock(&vsock->tx_lock);
+
+ /* Kick if virtio_transport_send_skb succeeded */

Superfluous comment, we can remove it.

+ if (ret == 0)
+ virtqueue_kick(vq);

nit: I'd add a blank line here after the if block to make it clear that the return is outside it.

+ return ret;
+}
+
static int
virtio_transport_send_pkt(struct sk_buff *skb)
{
@@ -231,11 +254,18 @@ virtio_transport_send_pkt(struct sk_buff *skb)
goto out_rcu;
}

- if (virtio_vsock_skb_reply(skb))
- atomic_inc(&vsock->queued_replies);
+ /* If the workqueue (send_pkt_queue) is empty there is no need to enqueue the packet.

Again, send_pkt_queue is not a workqueue.

Here I would explain in more detail why there is no need; the fact that we are not enqueuing is already clear from the code.

+ * Just put it on the virtqueue using virtio_transport_send_skb_fast_path.
+ */


nit: here I would instead remove the blank line to make it clear that the comment is about the code below.

- virtio_vsock_skb_queue_tail(&vsock->send_pkt_queue, skb);
- queue_work(virtio_vsock_workqueue, &vsock->send_pkt_work);
+ if (!skb_queue_empty_lockless(&vsock->send_pkt_queue) ||
+ virtio_transport_send_skb_fast_path(vsock, skb)) {
+ /* Packet must be queued */

Please include it in the comment before the if, where you can explain the whole logic of the optimization.

+ if (virtio_vsock_skb_reply(skb))
+ atomic_inc(&vsock->queued_replies);

nit: keep the blank line here, as it was before this patch:

if (virtio_vsock_skb_reply(skb))
atomic_inc(&vsock->queued_replies);

virtio_vsock_skb_queue_tail(&vsock->send_pkt_queue, skb);
queue_work(virtio_vsock_workqueue, &vsock->send_pkt_work);


+ virtio_vsock_skb_queue_tail(&vsock->send_pkt_queue, skb);
+ queue_work(virtio_vsock_workqueue, &vsock->send_pkt_work);
+ }

out_rcu:
rcu_read_unlock();

--
2.45.2



I tested the patch and everything seems to be fine. All my comments are minor and stylistic; the code should be fine!

Thanks,
Stefano