Re: [PATCH RFC 0/4] vsock/virtio: optimizations to increase the throughput

From: Stefano Garzarella
Date: Thu Apr 04 2019 - 12:47:22 EST

In reply to: Michael S. Tsirkin: "Re: [PATCH RFC 0/4] vsock/virtio: optimizations to increase the throughput"
Next in thread: Michael S. Tsirkin: "Re: [PATCH RFC 0/4] vsock/virtio: optimizations to increase the throughput"

On Thu, Apr 04, 2019 at 11:52:46AM -0400, Michael S. Tsirkin wrote:
> I simply love it that you have analysed the individual impact of
> each patch! Great job!

Thanks! I followed Stefan's suggestions!

>
> For comparison's sake, it could be IMHO beneficial to add a column
> with virtio-net+vhost-net performance.
>
> This will both give us an idea about whether the vsock layer introduces
> inefficiencies, and whether the virtio-net idea has merit.
>

Sure, I already ran TCP tests on virtio-net + vhost, starting QEMU
like this:
$ qemu-system-x86_64 ... \
-netdev tap,id=net0,vhost=on,ifname=tap0,script=no,downscript=no \
-device virtio-net-pci,netdev=net0

I also ran a test with TCP_NODELAY (which disables Nagle's algorithm),
just to be fair, because VSOCK doesn't implement anything like the
buffering that this option turns off.
In both cases I set the MTU to the maximum allowed (65520).
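
For reference, a minimal sketch of how a test client could request
TCP_NODELAY on its socket. This assumes a Python test program; the
benchmark tool actually used for these runs is not stated here, and
make_nodelay_socket() is only an illustrative name.

```python
import socket


def make_nodelay_socket() -> socket.socket:
    """Create a TCP socket with Nagle's algorithm disabled."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # TCP_NODELAY = 1: small writes are sent immediately instead of
    # being coalesced by the TCP stack while waiting for ACKs.
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    return sock


if __name__ == "__main__":
    s = make_nodelay_socket()
    # A non-zero value means the option is set on this socket.
    print(s.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY) != 0)
    s.close()
```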

                            VSOCK                       TCP + virtio-net + vhost
                     host -> guest [Gbps]                 host -> guest [Gbps]
pkt_size  before opt.  patch 1  patches 2+3  patch 4    TCP_NODELAY
64              0.060    0.102        0.102    0.096           0.16         0.15
256              0.22     0.40         0.40     0.36           0.32         0.57
512              0.42     0.82         0.85     0.74            1.2          1.2
1K                0.7      1.6          1.6      1.5            2.1          2.1
2K                1.5      3.0          3.1      2.9            3.5          3.4
4K                2.5      5.2          5.3      5.3            5.5          5.3
8K                3.9      8.4          8.6      8.8            8.0          7.9
16K               6.6     11.1         11.3     12.8            9.8         10.2
32K               9.9     15.8         15.8     18.1           11.8         10.7
64K              13.5     17.4         17.7     21.4           11.4         11.3
128K             17.9     19.0         19.0     23.6           11.2         11.0
256K             18.0     19.4         19.8     24.4           11.1         11.0
512K             18.4     19.6         20.1     25.3           10.1         10.7
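
Reading the host -> guest table, the gain over the baseline is just the
ratio of two cells. A minimal sketch, using three rows copied from the
"before opt." and "patch 4" columns above; the helper name speedup() is
only illustrative.

```python
# Gbps values copied from the host -> guest VSOCK columns:
# packet size -> (before opt., patch 4)
rows = {
    "64": (0.060, 0.096),
    "4K": (2.5, 5.3),
    "512K": (18.4, 25.3),
}


def speedup(before: float, after: float) -> float:
    """Throughput ratio of the optimized run over the baseline."""
    return after / before


for size, (before, after) in rows.items():
    print(f"{size:>5}: {speedup(before, after):.2f}x")
```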

For small packet sizes (< 4K) I think we should implement some kind of
batching/merging, which could come for free if we use virtio-net as a
transport.
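
To illustrate what batching/merging of small packets could mean, here is
a rough user-space sketch, not the vsock implementation: small payloads
are accumulated and handed to the transport as one larger buffer. The
class name SmallWriteBatcher, the send callback and the 4096-byte
threshold are all assumptions made for this example.

```python
class SmallWriteBatcher:
    """Hypothetical sketch: coalesce small payloads into larger sends."""

    def __init__(self, send, threshold=4096):
        self._send = send            # callable that takes one bytes object
        self._threshold = threshold  # flush once this many bytes are pending
        self._pending = bytearray()

    def write(self, payload: bytes) -> None:
        if len(payload) >= self._threshold:
            # Large payloads are not merged; flush first to keep ordering.
            self.flush()
            self._send(bytes(payload))
            return
        self._pending += payload
        if len(self._pending) >= self._threshold:
            self.flush()

    def flush(self) -> None:
        if self._pending:
            self._send(bytes(self._pending))
            self._pending.clear()


if __name__ == "__main__":
    sent = []
    batcher = SmallWriteBatcher(sent.append, threshold=4096)
    for _ in range(100):
        batcher.write(b"x" * 64)   # 100 small 64-byte writes
    batcher.flush()
    print([len(chunk) for chunk in sent])
```

A real implementation would also need a timeout-based flush so that a
lone small packet is not delayed indefinitely.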

Note: maybe I have something misconfigured, because TCP on virtio-net
in the host -> guest case doesn't go much beyond 11 Gbps.

                        VSOCK                  TCP + virtio-net + vhost
                guest -> host [Gbps]             guest -> host [Gbps]
pkt_size  before opt.  patch 1  patches 2+3    TCP_NODELAY
64              0.088    0.100        0.101           0.24         0.24
256              0.35     0.36         0.41           0.36         1.03
512              0.70     0.74         0.73           0.69          1.6
1K                1.1      1.3          1.3            1.1          3.0
2K                2.4      2.4          2.6            2.1          5.5
4K                4.3      4.3          4.5            3.8          8.8
8K                7.3      7.4          7.6            6.6         20.0
16K               9.2      9.6         11.1           12.3         29.4
32K               8.3      8.9         18.1           19.3         28.2
64K               8.3      8.9         25.4           20.6         28.7
128K              7.2      8.7         26.7           23.1         27.9
256K              7.7      8.4         24.9           28.5         29.4
512K              7.7      8.5         25.0           28.3         29.3

For guest -> host I think the TCP_NODELAY test is important, because TCP
buffering increases the throughput a lot.

> One other comment: it makes sense to test with disabling smap
> mitigations (boot host and guest with nosmap). No problem with also
> testing the default smap path, but I think you will discover that the
> performance impact of smap hardening being enabled is often severe for
> such benchmarks.

Thanks for this valuable suggestion, I'll redo all the tests with nosmap!

Cheers,
Stefano
