[RFC PATCH v1 0/8] virtio/vsock: experimental zerocopy receive

From: Arseniy Krasnov
Date: Thu May 12 2022 - 01:05:15 EST


INTRODUCTION

Hello, this is experimental implementation of virtio vsock zerocopy
receive. It was inspired by TCP zerocopy receive by Eric Dumazet. This API uses
same idea: call 'mmap()' on socket's descriptor, then every 'getsockopt()' will
fill provided vma area with pages of virtio RX buffers. After received data was
processed by user, pages must be freed by 'madvise()' call with MADV_DONTNEED
flag set(if user won't call 'madvise()', next 'getsockopt()' will fail).

DETAILS

Here is how mapping with mapped pages looks exactly: first page mapping
contains array of trimmed virtio vsock packet headers (in contains only length
of data on the corresponding page and 'flags' field):

struct virtio_vsock_usr_hdr {
uint32_t length;
uint32_t flags;
};

Field 'length' allows user to know exact size of payload within each sequence
of pages and 'flags' allows user to handle SOCK_SEQPACKET flags(such as message
bounds or record bounds). All other pages are data pages from RX queue.

Page 0 Page 1 Page N

[ hdr1 .. hdrN ][ data ] .. [ data ]
| | ^ ^
| | | |
| *-------------------*
| |
| |
*----------------*

Of course, single header could represent array of pages (when packet's
buffer is bigger than one page).So here is example of detailed mapping layout
for some set of packages. Lets consider that we have the following sequence of
packages: 56 bytes, 4096 bytes and 8200 bytes. All pages: 0,1,2,3,4 and 5 will
be inserted to user's vma(vma is large enough).

Page 0: [[ hdr0 ][ hdr 1 ][ hdr 2 ][ hdr 3 ] ... ]
Page 1: [ 56 ]
Page 2: [ 4096 ]
Page 3: [ 4096 ]
Page 4: [ 4096 ]
Page 5: [ 8 ]

Page 0 contains only array of headers:
'hdr0' has 56 in length field.
'hdr1' has 4096 in length field.
'hdr2' has 8200 in length field.
'hdr3' has 0 in length field(this is end of data marker).

Page 1 corresponds to 'hdr0' and has only 56 bytes of data.
Page 2 corresponds to 'hdr1' and filled with data.
Page 3 corresponds to 'hdr2' and filled with data.
Page 4 corresponds to 'hdr2' and filled with data.
Page 5 corresponds to 'hdr2' and has only 8 bytes of data.

This patchset also changes packets allocation way: today implementation
uses only 'kmalloc()' to create data buffer. Problem happens when we try to map
such buffers to user's vma - kernel forbids to map slab pages to user's vma(as
pages of "not large" 'kmalloc()' allocations are marked with PageSlab flag and
"not large" could be bigger than one page). So to avoid this, data buffers now
allocated using 'alloc_pages()' call.

TESTS

This patchset updates 'vsock_test' utility: two tests for new feature
were added. First test covers invalid cases. Second checks valid transmission
case.

BENCHMARKING

For benchmakring I've added small utility 'rx_zerocopy'. It works in
client/server mode. When client connects to server, server starts sending exact
amount of data to client(amount is set as input argument).Client reads data and
waits for next portion of it. Client works in two modes: copy and zero-copy. In
copy mode client uses 'read()' call while in zerocopy mode sequence of 'mmap()'
/'getsockopt()'/'madvise()' are used. Smaller amount of time for transmission
is better. For server, we can set size of tx buffer and for client we can set
size of rx buffer or rx mapping size(in zerocopy mode). Usage of this utility
is quiet simple:

For client mode:

./rx_zerocopy --mode client [--zerocopy] [--rx]

For server mode:

./rx_zerocopy --mode server [--mb] [--tx]

[--mb] sets number of megabytes to transfer.
[--rx] sets size of receive buffer/mapping in pages.
[--tx] sets size of transmit buffer in pages.

I checked for transmission of 4000mb of data. Here are some results:

size of rx/tx buffers in pages
*---------------------------------------------------*
| 8 | 32 | 64 | 256 | 512 |
*--------------*--------*----------*---------*----------*----------*
| zerocopy | 24 | 10.6 | 12.2 | 23.6 | 21 | secs to
*--------------*---------------------------------------------------- process
| non-zerocopy | 13 | 16.4 | 24.7 | 27.2 | 23.9 | 4000 mb
*--------------*----------------------------------------------------

I think, that results are not so impressive, but at least it is not worse than
copy mode and there is no need to allocate memory for processing date.

PROBLEMS

Updated packet's allocation logic creates some problem: when host gets
data from guest(in vhost-vsock), it allocates at least one page for each packet
(even if packet has 1 byte payload). I think this could be resolved in several
ways:
1) Make zerocopy rx mode disabled by default, so if user didn't enable
it, current 'kmalloc()' way will be used.
2) Use 'kmalloc()' for "small" packets, else call page allocator. But
in this case, we have mix of packets, allocated in two different ways thus
during zerocopying to user(e.g. mapping pages to vma), such small packets will
be handled in some stupid way: we need to allocate one page for user, copy data
to it and then insert page to user's vma.

P.S: of course this is experimental RFC, so what do You think guys?

Arseniy Krasnov(8)
virtio/vsock: rework packet allocation logic
vhost/vsock: rework packet allocation logic
af_vsock: add zerocopy receive logic
virtio/vsock: add transport zerocopy callback
vhost/vsock: enable zerocopy callback.
virtio/vsock: enable zerocopy callback.
test/vsock: add receive zerocopy tests
test/vsock: vsock rx zerocopy utility

drivers/vhost/vsock.c | 50 ++++-
include/linux/virtio_vsock.h | 4 +
include/net/af_vsock.h | 4 +
include/uapi/linux/virtio_vsock.h | 5 +
include/uapi/linux/vm_sockets.h | 2 +
net/vmw_vsock/af_vsock.c | 61 ++++++
net/vmw_vsock/virtio_transport.c | 1 +
net/vmw_vsock/virtio_transport_common.c | 195 ++++++++++++++++-
tools/include/uapi/linux/virtio_vsock.h | 10 +
tools/include/uapi/linux/vm_sockets.h | 7 +
tools/testing/vsock/Makefile | 1 +
tools/testing/vsock/control.c | 34 +++
tools/testing/vsock/control.h | 2 +
tools/testing/vsock/rx_zerocopy.c | 356 ++++++++++++++++++++++++++++++++
tools/testing/vsock/vsock_test.c | 284 +++++++++++++++++++++++++
15 files changed, 1005 insertions(+), 11 deletions(-)

--
2.25.1