Re: [PATCH 0/5] sys_ringbuffer

From: Stefan Hajnoczi
Date: Thu Jun 06 2024 - 21:50:14 EST


On Sun, Jun 02, 2024 at 08:32:57PM -0400, Kent Overstreet wrote:
> New syscall for mapping generic ringbuffers for arbitrary (supported)
> file descriptors.
>
> Ringbuffers can be created either when requested or at file open time,
> and can be mapped into multiple address spaces (naturally, since files
> can be shared as well).
>
> Initial motivation is for fuse, but I plan on adding support to pipes
> and possibly sockets as well - pipes are a particularly interesting use
> case, because if both the sender and receiver of a pipe opt in to the
> new ringbuffer interface, we can make them the _same_ ringbuffer for
> true zero copy IO, while being backwards compatible with existing pipes.

Hi Kent,
I recently came across a similar use case where the ability to "upgrade"
an fd into a more efficient interface would be useful, much like the
pipe scenario you describe.

My use case is when you have a block device using the ublk driver. ublk
lets userspace servers implement block devices. ublk is great when
compatibility is required with applications that expect block device
fds, but when an application is willing to implement a shared memory
interface to communicate directly with the ublk server, going through a
block device is inefficient.

In my case the application is QEMU, where the virtual machine runs a
virtio-blk driver that could talk directly to the ublk server via
vhost-user-blk. vhost-user-blk is a protocol that allows the virtual
machine to talk directly to the ublk server via shared memory without
going through QEMU or the host kernel block layer.

QEMU would need a way to upgrade from a ublk block device file to a
vhost-user socket. Just like in your pipe example, this approach relies
on being able to go from a "compatibility" fd to a more efficient
interface gracefully when both sides support this feature.

The generic ringbuffer approach in this series would not work for
the vhost-user protocol because the client must be able to provide its
own memory, and file descriptor passing is needed in general. The
protocol spec is here:
https://gitlab.com/qemu-project/qemu/-/blob/master/docs/interop/vhost-user.rst

A different way to approach the fd upgrading problem is to treat this as
an AF_UNIX connectivity feature rather than a new ring buffer API.
Imagine adding a new address type to AF_UNIX for looking up connections
in a struct file (e.g. the pipe fd) instead of on the file system (or
the other AF_UNIX address types).

The first program creates the pipe and also an AF_UNIX socket. It calls
bind(2) on the socket with the sockaddr_un path
"/dev/self/fd/<fd>/<discriminator>" where fd is a pipe fd and
discriminator is a string like "ring-buffer" that describes the
service/protocol. The AF_UNIX kernel code parses this special path and
stores an association with the pipe file for future connect(2) calls.
The program listens on the AF_UNIX socket and then continues doing its
stuff.
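
To make the listening side concrete, here is a rough sketch. Everything
about the special path is an assumption: no kernel parses
"/dev/self/fd/<fd>/<discriminator>" today, so bind(2) simply fails on
current kernels. The helper names upgrade_addr_init() and
upgrade_listen() are made up for illustration.

```c
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

/*
 * Hypothetical helper: fill addr with the proposed special path
 * "/dev/self/fd/<fd>/<discriminator>".
 * Returns 0 on success, -1 if the path does not fit in sun_path.
 */
static int upgrade_addr_init(struct sockaddr_un *addr, int fd,
                             const char *discriminator)
{
    int n;

    memset(addr, 0, sizeof(*addr));
    addr->sun_family = AF_UNIX;
    n = snprintf(addr->sun_path, sizeof(addr->sun_path),
                 "/dev/self/fd/%d/%s", fd, discriminator);
    if (n < 0 || (size_t)n >= sizeof(addr->sun_path))
        return -1;
    return 0;
}

/*
 * Hypothetical helper: create an AF_UNIX socket that listens for
 * upgrade requests associated with pipe_fd.
 * Returns the listening socket, or -1 on failure. On a kernel without
 * the proposed AF_UNIX extension, bind(2) fails and -1 is returned.
 */
static int upgrade_listen(int pipe_fd, const char *discriminator)
{
    struct sockaddr_un addr;
    int sock;

    if (upgrade_addr_init(&addr, pipe_fd, discriminator) < 0)
        return -1;

    sock = socket(AF_UNIX, SOCK_STREAM, 0);
    if (sock < 0)
        return -1;

    if (bind(sock, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
        listen(sock, 1) < 0) {
        close(sock);
        return -1;
    }
    return sock;
}
```

The first program would call upgrade_listen(pipefd[1], "ring-buffer")
after pipe(2) and keep using the pipe as usual if it returns -1.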

The second program runs and inherits the pipe fd on stdin. It creates an
AF_UNIX socket and attempts to connect(2) to
"/dev/self/fd/0/ring-buffer". The AF_UNIX kernel code parses this
special path and establishes a connection between the connecting and
listening sockets inside the pipe fd's struct file. If connect(2) fails
then the second program knows that this is an ordinary pipe that does
not support upgrading to ring buffer operation.
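
The connecting side could look roughly like this. Again the special
path is only the proposal from this mail, and upgrade_connect() is a
made-up name. On current kernels the connect(2) call fails, which is
exactly the "ordinary pipe" fallback case.

```c
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

/*
 * Hypothetical helper: try to upgrade fd by connecting to the proposed
 * special path "/dev/self/fd/<fd>/<discriminator>".
 * Returns a connected AF_UNIX socket on success. Returns -1 if the
 * upgrade is not available, in which case the caller keeps using
 * read(2)/write(2) on fd as before.
 */
static int upgrade_connect(int fd, const char *discriminator)
{
    struct sockaddr_un addr;
    int n, sock;

    memset(&addr, 0, sizeof(addr));
    addr.sun_family = AF_UNIX;
    n = snprintf(addr.sun_path, sizeof(addr.sun_path),
                 "/dev/self/fd/%d/%s", fd, discriminator);
    if (n < 0 || (size_t)n >= sizeof(addr.sun_path))
        return -1;

    sock = socket(AF_UNIX, SOCK_STREAM, 0);
    if (sock < 0)
        return -1;

    if (connect(sock, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
        /* No listener registered on this file: ordinary pipe. */
        close(sock);
        return -1;
    }
    return sock;
}
```

The second program would call upgrade_connect(STDIN_FILENO,
"ring-buffer") once at startup and fall back to plain pipe I/O when it
returns -1.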

Now the AF_UNIX socket can be used to pass shared memory for the ring
buffer and futexes. This AF_UNIX approach also works for my ublk block
device to vhost-user-blk upgrade use case. It does not require a new
ring buffer API but instead involves extending AF_UNIX.
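
For reference, passing a file descriptor (for example one backing the
shared ring buffer memory) over a connected AF_UNIX socket already works
today with SCM_RIGHTS. A minimal sketch follows. The helper names
send_fd() and recv_fd() are mine, and how the ring itself is laid out in
the shared memory is left open.

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <unistd.h>

/* Send one file descriptor over a connected AF_UNIX socket.
 * Returns 0 on success, -1 on failure. */
static int send_fd(int sock, int fd)
{
    char byte = 0; /* at least one byte of real data must be sent */
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union {
        char buf[CMSG_SPACE(sizeof(int))];
        struct cmsghdr align; /* forces correct alignment of buf */
    } u;
    struct msghdr msg = {
        .msg_iov = &iov,
        .msg_iovlen = 1,
        .msg_control = u.buf,
        .msg_controllen = sizeof(u.buf),
    };
    struct cmsghdr *cmsg;

    memset(&u, 0, sizeof(u));
    cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

    return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Receive one file descriptor sent with send_fd().
 * Returns the new fd, or -1 on failure. */
static int recv_fd(int sock)
{
    char byte;
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union {
        char buf[CMSG_SPACE(sizeof(int))];
        struct cmsghdr align;
    } u;
    struct msghdr msg = {
        .msg_iov = &iov,
        .msg_iovlen = 1,
        .msg_control = u.buf,
        .msg_controllen = sizeof(u.buf),
    };
    struct cmsghdr *cmsg;
    int fd;

    if (recvmsg(sock, &msg, 0) != 1)
        return -1;

    cmsg = CMSG_FIRSTHDR(&msg);
    if (!cmsg || cmsg->cmsg_level != SOL_SOCKET ||
        cmsg->cmsg_type != SCM_RIGHTS ||
        cmsg->cmsg_len != CMSG_LEN(sizeof(int)))
        return -1;

    memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
    return fd;
}
```

Once both sides hold an fd for the same memory they can mmap(2) it and
synchronize with futexes, without any further involvement from the
socket.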

You have more use cases than just the pipe scenario, and my half-baked
idea may not cover all of them, but I wanted to see what you think.

Stefan

> the ringbuffer_wait and ringbuffer_wakeup syscalls are probably going
> away in a future iteration, in favor of just using futexes.
>
> In my testing, reading/writing from the ringbuffer 16 bytes at a time is
> ~7x faster than using read/write syscalls - and I was testing with
> mitigations off, real world benefit will be even higher.
>
> Kent Overstreet (5):
> darray: lift from bcachefs
> darray: Fix darray_for_each_reverse() when darray is empty
> fs: sys_ringbuffer
> ringbuffer: Test device
> ringbuffer: Userspace test helper
>
> MAINTAINERS | 7 +
> arch/x86/entry/syscalls/syscall_32.tbl | 3 +
> arch/x86/entry/syscalls/syscall_64.tbl | 3 +
> fs/Makefile | 2 +
> fs/bcachefs/Makefile | 1 -
> fs/bcachefs/btree_types.h | 2 +-
> fs/bcachefs/btree_update.c | 2 +
> fs/bcachefs/btree_write_buffer_types.h | 2 +-
> fs/bcachefs/fsck.c | 2 +-
> fs/bcachefs/journal_io.h | 2 +-
> fs/bcachefs/journal_sb.c | 2 +-
> fs/bcachefs/sb-downgrade.c | 3 +-
> fs/bcachefs/sb-errors_types.h | 2 +-
> fs/bcachefs/sb-members.h | 3 +-
> fs/bcachefs/subvolume.h | 1 -
> fs/bcachefs/subvolume_types.h | 2 +-
> fs/bcachefs/thread_with_file_types.h | 2 +-
> fs/bcachefs/util.h | 28 +-
> fs/ringbuffer.c | 474 ++++++++++++++++++++++++
> fs/ringbuffer_test.c | 209 +++++++++++
> {fs/bcachefs => include/linux}/darray.h | 61 +--
> include/linux/darray_types.h | 22 ++
> include/linux/fs.h | 2 +
> include/linux/mm_types.h | 4 +
> include/linux/ringbuffer_sys.h | 18 +
> include/uapi/linux/futex.h | 1 +
> include/uapi/linux/ringbuffer_sys.h | 40 ++
> init/Kconfig | 9 +
> kernel/fork.c | 2 +
> lib/Kconfig.debug | 5 +
> lib/Makefile | 2 +-
> {fs/bcachefs => lib}/darray.c | 12 +-
> tools/ringbuffer/Makefile | 3 +
> tools/ringbuffer/ringbuffer-test.c | 254 +++++++++++++
> 34 files changed, 1125 insertions(+), 62 deletions(-)
> create mode 100644 fs/ringbuffer.c
> create mode 100644 fs/ringbuffer_test.c
> rename {fs/bcachefs => include/linux}/darray.h (63%)
> create mode 100644 include/linux/darray_types.h
> create mode 100644 include/linux/ringbuffer_sys.h
> create mode 100644 include/uapi/linux/ringbuffer_sys.h
> rename {fs/bcachefs => lib}/darray.c (56%)
> create mode 100644 tools/ringbuffer/Makefile
> create mode 100644 tools/ringbuffer/ringbuffer-test.c
>
> --
> 2.45.1
>
