Re: [PATCH 0/2] Expose KVM API to Linux Kernel

From: Maxim Levitsky
Date: Mon May 18 2020 - 08:12:19 EST


On Mon, 2020-05-18 at 13:51 +0200, Paolo Bonzini wrote:
> On 18/05/20 13:34, Maxim Levitsky wrote:
> > > In high-performance configurations, most of the time virtio devices are
> > > processed in another thread that polls on the virtio rings. In this
> > > setup, the rings are configured to not cause a vmexit at all; this has
> > > much smaller latency than even a lightweight (kernel-only) vmexit,
> > > basically corresponding to writing an L1 cache line back to L2.
> >
> > IMHO this could be used to run kernel drivers inside a very thin VM, to break
> > the stigma that a kernel driver is always a bad thing and should by all means
> > be replaced by a userspace driver, something I see a lot lately and the very
> > ground on which my nvme-mdev proposal was rejected.
>
> It's a tough design decision between speeding up a kernel driver with
> something like eBPF and moving everything to userspace.
>
> Networking has moved more towards the former because there are many more
> opportunities for NIC-based acceleration, while storage has moved
> towards the latter with things such as io_uring. That said, I don't see
> why in-kernel NVMeoF drivers would be acceptable for anything but Fibre
> Channel (and that's only because FC HBAs try hard to hide most of the
> SAN layers).
>
> Paolo
>
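
For context, the no-vmexit polling setup described at the top of the quote
boils down to roughly this on the backend side. This is an illustrative
sketch only: the ring layout and VRING_USED_F_NO_NOTIFY come from the uapi
<linux/virtio_ring.h>, process_descriptor() is a hypothetical helper, and
memory barriers and endianness handling are omitted.

#include <linux/virtio_ring.h>
#include <stdint.h>

/* Hypothetical request handler: walk the descriptor chain at 'head',
 * do the I/O, then place 'head' into the used ring and bump used->idx. */
static void process_descriptor(struct vring *vr, uint16_t head)
{
	(void)vr;
	(void)head;
}

static void poll_virtqueue(struct vring *vr, volatile int *stop)
{
	uint16_t last_seen = 0;

	/* Tell the guest driver not to notify us: no kick, no vmexit. */
	vr->used->flags |= VRING_USED_F_NO_NOTIFY;

	while (!*stop) {
		/* New requests show up as an advancing avail index. */
		while (last_seen != vr->avail->idx) {
			uint16_t head =
				vr->avail->ring[last_seen % vr->num];

			process_descriptor(vr, head);
			last_seen++;
		}
	}
}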

Note that these days storage is as fast as, or even faster than, many types of
networking, and that there are also opportunities for acceleration (such as
peer-to-peer PCI DMA) that are more natural to do in the kernel.

IMHO io_uring is actually not about moving everything to userspace, but rather
the opposite: it allows userspace to access the kernel block subsystem in a
very efficient way, which is the right thing to do.
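
To make that concrete, here is a minimal liburing sketch (illustrative only,
nothing from this patch set; the file path is just a placeholder) that submits
a single read and reaps its completion through the shared rings:

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
#include <liburing.h>

int main(void)
{
	struct io_uring ring;
	struct io_uring_sqe *sqe;
	struct io_uring_cqe *cqe;
	char buf[4096];
	int fd, ret;

	fd = open("/tmp/testfile", O_RDONLY);	/* placeholder path */
	if (fd < 0) {
		perror("open");
		return 1;
	}

	if (io_uring_queue_init(8, &ring, 0) < 0) {
		perror("io_uring_queue_init");
		return 1;
	}

	/* Describe the I/O in a submission queue entry ... */
	sqe = io_uring_get_sqe(&ring);
	if (!sqe)
		return 1;
	io_uring_prep_read(sqe, fd, buf, sizeof(buf), 0);

	/* ... hand it to the kernel block layer ... */
	io_uring_submit(&ring);

	/* ... and pick up the result from the completion queue. */
	ret = io_uring_wait_cqe(&ring, &cqe);
	if (ret == 0) {
		printf("read returned %d\n", cqe->res);
		io_uring_cqe_seen(&ring, cqe);
	}

	io_uring_queue_exit(&ring);
	close(fd);
	return 0;
}

In practice the same submit/wait pair covers a whole batch of queued requests,
which is where the efficiency comes from.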

Sadly it doesn't help much with fast NVMe virtualization, because the
bottleneck moves to the communication with the guest.

I guess this is getting off-topic, so I won't continue this discussion here;
I just wanted to voice my opinion on this matter.

Another thing that comes to my mind (not that it has to be done in the kernel)
is that AMD's AVIC allows peer-to-peer interrupts between guests. In theory
that would allow running a 'driver' in a special guest and letting it
communicate with a normal guest using interrupts in both directions, which
could finally remove the need to waste a core in a busy-wait loop.

The only catch is that the 'special guest' has to run 100% of the time, so it
still can't share a core with other kernel/userspace tasks; but at least it
can be in a sleeping state most of the time, and it can itself run various
tasks that serve various needs.

In other words, I don't have any objection to allowing part of the host kernel
to run in VMX/SVM guest mode. This could be a very interesting thing.

Best regards,
Maxim Levitsky