Re: [PATCH v5 00/13] KVM: mm: fd-based approach for supporting KVM guest private memory

From: Quentin Perret
Date: Wed Apr 06 2022 - 11:46:44 EST


On Tuesday 05 Apr 2022 at 10:51:36 (-0700), Andy Lutomirski wrote:
> Let's try actually counting syscalls and mode transitions, at least approximately. For non-direct IO (DMA allocation on guest side, not straight to/from pagecache or similar):
>
> Guest writes to shared DMA buffer. Assume the guest is smart and reuses the buffer.
> Guest writes descriptor to shared virtio ring.
> Guest rings virtio doorbell, which causes an exit.
> *** guest -> hypervisor -> host ***
> host reads virtio ring (mmaped shared memory)
> host does pread() to read the DMA buffer or reads mmapped buffer
> host does the IO
> resume guest
> *** host -> hypervisor -> guest ***
>
> This is essentially optimal in terms of transitions. The data is copied on the guest side (which may well be mandatory depending on what guest userspace did to initiate the IO) and on the host (which may well be mandatory depending on what the host is doing with the data).
>
> Now let's try straight-from-guest-pagecache or otherwise zero-copy on the guest side. Without nondestructive changes, the guest needs a bounce buffer and it looks just like the above. One extra copy, zero extra mode transitions. With nondestructive changes, it's a bit more like physical hardware with an IOMMU:
>
> Guest shares the page.
> *** guest -> hypervisor ***
> Hypervisor adds a PTE. Let's assume we're being very optimal and the host is not synchronously notified.
> *** hypervisor -> guest ***
> Guest writes descriptor to shared virtio ring.
> Guest rings virtio doorbell, which causes an exit.
> *** guest -> hypervisor -> host ***
> host reads virtio ring (mmaped shared memory)
>
> mmap *** syscall ***
> host does the IO
> munmap *** syscall, TLBI ***
>
> resume guest
> *** host -> hypervisor -> guest ***
> Guest unshares the page.
> *** guest -> hypervisor ***
> Hypervisor removes PTE. TLBI.
> *** hypervisor -> guest ***
>
> This is quite expensive. For small IO, pread() or splice() in the host may be a lot faster. Even for large IO, splice() may still win.

Right, that would work nicely for pages that are shared transiently, but
less so for long-term shares. But I guess your proposal below should do
the trick.

> I can imagine clever improvements. First, let's get rid of mmap() + munmap(). Instead use a special device mapping with special semantics, not regular memory. (mmap and munmap are expensive even ignoring any arch and TLB stuff.) The rule is that, if the page is shared, access works, and if private, access doesn't, but it's still mapped. The hypervisor and the host cooperate to make it so.

As long as the page can't be GUP'd I _think_ this shouldn't be a
problem. We can have the hypervisor re-inject the fault in the host. And
the host fault handler will deal with it just fine if the fault was
taken from userspace (inject a SEGV), or from the kernel through uaccess
macros. But we do get into issues if the host kernel can be tricked into
accessing the page via e.g. kmap(). I've been able to trigger this by
strace-ing a userspace process which passes a pointer to private memory
to a syscall. strace will inspect the syscall argument using
process_vm_readv(), which will pin_user_pages_remote() and access the
page via kmap(), and then we're in trouble. But preventing GUP would
prevent this by construction I think?

FWIW memfd_secret() did look like a good solution to this, but it lacks
the bidirectional notifiers with KVM that is offered by this patch
series, which is needed to allow KVM to handle guest faults, and also
offers a good framework to support future extensions (e.g.
hypervisor-assisted page migration, swap, ...). So yes, ideally
pKVM would use a kind of hybrid between memfd_secret and the private fd
proposed here, or something else providing similar properties.

Thanks,
Quentin