Re: [RFC PATCH 14/18] KVM: Add asynchronous userfaults, KVM_READ_USERFAULT

From: Nikita Kalyazin
Date: Mon Jul 29 2024 - 13:18:28 EST


On 26/07/2024 19:00, James Houghton wrote:
> If it would be useful, we could absolutely have a flag to have all
> faults go through the asynchronous mechanism. :) It's meant to just be
> an optimization. For me, it is a necessary optimization.
>
> Userfaultfd doesn't scale particularly well: we have to grab two locks
> to work with the wait_queues. You could create several userfaultfds,
> but the underlying issue is still there. KVM Userfault, if it uses a
> wait_queue for the async fault mechanism, will have the same
> bottleneck. Anish and I worked on making userfaults more scalable for
> KVM[1], and we ended up with a scheme very similar to what we have in
> this KVM Userfault series.

Yes, I see your motivation. Does this approach support async page faults [1]? I.e., would all the guest processes on the vCPU need to stall until a fault is resolved, or is there a way to let the vCPU run and only block the faulting process?

A more general question: it looks like Userfaultfd's main purpose was to support the postcopy use case [2], yet it fails to do that efficiently for large VMs. Would it be ideologically better to try to improve Userfaultfd's performance (similar to what was attempted in [3]), or is that something you have already looked into and found to be a dead end as part of [4]?

[1] https://lore.kernel.org/lkml/4AEFB823.4040607@xxxxxxxxxx/T/
[2] https://lwn.net/Articles/636226/
[3] https://lore.kernel.org/lkml/20230905214235.320571-1-peterx@xxxxxxxxxx/
[4] https://lore.kernel.org/linux-mm/CADrL8HVDB3u2EOhXHCrAgJNLwHkj2Lka1B_kkNb0dNwiWiAN_Q@xxxxxxxxxxxxxx/

> My use case already requires using a reasonably complex API for
> interacting with a separate userland process for fetching memory, and
> it's really fast. I've never tried to hook userfaultfd into this other
> process, but I'm quite certain that [1] + this process's interface
> scale better than userfaultfd does. Perhaps userfaultfd, for
> not-so-scaled-up cases, could be *slightly* faster, but I mostly care
> about what happens when we scale to hundreds of vCPUs.
>
> [1]: https://lore.kernel.org/kvm/20240215235405.368539-1-amoorthy@xxxxxxxxxx/

Do I understand correctly that in your setup, when an EPT violation occurs:
- the VMM shares the fault information with the other process via a userspace protocol
- the process fetches the memory, installs it (?) and notifies the VMM
- the VMM calls KVM_RUN to resume execution?

Would you be OK to share an outline of the API you mentioned?

>> How do you envision resolving faults in userspace? Copying the page in
>> (provided that userspace mapping of guest_memfd is supported [3]) and
>> clearing the KVM_MEMORY_ATTRIBUTE_USERFAULT alone do not look
>> sufficient to resolve the fault because an attempt to copy the page
>> directly in userspace will trigger a fault on its own

> This is not true for KVM Userfault, at least for right now. Userspace
> accesses to guest memory will not trigger KVM Userfaults. (I know this
> name is terrible -- regular old userfaultfd() userfaults will indeed
> get triggered, provided you've set things up properly.)
>
> KVM Userfault is merely meant to catch KVM's own accesses to guest
> memory (including vCPU accesses). For non-guest_memfd memslots,
> userspace can totally just write through the VMA it has made (KVM
> Userfault *cannot*, by virtue of being completely divorced from mm,
> intercept this access). For guest_memfd, userspace could write to
> guest memory through a VMA if that's where guest_memfd is headed, but
> perhaps it will rely on exact details of how userspace is meant to
> populate guest_memfd memory.

True, it isn't the case right now. I think I fast-forwarded to a state where notifications about VMM-triggered faults to the guest_memfd are also sent asynchronously.

> In case it's interesting or useful at all, we actually use
> UFFDIO_CONTINUE for our live migration use case. We mmap() memory
> twice -- one of them we register with userfaultfd and also give to
> KVM. The other one we use to install memory -- our non-faulting view
> of guest memory!

That is interesting. You're replacing UFFDIO_COPY (vma1) with a memcpy (vma2) + UFFDIO_CONTINUE (vma1), IIUC. Are both mappings created by the same process? What benefits does it bring?
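
To make sure we are talking about the same thing, here is a rough sketch of how I read the double-mapping scheme. It is my guess, not your actual code: the struct guest_views type, the guest_views_* helpers and the "guest-mem" name are all made up for illustration, and I am assuming a memfd (shmem) backing and a kernel and headers new enough for userfaultfd minor faults on shmem. The same file is mapped twice; the "guest" view is registered in minor mode (and would be the one handed to KVM), and the "alias" view is only ever used to fill in page contents before UFFDIO_CONTINUE maps them into the guest view.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <linux/userfaultfd.h>
#include <stddef.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical helper type: two views of the same shmem-backed memory. */
struct guest_views {
    int memfd;
    size_t len;
    char *guest; /* view given to KVM; may be registered with userfaultfd */
    char *alias; /* non-faulting view used to install page contents */
};

/* Map one memfd twice. Returns 0 on success, -1 on failure. */
int guest_views_create(struct guest_views *v, size_t len)
{
    assert(len > 0);
    v->len = len;
    v->memfd = memfd_create("guest-mem", MFD_CLOEXEC);
    if (v->memfd < 0)
        return -1;
    if (ftruncate(v->memfd, (off_t)len) < 0)
        return -1;
    v->guest = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, v->memfd, 0);
    v->alias = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, v->memfd, 0);
    if (v->guest == MAP_FAILED || v->alias == MAP_FAILED)
        return -1;
    return 0;
}

/*
 * Register only the guest view for minor faults. Returns the userfaultfd,
 * or -1 if userfaultfd or minor-fault support is unavailable.
 */
int guest_views_register_minor(const struct guest_views *v)
{
    int uffd = (int)syscall(SYS_userfaultfd, O_CLOEXEC | O_NONBLOCK);
    if (uffd < 0)
        return -1;

    struct uffdio_api api = {
        .api = UFFD_API,
        .features = UFFD_FEATURE_MINOR_SHMEM,
    };
    struct uffdio_register reg = {
        .range = { .start = (unsigned long)v->guest, .len = v->len },
        .mode = UFFDIO_REGISTER_MODE_MINOR,
    };
    if (ioctl(uffd, UFFDIO_API, &api) < 0 ||
        ioctl(uffd, UFFDIO_REGISTER, &reg) < 0) {
        close(uffd);
        return -1;
    }
    return uffd;
}

/*
 * Install data at offset off: write through the alias view, which is not
 * registered and so never raises a userfault, then (if uffd >= 0) ask the
 * kernel to map the now-populated page cache pages into the guest view.
 * With a real uffd, off and len must be page-aligned.
 */
int guest_views_install(const struct guest_views *v, int uffd,
                        size_t off, const void *src, size_t len)
{
    memcpy(v->alias + off, src, len);
    if (uffd < 0)
        return 0;

    struct uffdio_continue cont = {
        .range = { .start = (unsigned long)(v->guest + off), .len = len },
        .mode = 0,
    };
    return ioctl(uffd, UFFDIO_CONTINUE, &cont);
}
```

If this matches your setup, a fault handler would read the faulting address from the userfaultfd, fetch the page, and call something like guest_views_install() with the page-aligned offset. Passing uffd = -1 skips the UFFDIO_CONTINUE step and just demonstrates that both views alias the same pages.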