On Thu, Feb 20, 2025, Nikita Kalyazin wrote:
On 19/02/2025 15:17, Sean Christopherson wrote:
On Wed, Feb 12, 2025, Nikita Kalyazin wrote:
The conundrum with userspace async #PF is that if userspace is given only a single
bit per gfn to force an exit, then KVM won't be able to differentiate between
"faults" that will be handled synchronously by the vCPU task, and faults that
userspace will hand off to an I/O task. If the fault is handled synchronously,
KVM will needlessly inject a not-present #PF and a present IRQ.
Right, but from the guest's point of view, async PF means "it will probably
take a while for the host to get the page, so I may consider doing something
else in the meantime (i.e. schedule another process if available)".
Except in this case, the guest never gets a chance to run, i.e. it can't do
something else. From the guest point of view, if KVM doesn't inject what is
effectively a spurious async #PF, the VM-Exiting instruction simply took a (really)
long time to execute.
If we are exiting to userspace, it isn't going to be quick anyway, so we can
consider all such faults "long" and warranting the execution of the async PF
protocol. So always injecting a not-present #PF and page ready IRQ doesn't
look too wrong in that case.
There is no "wrong", it's simply wasteful. The fact that the userspace exit is
"long" is completely irrelevant. Decompressing zswap is also slow, but it is
done on the current CPU, i.e. is not background I/O, and so doesn't trigger async
#PFs.
In the guest, if host userspace resolves the fault before redoing KVM_RUN, the
vCPU will get two events back-to-back: an async #PF, and an IRQ signalling completion
of that #PF.
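
To make the back-to-back case concrete, here is a toy model of what the guest observes for a single userfault. It is only a sketch of the reasoning above, not KVM code: the `Handling` enum, the `guest_events()` helper and the event strings are all made up for illustration.

```python
from enum import Enum


class Handling(Enum):
    # How host userspace deals with the fault (hypothetical labels).
    SYNC = "resolved by the vCPU task before redoing KVM_RUN"
    ASYNC = "handed off to a separate I/O task"


def guest_events(handling, always_inject):
    """Return the events the guest observes for one userfault (toy model)."""
    events = []
    if handling is Handling.ASYNC or always_inject:
        events.append("async #PF (page not present)")
        if handling is Handling.ASYNC:
            # Only here does the guest actually get to run while the page
            # is being fetched in the background.
            events.append("guest runs other work")
        events.append("page-ready IRQ")
    events.append("faulting instruction completes")
    return events


# Synchronous handling with unconditional injection: two events arrive
# back-to-back and the guest never gets to do anything in between.
print(guest_events(Handling.SYNC, always_inject=True))
# Synchronous handling without injection: the instruction just took long.
print(guest_events(Handling.SYNC, always_inject=False))
```

With a single bit per gfn, KVM cannot know which `Handling` value applies, so it has to pick one `always_inject` policy for both cases.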
What advantage can you see in it over exiting to userspace (which already exists
in James's series)?
It doesn't exit to userspace :-)
If userspace simply wakes a different task in response to the exit, then KVM
should be able to wake said task, e.g. by signalling an eventfd, and resume the
guest much faster than if the vCPU task needs to roundtrip to userspace. Whether
or not such an optimization is worth the complexity is an entirely different
question though.
This reminds me of the discussion about VMA-less UFFD that has come up
several times, such as [1], but AFAIK hasn't materialised into something
actionable. I may be wrong, but James was looking into that and couldn't
figure out a way to scale it sufficiently for his use case and had to stick
with the VM-exit-based approach. Can you see a world where VM-exit
userfaults coexist with a no-VM-exit way of handling async PFs?
The issue with UFFD is that it's difficult to provide a generic "point of contact",
whereas with KVM userfault, signalling can be tied to the vCPU, and KVM can provide
per-vCPU buffers/structures to aid communication.
That said, supporting "exitless" KVM userfault would most definitely be premature
optimization without strong evidence it would benefit a real world use case.