Re: [RFC PATCH 0/9] kvm: implement atomic memslot updates

From: Paolo Bonzini
Date: Wed Sep 28 2022 - 11:08:39 EST


On 9/27/22 17:58, Sean Christopherson wrote:
> On Tue, Sep 27, 2022, Emanuele Giuseppe Esposito wrote:
>> Am 26/09/2022 um 23:28 schrieb Sean Christopherson:
>>> On Mon, Sep 26, 2022, David Hildenbrand wrote:
>>>> As Sean said "This is an awful lot of a complexity to take on for something
>>>> that appears to be solvable in userspace."
>>>
>>> And if the userspace solution is unpalatable for whatever reason, I'd like to
>>> understand exactly what KVM behavior is problematic for userspace. E.g. the
>>> above RHBZ bug should no longer be an issue as the buggy commit has since been
>>> reverted.
>>
>> It still is because I can reproduce the bug, as also pointed out in
>> multiple comments below.
>
> You can reproduce _a_ bug, but it's obviously not the original bug, because the
> last comment says:
>
>   Second, indeed the patch was reverted and somehow accepted without generating
>   too much noise:
>
>   ...
>
>   The underlying issue of course as we both know is still there.
>
>   You might have luck reproducing it with this bug
>
>   https://bugzilla.redhat.com/show_bug.cgi?id=1855298
>
>   But for me it looks like it is 'working' as well, so you might have
>   to write a unit test to trigger the issue.
>
>>> If the issue is KVM doing something nonsensical on a code fetch to MMIO, then I'd
>>> much rather fix _that_ bug and improve KVM's user exit ABI to let userspace handle
>>> the race _if_ userspace chooses not to pause vCPUs.
>>
>> Also on the BZ they all seem (Paolo included) to agree that the issue is
>> non-atomic memslots update.
>
> Yes, non-atomic memslot likely results in the guest fetching from a GPA without a
> memslot. I'm asking for an explanation of exactly what happens when that occurs,
> because it should be possible to adjust KVM and/or QEMU to play nice with the
> fetch, e.g. to resume the guest until the new memslot is installed, in which case
> an atomic update isn't needed.
>
> I assume the issue is that KVM exits with KVM_EXIT_INTERNAL_ERROR because the
> guest is running at CPL=0, and QEMU kills the guest in response. If that's correct,
> then that problem can be solved by exiting to userspace with KVM_EXIT_MMIO instead
> of KVM_EXIT_INTERNAL_ERROR so that userspace can do something sane in response to
> the MMIO code fetch.
>
> I'm pretty sure this patch will Just Work for QEMU, because QEMU simply resumes
> the vCPU if mmio.len==0. It's a bit of a hack, but I don't think it violates KVM's
> ABI in any way, and it can even become "official" behavior since KVM x86 doesn't
> otherwise exit with mmio.len==0.

I think this patch is not a good idea for two reasons:

1) we don't know how existing userspace behaves if mmio.len is zero. Doing nothing and resuming the vCPU is of course reasonable, but an assertion failure is also a valid behavior.

2) more importantly, there is no way to distinguish a failure due to the guest going off into the weeds (in which case KVM_EXIT_INTERNAL_ERROR is fine) from one due to the KVM_SET_USER_MEMORY_REGION race condition. So this would cause a guest that legitimately triggered an internal error to loop forever.

While the former could be handled in a "wait and see" manner, the latter in particular is part of the KVM_RUN contract. Of course it is possible for a guest to just loop forever, but in general all of KVM, QEMU and upper userspace layers want a crashed guest to be detected and stopped forever.
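
To make point 1 concrete, here is a minimal sketch of a userspace run loop (not taken from QEMU; vcpu_run_once() and handle_mmio() are made-up names standing in for a VMM's dispatch code). Both of the reactions shown for a zero-length KVM_EXIT_MMIO are defensible today, which is exactly the ambiguity:

#include <stdlib.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

/* Hypothetical MMIO dispatcher; a real VMM would route this to a device model. */
static void handle_mmio(__u64 gpa, void *data, __u32 len, __u8 is_write)
{
	(void)gpa; (void)data; (void)is_write;
	if (len == 0)
		return;	/* reaction A: a zero-length access is simply a no-op */
	/* ... device emulation would go here ... */
}

static void vcpu_run_once(int vcpu_fd, struct kvm_run *run)
{
	if (ioctl(vcpu_fd, KVM_RUN, 0) < 0)
		return;

	switch (run->exit_reason) {
	case KVM_EXIT_MMIO:
		/*
		 * Reaction B would be to treat a zero-length access as a
		 * malformed exit, e.g. assert(run->mmio.len > 0), and die.
		 */
		handle_mmio(run->mmio.phys_addr, run->mmio.data,
			    run->mmio.len, run->mmio.is_write);
		break;
	case KVM_EXIT_INTERNAL_ERROR:
		/* Today a CPL=0 code fetch from MMIO lands here and the VMM gives up. */
		abort();
	}
}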

Yes, QEMU could loop only if memslot updates are in progress, but honestly all the alternatives I have seen to atomic memslot updates are really *awful*. David's patches even invent a new kind of mutex for which I have absolutely no idea what kind of deadlocks one should worry about and why they should not exist; QEMU's locking is already pretty crappy, and it's certainly not on my wishlist to make it worse!
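
For comparison, the userspace-only alternatives all boil down to something like the sketch below (the names memslot_update_in_progress, begin/end_memslot_update and so on are invented; this is not David's actual patch). Every vCPU thread that hits a suspicious exit has to know whether a memslot transaction is in flight and park itself until it completes, which is precisely the extra cross-thread synchronization I would rather not add:

#include <pthread.h>
#include <stdbool.h>

static pthread_mutex_t memslot_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  memslot_cond = PTHREAD_COND_INITIALIZER;
static bool memslot_update_in_progress;

/* Called by a vCPU thread on a suspicious exit (e.g. zero-length MMIO or
 * emulation failure) before deciding whether to retry or report an error. */
static void wait_for_memslot_update(void)
{
	pthread_mutex_lock(&memslot_lock);
	while (memslot_update_in_progress)
		pthread_cond_wait(&memslot_cond, &memslot_lock);
	pthread_mutex_unlock(&memslot_lock);
}

/* Called by the thread doing the delete+recreate sequence of
 * KVM_SET_USER_MEMORY_REGION calls. */
static void begin_memslot_update(void)
{
	pthread_mutex_lock(&memslot_lock);
	memslot_update_in_progress = true;
	pthread_mutex_unlock(&memslot_lock);
}

static void end_memslot_update(void)
{
	pthread_mutex_lock(&memslot_lock);
	memslot_update_in_progress = false;
	pthread_cond_broadcast(&memslot_cond);
	pthread_mutex_unlock(&memslot_lock);
}

And even that only helps if every code path that can observe the transient missing-memslot state remembers to call wait_for_memslot_update().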

This is clearly a deficiency in the KVM kernel API, and (thanks to SRCU) the kernel is the only place where you can have a *good* fix. It should have been fixed years ago.
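
For reference, the shape of fix being argued for is a batched update along these lines (illustrative only; the struct and ioctl names are invented, the real uAPI is whatever this RFC series ends up proposing). Userspace submits the whole delete+create transaction at once, and the kernel publishes it with a single RCU-style swap of the memslot array, so no vCPU ever observes the intermediate state:

#include <linux/types.h>

/* Illustrative only -- invented names, not the uAPI from this series. */
struct memslot_txn_entry {
	__u32 slot;
	__u32 flags;
	__u64 guest_phys_addr;
	__u64 memory_size;	/* 0 deletes the slot */
	__u64 userspace_addr;
};

struct memslot_txn {
	__u32 nent;
	__u32 padding;
	struct memslot_txn_entry entries[];
};

/*
 * Usage sketch (hypothetical ioctl name):
 *	ioctl(vm_fd, KVM_SET_MEMSLOT_TXN, txn);
 *
 * Inside KVM the whole batch would be applied to a copy of the memslot
 * array and made visible with one rcu_assign_pointer() followed by
 * synchronize_srcu(), which is what makes the update atomic from the
 * vCPUs' point of view.
 */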

Paolo