Re: [RFC PATCH 0/9] kvm: implement atomic memslot updates

From: David Hildenbrand
Date: Fri Sep 23 2022 - 09:22:03 EST


On 23.09.22 15:10, Emanuele Giuseppe Esposito wrote:


Am 19/09/2022 um 19:30 schrieb David Hildenbrand:
On 19.09.22 09:53, David Hildenbrand wrote:
On 18.09.22 18:13, Emanuele Giuseppe Esposito wrote:


Am 09/09/2022 um 16:30 schrieb Sean Christopherson:
On Fri, Sep 09, 2022, Emanuele Giuseppe Esposito wrote:
KVM is currently capable of receiving a single memslot update through
the KVM_SET_USER_MEMORY_REGION ioctl.
The problem arises when we want to atomically perform multiple
updates,
so that readers of memslot active list avoid seeing incomplete states.

For example, in RHBZ
https://bugzilla.redhat.com/show_bug.cgi?id=1979276

I don't have access.  Can you provide a TL;DR?

You should be able to have access to it now.


we see how non atomic updates cause boot failure, because vcpus
will se a partial update (old memslot delete, new one not yet created)
and will crash.

Why not simply pause vCPUs in this scenario?  This is an awful lot
of a complexity
to take on for something that appears to be solvable in userspace.


I think it is not that easy to solve in userspace: see
https://lore.kernel.org/qemu-devel/20200312161217.3590-1-david@xxxxxxxxxx/



"Using pause_all_vcpus()/resume_all_vcpus() is not possible, as it will
temporarily drop the BQL - something most callers can't handle (esp.
when called from vcpu context e.g., in virtio code)."

Can you please comment on the bigger picture? The patch from me works
around *exactly that*, and for that reason, contains that comment.


FWIW, I hacked up my RFC to perform atomic updates on any memslot
transactions (not just resizes) where ranges do add overlap with ranges
to remove.

https://github.com/davidhildenbrand/qemu/tree/memslot


I only performed simple boot check under x86-64 (where I can see region
resizes) and some make checks -- pretty sure it has some rough edges;
but should indicate what's possible and what the possible price might
be. [one could wire up a new KVM ioctl and call it conditionally on
support if really required]


A benefit of my ioctl implementation is that could be also used by other
hypervisors, which then do not need to share this kind of "hack".
However, after also talking with Maxim and Paolo, we all agreed that the
main disadvantage of your approach is that is not scalable with the
number of vcpus. It is also inferior to stop *all* vcpus just to allow a
memslot update (KVM only pauses vCPUs that access the modified memslots
instead).

So I took some measurements, to see what is the performance difference
between my implementation and yours. I used a machine where I could
replicate the bug mentioned in bugzilla, an AMD EPYC 7413 24-Core
Processor with kernel 5.19.0 (+ my patches).

Then I measured the time it takes that QEMU spends in kvm_commit (ie in
memslot updates) while booting a VM. In other words, if kvm_commit takes
10 ms and QEMU calls it 20 times, "time to boot" is 200ms. kvm_commit is
not called anymore after boot, so this measurement is easy to compare
over multiple invocations of QEMU.

I ran the tests with different amount of cores: 1,2,4,8,16,32. QEMU
command is the same to replicate the bug:
./qemu-system-x86_64 --overcommit cpu-pm=on --smp $v --accel kvm
--display none >> ~/smp_$v;

Each boot is reproduced 100 times, and then from results I measure
average and stddev (in milliseconds).

ioctl:
-smp 1: Average: 2.1ms Stdev: 0.8ms
-smp 2: Average: 2.5ms Stdev: 1.5ms
-smp 4: Average: 2.2ms Stdev: 1.1ms
-smp 8: Average: 2.4ms Stdev: 0.7ms
-smp 16: Average: 3.6ms Stdev: 2.4ms (1000 repetitions)
-smp 24: Average: 12.5ms Stdev: 0.9ms (1000 repetitions)


pause/resume: (https://github.com/davidhildenbrand/qemu/tree/memslot)
-smp 1: Average: 2.2ms Stdev: 1.2ms
-smp 2: Average: 3.0ms Stdev: 1.4ms
-smp 4: Average: 3.1ms Stdev: 1.3m
-smp 8: Average: 3.4ms Stdev: 1.4ms
-smp 16: Average: 12ms Stdev: 7.0ms (1000 repetitions)
-smp 24: Average: 20ms Stdev: 7.3ms (1000 repetitions)


Above 24 vCPUs performance gets worse quickly but I think it's already
quite clear that the results for ioctl scale better as the number of
vcpus increases, while pausing the vCPUs becomes slower already with 16
vcpus.

Right, the question is if it happens sufficiently enough that we even care and if there are not ways to mitigate.

It doesn't necessarily have to scale with the #VCPUs I think. What should dominate the overall time in theory how long it takes for one VCPU (the slowest one) to leave the kernel.

I wondered if

1) it might be easier to have a single KVM mechanism/call to kick all VCPUs out of KVM instead of doing it per VCPU. That might speed up things eventually heavily already.

2) One thing I wondered is whether the biggest overhead is actually taking the locks in QEMU and not actually waiting for the VCPUs. Maybe we could optimize that as well. (for now I use one lock per VCPU because it felt like it would reduce the ioctl overhead; maybe there is a better alternative to balance between both users)

So treat my patch as a completely unoptimized version.

--
Thanks,

David / dhildenb