Re: [PATCH v4 4/4] KVM: Avoid synchronize_srcu() in kvm_io_bus_register_dev()
From: Nikita Kalyazin
Date: Mon Feb 16 2026 - 12:54:14 EST
On 13/02/2026 23:20, Sean Christopherson wrote:
On Fri, Feb 13, 2026, Nikita Kalyazin wrote:
On 09/09/2025 11:00, Keir Fraser wrote:
Device MMIO registration may happen quite frequently during VM boot,
and the SRCU synchronization each time has a measurable effect
on VM startup time. In our experiments it can account for around 25%
of a VM's startup time.
Replace the synchronization with a deferred free of the old kvm_io_bus
structure.
Hi,
We noticed that this change introduced a regression of ~20 ms to the first
KVM_CREATE_VCPU call of a VM, which is significant for our use case.
Before the patch:
45726 14:45:32.914330 ioctl(25, KVM_CREATE_VCPU, 0) = 28 <0.000137>
45726 14:45:32.914533 ioctl(25, KVM_CREATE_VCPU, 1) = 30 <0.000046>
After the patch:
30295 14:47:08.057412 ioctl(25, KVM_CREATE_VCPU, 0) = 28 <0.025182>
30295 14:47:08.082663 ioctl(25, KVM_CREATE_VCPU, 1) = 30 <0.000031>
As I understand it, the reason is that call_srcu(), called from
kvm_io_bus_register_dev(), queues callbacks to be invoked after a normal
GP, which is 10 ms with HZ=100. The subsequent synchronize_srcu_expedited()
called from kvm_swap_active_memslots() (from KVM_CREATE_VCPU) has to wait
for the normal GP to complete before making progress. I don't fully
understand why the delay is consistently greater than 1 GP, but that's what
we see across our testing scenarios.
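To make the hypothesis above concrete, here is a toy userspace model of it. It is only a sketch: the toy_* names and the single "GP end" timestamp are made up for illustration, and real SRCU is considerably more involved. It assumes an expedited synchronization cannot complete before an already started normal GP has ended.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Toy model only (hypothetical names): remembers when the most recently
 * started normal grace period is expected to end.
 */
struct toy_srcu {
	uint64_t normal_gp_end_us;	/* 0: no normal GP has been started */
};

/*
 * Stand-in for call_srcu(): starts a normal GP of assumed length
 * gp_len_us (e.g. 10000 for HZ=100) unless one is still in flight.
 */
static void toy_call_srcu(struct toy_srcu *s, uint64_t now_us,
			  uint64_t gp_len_us)
{
	if (s->normal_gp_end_us <= now_us)
		s->normal_gp_end_us = now_us + gp_len_us;
}

/*
 * Stand-in for synchronize_srcu_expedited(): returns how long the caller
 * stalls behind an in-flight normal GP. The expedited GP itself is
 * treated as free.
 */
static uint64_t toy_sync_expedited_stall_us(const struct toy_srcu *s,
					    uint64_t now_us)
{
	return s->normal_gp_end_us > now_us ?
	       s->normal_gp_end_us - now_us : 0;
}
```

In this model, a toy_call_srcu() at t=0 with a 10000 us GP makes a synchronization at t=200 us stall for 9800 us, while the same synchronization with nothing queued does not stall at all. The model does not explain why the observed delay exceeds one GP.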
I verified that the problem is mitigated if the GP is shortened by configuring
HZ=1000. In that case, the regression is on the order of 1 ms.
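The arithmetic behind these numbers can be sketched as follows, assuming (as stated above) that a normal GP is roughly one tick long. The helper names are hypothetical; the durations are the ones from the strace output quoted earlier.

```c
#include <assert.h>

/* Length of one tick in microseconds for a given HZ value. */
static unsigned long tick_period_us(unsigned long hz)
{
	return 1000000UL / hz;
}

/*
 * Added latency of the first KVM_CREATE_VCPU call, taken from the strace
 * durations above (0.000137 s before the patch, 0.025182 s after).
 */
static unsigned long first_vcpu_regression_us(void)
{
	const unsigned long before_us = 137;
	const unsigned long after_us = 25182;

	return after_us - before_us;
}
```

tick_period_us() gives 10000 us for HZ=100 and 1000 us for HZ=1000. For this particular sample, first_vcpu_regression_us() is 25045 us, i.e. about 2.5 ticks at HZ=100, which is in line with the delay being greater than 1 GP.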
It looks like in our case we don't benefit much from the intended
optimisation as the number of device MMIO registrations is limited and
they don't cost us much (each takes at most 16 us, but most commonly ~6 us):
Maybe differences in platforms for arm64 vs x86?
Tested on ARM, and indeed the kvm_io_bus_register_dev() calls occur after KVM_CREATE_VCPU there, and the patch produces a visible optimisation:
Without the patch (15-23 us per call):
firecracker 19916 [033] 404.518430: probe:kvm_vm_ioctl_create_vcpu: (ffff800080059b18)
firecracker 19916 [033] 404.518446: probe:kvm_vm_ioctl_create_vcpu: (ffff800080059b18)
firecracker 19916 [033] 404.518462: probe:kvm_io_bus_register_dev: (ffff80008005f0e8)
firecracker 19916 [032] 404.518495: probe:kvm_io_bus_register_dev__return: (ffff80008005f0e8 <- ffff8000800a198c)
firecracker 19916 [032] 404.518498: probe:kvm_io_bus_register_dev: (ffff80008005f0e8)
firecracker 19916 [033] 404.518521: probe:kvm_io_bus_register_dev__return: (ffff80008005f0e8 <- ffff8000800a198c)
firecracker 19916 [033] 404.518524: probe:kvm_io_bus_register_dev: (ffff80008005f0e8)
firecracker 19916 [032] 404.518539: probe:kvm_io_bus_register_dev__return: (ffff80008005f0e8 <- ffff8000800a6d2c)
firecracker 19916 [032] 404.526900: probe:kvm_io_bus_register_dev: (ffff80008005f0e8)
firecracker 19916 [033] 404.526924: probe:kvm_io_bus_register_dev__return: (ffff80008005f0e8 <- ffff800080060168)
firecracker 19916 [033] 404.526926: probe:kvm_io_bus_register_dev: (ffff80008005f0e8)
firecracker 19916 [032] 404.526941: probe:kvm_io_bus_register_dev__return: (ffff80008005f0e8 <- ffff800080060168)
fc_vcpu 0 19924 [035] 404.530829: probe:kvm_io_bus_register_dev: (ffff80008005f0e8)
fc_vcpu 0 19924 [035] 404.530848: probe:kvm_io_bus_register_dev__return: (ffff80008005f0e8 <- ffff80008009f6b4)
With the patch (1-6 us per call):
firecracker 22806 [032] 427.687157: probe:kvm_vm_ioctl_create_vcpu: (ffff800080059b38)
firecracker 22806 [032] 427.687174: probe:kvm_vm_ioctl_create_vcpu: (ffff800080059b38)
firecracker 22806 [032] 427.687193: probe:kvm_io_bus_register_dev: (ffff80008005f128)
firecracker 22806 [032] 427.687196: probe:kvm_io_bus_register_dev__return: (ffff80008005f128 <- ffff8000800a19cc)
firecracker 22806 [032] 427.687196: probe:kvm_io_bus_register_dev: (ffff80008005f128)
firecracker 22806 [032] 427.687197: probe:kvm_io_bus_register_dev__return: (ffff80008005f128 <- ffff8000800a19cc)
firecracker 22806 [032] 427.687201: probe:kvm_io_bus_register_dev: (ffff80008005f128)
firecracker 22806 [032] 427.687202: probe:kvm_io_bus_register_dev__return: (ffff80008005f128 <- ffff8000800a6d6c)
firecracker 22806 [029] 427.707660: probe:kvm_io_bus_register_dev: (ffff80008005f128)
firecracker 22806 [029] 427.707666: probe:kvm_io_bus_register_dev__return: (ffff80008005f128 <- ffff8000800601a8)
firecracker 22806 [029] 427.707667: probe:kvm_io_bus_register_dev: (ffff80008005f128)
firecracker 22806 [029] 427.707668: probe:kvm_io_bus_register_dev__return: (ffff80008005f128 <- ffff8000800601a8)
fc_vcpu 0 22829 [030] 427.711642: probe:kvm_io_bus_register_dev: (ffff80008005f128)
fc_vcpu 0 22829 [030] 427.711645: probe:kvm_io_bus_register_dev__return: (ffff80008005f128 <- ffff80008009f6f4)
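For reference, the per-call figure for the patched kernel can be derived by pairing each probe entry with its return. The sketch below does that for the "with the patch" trace only; the struct and function names are hypothetical, and the timestamps are copied from the trace and converted to microseconds.

```c
#include <assert.h>
#include <stddef.h>

/* One kvm_io_bus_register_dev() call: entry and return time in us. */
struct probe_pair {
	unsigned long entry_us;
	unsigned long return_us;
};

/* From the "with the patch" trace, e.g. 427.687193 s -> 427687193 us. */
static const struct probe_pair with_patch[] = {
	{ 427687193UL, 427687196UL },
	{ 427687196UL, 427687197UL },
	{ 427687201UL, 427687202UL },
	{ 427707660UL, 427707666UL },
	{ 427707667UL, 427707668UL },
	{ 427711642UL, 427711645UL },
};

/* Computes the shortest and longest call duration; requires n >= 1. */
static void latency_range_us(const struct probe_pair *p, size_t n,
			     unsigned long *min_us, unsigned long *max_us)
{
	size_t i;

	*min_us = *max_us = p[0].return_us - p[0].entry_us;
	for (i = 1; i < n; i++) {
		unsigned long d = p[i].return_us - p[i].entry_us;

		if (d < *min_us)
			*min_us = d;
		if (d > *max_us)
			*max_us = d;
	}
}
```

Calling latency_range_us() on with_patch yields a minimum of 1 us and a maximum of 6 us, matching the range given above.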
Also, on ARM it is KVM_SET_USER_MEMORY_REGION (not KVM_CREATE_VCPU) that takes the hit, seemingly for the same reason:
45736 17:30:10.251430 ioctl(17, KVM_SET_USER_MEMORY_REGION, {slot=0, flags=0, guest_phys_addr=0x80000000, memory_size=12884901888, userspace_addr=0xfffcbedd6000}) = 0 <0.021021>
vs
30694 17:33:01.128985 ioctl(17, KVM_SET_USER_MEMORY_REGION, {slot=0, flags=0, guest_phys_addr=0x80000000, memory_size=12884901888, userspace_addr=0xfffc91fc9000}) = 0 <0.000016>
I am not aware of a way to make it fast for both use cases and would be more
than happy to hear about possible solutions.
What if we key off of vCPUs being created? The motivation for Keir's change was
to avoid stalling during VM boot, i.e. *after* initial VM creation.
It doesn't work as is on x86 because the delay we're seeing occurs after created_vcpus gets incremented, so the check cannot differentiate the two cases (below is kvm_vm_ioctl_create_vcpu()):
	kvm->created_vcpus++; // <===== incremented here
	mutex_unlock(&kvm->lock);

	vcpu = kmem_cache_zalloc(kvm_vcpu_cache, GFP_KERNEL_ACCOUNT);
	if (!vcpu) {
		r = -ENOMEM;
		goto vcpu_decrement;
	}

	BUILD_BUG_ON(sizeof(struct kvm_run) > PAGE_SIZE);
	page = alloc_page(GFP_KERNEL_ACCOUNT | __GFP_ZERO);
	if (!page) {
		r = -ENOMEM;
		goto vcpu_free;
	}
	vcpu->run = page_address(page);

	kvm_vcpu_init(vcpu, kvm, id);

	r = kvm_arch_vcpu_create(vcpu); // <===== the delay is here
firecracker 583 [001] 151.297145: probe:synchronize_srcu_expedited: (ffffffff813e5cf0)
ffffffff813e5cf1 synchronize_srcu_expedited+0x1 ([kernel.kallsyms])
ffffffff81234986 kvm_swap_active_memslots+0x136 ([kernel.kallsyms])
ffffffff81236cdd kvm_set_memslot+0x1cd ([kernel.kallsyms])
ffffffff81237518 kvm_set_memory_region.part.0+0x478 ([kernel.kallsyms])
ffffffff81264dbc __x86_set_memory_region+0xec ([kernel.kallsyms])
ffffffff8127e2dc kvm_alloc_apic_access_page+0x5c ([kernel.kallsyms])
ffffffff812b9ed3 vmx_vcpu_create+0x193 ([kernel.kallsyms])
ffffffff8126788a kvm_arch_vcpu_create+0x1da ([kernel.kallsyms])
ffffffff8123c54c kvm_vm_ioctl+0x5fc ([kernel.kallsyms])
ffffffff8167b331 __x64_sys_ioctl+0x91 ([kernel.kallsyms])
ffffffff8251a89c do_syscall_64+0x4c ([kernel.kallsyms])
ffffffff8100012b entry_SYSCALL_64_after_hwframe+0x76 ([kernel.kallsyms])
6512de ioctl+0x32 (/mnt/host/firecracker)
d99a7 std::rt::lang_start+0x37 (/mnt/host/firecracker)
Also, given that on ARM the stall occurs after KVM_CREATE_VCPU (in KVM_SET_USER_MEMORY_REGION), this doesn't look like a universal solution.
--
From: Sean Christopherson <seanjc@xxxxxxxxxx>
Date: Fri, 13 Feb 2026 15:15:01 -0800
Subject: [PATCH] KVM: Synchronize SRCU on I/O device registration if vCPUs
haven't been created
TODO: Write a changelog if this works.
Fixes: 7d9a0273c459 ("KVM: Avoid synchronize_srcu() in kvm_io_bus_register_dev()")
Reported-by: Nikita Kalyazin <kalyazin@xxxxxxxxxx>
Closes: https://lkml.kernel.org/r/a84ddba8-12da-489a-9dd1-ccdf7451a1ba%40amazon.com
Cc: stable@xxxxxxxxxxxxxxx
Signed-off-by: Sean Christopherson <seanjc@xxxxxxxxxx>
---
virt/kvm/kvm_main.c | 25 ++++++++++++++++++++++++-
1 file changed, 24 insertions(+), 1 deletion(-)
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 571cf0d6ec01..043b1c3574ab 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -6027,7 +6027,30 @@ int kvm_io_bus_register_dev(struct kvm *kvm, enum kvm_bus bus_idx, gpa_t addr,
 	memcpy(new_bus->range + i + 1, bus->range + i,
 		(bus->dev_count - i) * sizeof(struct kvm_io_range));
 	rcu_assign_pointer(kvm->buses[bus_idx], new_bus);
-	call_srcu(&kvm->srcu, &bus->rcu, __free_bus);
+
+	/*
+	 * To optimize VM creation *and* boot time, use different tactics for
+	 * safely freeing the old bus based on where the VM is at in its
+	 * lifecycle.  If vCPUs haven't yet been created, simply synchronize
+	 * and free, as there are unlikely to be active SRCU readers; if not,
+	 * defer freeing the bus via SRCU callback.
+	 *
+	 * If there are active SRCU readers, synchronizing will stall until the
+	 * current grace period completes, which can meaningfully impact boot
+	 * time for VMs that trigger a large number of registrations.
+	 *
+	 * If there aren't SRCU readers, using an SRCU callback can be a net
+	 * negative due to starting a grace period of its own, which in turn
+	 * can unnecessarily cause a future synchronization to stall.  E.g. if
+	 * devices are registered before memslots are created, then creating
+	 * the first memslot will have to wait for a superfluous grace period.
+	 */
+	if (!READ_ONCE(kvm->created_vcpus)) {
+		synchronize_srcu_expedited(&kvm->srcu);
+		kfree(bus);
+	} else {
+		call_srcu(&kvm->srcu, &bus->rcu, __free_bus);
+	}
 
 	return 0;
 }
base-commit: 183bb0ce8c77b0fd1fb25874112bc8751a461e49
--