Re: [PATCH 3/3] KVM: x86: Decouple APICv activation state from apicv_inhibit_reasons
From: Naveen N Rao
Date: Tue Feb 04 2025 - 06:16:03 EST
On Mon, Feb 03, 2025 at 03:46:48PM -0800, Sean Christopherson wrote:
> On Mon, Feb 03, 2025, Paolo Bonzini wrote:
> > On 2/3/25 19:45, Sean Christopherson wrote:
> > > Unless there's a very, very good reason to support a use case that generates
> > > ExtInts during boot, but _only_ during boot, and otherwise doesn't have any APICv
> > > ihibits, I'm leaning towards making SVM's IRQ window inhibit sticky, i.e. never
> > > clear it.
> >
> > BIOS tends to use PIT, so that may be too much. With respect to Naveen's report
> > of contention on apicv_update_lock, I would go with the sticky-bit idea but apply
> > it to APICV_INHIBIT_REASON_PIT_REINJ.
>
> That won't work, at least not with yet more changes, because KVM creates the
> in-kernel PIT with reinjection enabled by default. The stick-bit idea is that
> if a bit is set and can never be cleared, then there's no need to track new
> updates. Since userspace needs to explicitly disable reinjection, the inhibit
> can't be sticky.
>
> I assume We could fudge around that easily enough by deferring the inhibit until
> a vCPU is created (or run?), but piggybacking PIT_REINJ won't help the userspace
> I/O APIC case.
As a separate change, I have been testing a patch that moves the
PIT_REINJ inhibit from PIT creation to the point at which the guest
actually programs it so that default guest configurations can utilize
AVIC:
diff --git a/arch/x86/kvm/i8254.c b/arch/x86/kvm/i8254.c
index cd57a517d04a..8f959de7ff32 100644
--- a/arch/x86/kvm/i8254.c
+++ b/arch/x86/kvm/i8254.c
@@ -235,6 +235,7 @@ static void destroy_pit_timer(struct kvm_pit *pit)
{
hrtimer_cancel(&pit->pit_state.timer);
kthread_flush_work(&pit->expired);
+ kvm_clear_apicv_inhibit(pit->kvm, APICV_INHIBIT_REASON_PIT_REINJ);
}
static void pit_do_work(struct kthread_work *work)
@@ -296,22 +297,12 @@ void kvm_pit_set_reinject(struct kvm_pit *pit, bool reinject)
if (atomic_read(&ps->reinject) == reinject)
return;
- /*
- * AMD SVM AVIC accelerates EOI write and does not trap.
- * This cause in-kernel PIT re-inject mode to fail
- * since it checks ps->irq_ack before kvm_set_irq()
- * and relies on the ack notifier to timely queue
- * the pt->worker work iterm and reinject the missed tick.
- * So, deactivate APICv when PIT is in reinject mode.
- */
if (reinject) {
- kvm_set_apicv_inhibit(kvm, APICV_INHIBIT_REASON_PIT_REINJ);
/* The initial state is preserved while ps->reinject == 0. */
kvm_pit_reset_reinject(pit);
kvm_register_irq_ack_notifier(kvm, &ps->irq_ack_notifier);
kvm_register_irq_mask_notifier(kvm, 0, &pit->mask_notifier);
} else {
- kvm_clear_apicv_inhibit(kvm, APICV_INHIBIT_REASON_PIT_REINJ);
kvm_unregister_irq_ack_notifier(kvm, &ps->irq_ack_notifier);
kvm_unregister_irq_mask_notifier(kvm, 0, &pit->mask_notifier);
}
@@ -358,6 +349,16 @@ static void create_pit_timer(struct kvm_pit *pit, u32 val, int is_period)
}
}
+ /*
+ * AMD SVM AVIC accelerates EOI write and does not trap. This causes
+ * in-kernel PIT re-inject mode to fail since it checks ps->irq_ack
+ * before kvm_set_irq() and relies on the ack notifier to timely queue
+ * the pt->worker work item and reinject the missed tick. So,
+ * deactivate APICv when PIT is in reinject mode, but only when the
+ * timer is actually armed.
+ */
+ kvm_set_apicv_inhibit(kvm, APICV_INHIBIT_REASON_PIT_REINJ);
+
hrtimer_start(&ps->timer, ktime_add_ns(ktime_get(), interval),
HRTIMER_MODE_ABS);
}
Is that reasonable?
If it is, or if we choose to delay PIT_REINJ inhibit to vcpu creation time,
then making PT_REINJ or IRQWIN inhibits sticky will prevent AVIC from being
enabled later on. I can see in my tests that BIOS (both seabios and edk2)
programs the PIT though Linux guest itself doesn't (unless -no-hpet is used).
The above change will allow AVIC to be used in Linux guests even when a PIT is
present.
>
> > I don't love adding another inhibit reason but, together, these two should
> > remove the contention on apicv_update_lock. Another idea could be to move
> > IRQWIN to per-vCPU reason but Maxim tells me that it's not so easy.
>
> Oh, yeah, that reminds me of the other reason I would vote for a sticky flag:
> if inhibition really is toggling rapidly, performance is going to be quite bad
> because inhibiting APICv requires (a) zapping APIC SPTEs and (b) serializing
> writers if multiple vCPUs trigger the 0=>1 transition.
>
> And there's some amount of serialization even if there's only a single writer,
> as KVM kicks all vCPUs to toggle APICv (and again to flush TLBs, if necessary).
>
> Hmm, something doesn't add up. Naveen's changelog says:
>
> KVM additionally inhibits AVIC for requesting a IRQ window every time it has
> to inject external interrupts resulting in a barrage of inhibits being set and
> cleared. This shows significant performance degradation compared to AVIC being
> disabled, due to high contention on apicv_update_lock.
>
> But if this is a "real world" use case where the only source of ExtInt is the
> PIT, and kernels typically only wire up the PIT to the BSP, why is there
> contention on apicv_update_lock? APICv isn't actually being toggled, so readers
> blocking writers to handle KVM_REQ_APICV_UPDATE shouldn't be a problem.
>
> Naveen, do you know why there's a contention on apicv_update_lock? Are multiple
> vCPUs actually trying to inject ExtInt?
Apologies for the confusion, I should probably have said "device
interrupts" (sorry, I'm still trying to get my terminology right). I
have described the test scenario in the cover letter, but to add more
details: the test involves two guests on the same host (using network
tap devices with vhost and mq) with one guest running netperf TCP_RR to
the other guest. Trimmed qemu command:
qemu-system-x86_64 -machine q35,accel=kvm,kernel-irqchip=on \
-smp 16,cores=8,threads=2 -cpu host,topoext=on,x2apic=on -m 32G \
-drive file=~/ubuntu_vm1.img,if=virtio,format=qcow2 \
-netdev tap,id=v0,br=virbr0,script=~/qemu-ifup,downscript=~/qemu-ifdown,vhost=on,queues=16 \
-device virtio-net-pci,netdev=v0,mac=52:54:00:12:34:11,mq=on,vectors=34
So, not a real world workload per se, though there may be use cases for guests
having to handle large number of packets.
You're right -- APICv isn't actually being toggled, but IRQWIN inhibit is
constantly being set and cleared while trying to inject device interrupts into
the guests. The places where we set/clear IRQWIN inhibit has comments
indicating that it is only required for ExtINT, though that's not actually the
case here.
What is actually happening is that since the PIT is in reinject mode, APICv is
not active in the guest. When that happens, kvm_cpu_has_injectable_intr()
returns true when any interrupt is pending:
/*
* check if there is injectable interrupt:
* when virtual interrupt delivery enabled,
* interrupt from apic will handled by hardware,
* we don't need to check it here.
*/
int kvm_cpu_has_injectable_intr(struct kvm_vcpu *v)
{
if (kvm_cpu_has_extint(v))
return 1;
if (!is_guest_mode(v) && kvm_vcpu_apicv_active(v))
return 0;
return kvm_apic_has_interrupt(v) != -1; /* LAPIC */
}
The second if condition fails since APICv is not active. So,
kvm_check_and_inject_events() calls enable_irq_window() to request for an IRQ
window to inject those interrupts. That results in all vCPUs trying to acquire
apicv_update_lock for updating the inhibit reason, though APICv state itself
isn't changing and we are not requesting for KVM_REQ_APICV_UPDATE.
Thanks,
Naveen