Re: [PATCH 3/3] KVM: x86: Decouple APICv activation state from apicv_inhibit_reasons

From: Sean Christopherson
Date: Tue Feb 04 2025 - 14:19:14 EST


On Mon, Feb 03, 2025, Maxim Levitsky wrote:
> On Mon, 2025-02-03 at 15:46 -0800, Sean Christopherson wrote:
> > On Mon, Feb 03, 2025, Paolo Bonzini wrote:
> > > On 2/3/25 19:45, Sean Christopherson wrote:
> > > > Unless there's a very, very good reason to support a use case that generates
> > > > ExtInts during boot, but _only_ during boot, and otherwise doesn't have any APICv
> > > > inhibits, I'm leaning towards making SVM's IRQ window inhibit sticky, i.e. never
> > > > clear it.
> > >
> > > BIOS tends to use PIT, so that may be too much. With respect to Naveen's report
> > > of contention on apicv_update_lock, I would go with the sticky-bit idea but apply
> > > it to APICV_INHIBIT_REASON_PIT_REINJ.
> >
> > That won't work, at least not without yet more changes, because KVM creates
> > the in-kernel PIT with reinjection enabled by default. The sticky-bit idea is
> > that if a bit is set and can never be cleared, then there's no need to track
> > new updates. Since userspace needs to explicitly disable reinjection, the
> > inhibit can't be sticky.
> I confirmed this with a trace; this is indeed the case.
>
> >
> > I assume we could fudge around that easily enough by deferring the inhibit
> > until a vCPU is created (or run?), but piggybacking on PIT_REINJ won't help
> > the userspace I/O APIC case.
> >
> > > I don't love adding another inhibit reason but, together, these two should
> > > remove the contention on apicv_update_lock. Another idea could be to move
> > > IRQWIN to a per-vCPU reason, but Maxim tells me that it's not so easy.
>
> I retract this statement; it was based on my knowledge from back when I
> implemented it.
>
> Looking at the current code again, this should be possible and can be a nice
> cleanup regardless.
>
> (Or I might just have forgotten the reason that made me think back then that
> this wasn't worth it, because I do remember well that I wanted to make the
> IRQWIN inhibit per-vCPU.)

The complication is the APIC page. That's not a problem for vCPUs running in L2
because they'll use a different MMU, i.e. a different set of SPTEs that never map
the APIC backing page. At least, that's how it's supposed to work[*]. ;-)

For IRQWIN, turning off APICv for the current vCPU will leave the APIC SPTEs in
place and so KVM will fail to intercept accesses to the APIC page. And making
IRQWIN a per-vCPU inhibit won't help performance in the case where there is no
other inhibit, because (a) toggling it on/off requires taking mmu_lock for writing
and doing a remote TLB flush, and (b) unless the guest is doing something bizarre,
only one vCPU will be receiving ExtInt IRQs. I.e. I don't think trying to make
IRQWIN a pure per-vCPU inhibit is necessary for performance.

After fiddling with a bunch of ideas, I think the best approach to address both
issues is to add a counter for the IRQ window (addresses the per-vCPU aspect of
IRQ windows), set/clear the IRQWIN inhibit according to the counter when *any*
inhibit changes, and then force an immediate update if and only if the count hits
a 0<=>1 transition *and* there is no other inhibit. That would address the flaw
Naveen found without needing to make PIT_REINJ sticky.
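
In other words, the intended transitions are (a distilled sketch, not the
exact code; the real logic is in kvm_inc_or_dec_irq_window_inhibit() below):

  0 => 1 or 1 => 0, no other inhibit set - toggle IRQWIN immediately (slow
                                           path, takes the lock for write)
  0 => 1 or 1 => 0, other inhibit(s) set - defer; IRQWIN is re-evaluated on
                                           the next inhibit change
  any other transition (1 => 2, ...)     - nothing to do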

Guarding the count with apicv_update_lock held for read ensures that, if there
is a racing change to a different inhibit, either kvm_inc_or_dec_irq_window_inhibit()
will see no inhibits and go down the slow path, or __kvm_set_or_clear_apicv_inhibit()
will set IRQWIN accordingly.
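
E.g. consider a hypothetical race between a vCPU requesting an IRQ window
(count 0 => 1) and another task clearing the last other inhibit, say
PIT_REINJ. The rwsem serializes the two critical sections, so either
ordering converges on IRQWIN being set:

  vCPU (read lock)                      other task (write lock)
  ----------------                      -----------------------
  atomic_inc(apicv_irq_window) 0 => 1
  sees PIT_REINJ still set => defers
                                        clears PIT_REINJ, reads the count
                                        (== 1) and sets IRQWIN, i.e. APICv
                                        stays inhibited

or, if the write side wins:

                                        clears PIT_REINJ, reads the count
                                        (== 0), IRQWIN stays clear
  atomic_inc(apicv_irq_window) 0 => 1
  sees no other inhibit => slow path,
  sets IRQWIN under the write lock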

Compile tested only (and probably needs to be split into multiple patches). I'll
try to take it for a spin later today.

[*] https://lore.kernel.org/all/20250130010825.220346-1-seanjc@xxxxxxxxxx

---
arch/x86/include/asm/kvm_host.h | 13 ++++++++++
arch/x86/kvm/svm/svm.c          | 43 +++++++++------------------------
arch/x86/kvm/svm/svm.h          |  1 +
arch/x86/kvm/x86.c              | 36 ++++++++++++++++++++++++++-
4 files changed, 61 insertions(+), 32 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 5193c3dfbce1..9e3465e70a0a 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1365,6 +1365,7 @@ struct kvm_arch {
/* Protects apicv_inhibit_reasons */
struct rw_semaphore apicv_update_lock;
unsigned long apicv_inhibit_reasons;
+ atomic_t apicv_irq_window;

gpa_t wall_clock;

@@ -2203,6 +2204,18 @@ static inline void kvm_clear_apicv_inhibit(struct kvm *kvm,
kvm_set_or_clear_apicv_inhibit(kvm, reason, false);
}

+void kvm_inc_or_dec_irq_window_inhibit(struct kvm *kvm, bool inc);
+
+static inline void kvm_inc_apicv_irq_window(struct kvm *kvm)
+{
+ kvm_inc_or_dec_irq_window_inhibit(kvm, true);
+}
+
+static inline void kvm_dec_apicv_irq_window(struct kvm *kvm)
+{
+ kvm_inc_or_dec_irq_window_inhibit(kvm, false);
+}
+
unsigned long __kvm_emulate_hypercall(struct kvm_vcpu *vcpu, unsigned long nr,
unsigned long a0, unsigned long a1,
unsigned long a2, unsigned long a3,
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 7640a84e554a..668db3bfff3d 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -1636,9 +1636,13 @@ static void svm_set_vintr(struct vcpu_svm *svm)
struct vmcb_control_area *control;

/*
- * The following fields are ignored when AVIC is enabled
+ * vIRQ is ignored by hardware when AVIC is enabled, and so AVIC must be
+ * inhibited to detect the interrupt window.
*/
- WARN_ON(kvm_vcpu_apicv_activated(&svm->vcpu));
+ if (enable_apicv && !is_guest_mode(&svm->vcpu)) {
+ svm->avic_irq_window = true;
+ kvm_inc_apicv_irq_window(svm->vcpu.kvm);
+ }

svm_set_intercept(svm, INTERCEPT_VINTR);

@@ -1666,6 +1670,11 @@ static void svm_set_vintr(struct vcpu_svm *svm)

static void svm_clear_vintr(struct vcpu_svm *svm)
{
+ if (svm->avic_irq_window && !is_guest_mode(&svm->vcpu)) {
+ svm->avic_irq_window = false;
+ kvm_dec_apicv_irq_window(svm->vcpu.kvm);
+ }
+
svm_clr_intercept(svm, INTERCEPT_VINTR);

/* Drop int_ctl fields related to VINTR injection. */
@@ -3219,20 +3228,6 @@ static int interrupt_window_interception(struct kvm_vcpu *vcpu)
kvm_make_request(KVM_REQ_EVENT, vcpu);
svm_clear_vintr(to_svm(vcpu));

- /*
- * If not running nested, for AVIC, the only reason to end up here is ExtINTs.
- * In this case AVIC was temporarily disabled for
- * requesting the IRQ window and we have to re-enable it.
- *
- * If running nested, still remove the VM wide AVIC inhibit to
- * support case in which the interrupt window was requested when the
- * vCPU was not running nested.
-
- * All vCPUs which run still run nested, will remain to have their
- * AVIC still inhibited due to per-cpu AVIC inhibition.
- */
- kvm_clear_apicv_inhibit(vcpu->kvm, APICV_INHIBIT_REASON_IRQWIN);
-
++vcpu->stat.irq_window_exits;
return 1;
}
@@ -3879,22 +3874,8 @@ static void svm_enable_irq_window(struct kvm_vcpu *vcpu)
* enabled, the STGI interception will not occur. Enable the irq
* window under the assumption that the hardware will set the GIF.
*/
- if (vgif || gif_set(svm)) {
- /*
- * IRQ window is not needed when AVIC is enabled,
- * unless we have pending ExtINT since it cannot be injected
- * via AVIC. In such case, KVM needs to temporarily disable AVIC,
- * and fallback to injecting IRQ via V_IRQ.
- *
- * If running nested, AVIC is already locally inhibited
- * on this vCPU, therefore there is no need to request
- * the VM wide AVIC inhibition.
- */
- if (!is_guest_mode(vcpu))
- kvm_set_apicv_inhibit(vcpu->kvm, APICV_INHIBIT_REASON_IRQWIN);
-
+ if (vgif || gif_set(svm))
svm_set_vintr(svm);
- }
}

static void svm_enable_nmi_window(struct kvm_vcpu *vcpu)
diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
index 9d7cdb8fbf87..8eefed0a865a 100644
--- a/arch/x86/kvm/svm/svm.h
+++ b/arch/x86/kvm/svm/svm.h
@@ -323,6 +323,7 @@ struct vcpu_svm {

bool guest_state_loaded;

+ bool avic_irq_window;
bool x2avic_msrs_intercepted;

/* Guest GIF value, used when vGIF is not enabled */
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index b2d9a16fd4d3..7388f4cfe468 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -10604,7 +10604,11 @@ void __kvm_set_or_clear_apicv_inhibit(struct kvm *kvm,

old = new = kvm->arch.apicv_inhibit_reasons;

- set_or_clear_apicv_inhibit(&new, reason, set);
+ if (reason != APICV_INHIBIT_REASON_IRQWIN)
+ set_or_clear_apicv_inhibit(&new, reason, set);
+
+ set_or_clear_apicv_inhibit(&new, APICV_INHIBIT_REASON_IRQWIN,
+ atomic_read(&kvm->arch.apicv_irq_window));

if (!!old != !!new) {
/*
@@ -10645,6 +10649,36 @@ void kvm_set_or_clear_apicv_inhibit(struct kvm *kvm,
}
EXPORT_SYMBOL_GPL(kvm_set_or_clear_apicv_inhibit);

+void kvm_inc_or_dec_irq_window_inhibit(struct kvm *kvm, bool inc)
+{
+ bool toggle;
+
+ /*
+ * The IRQ window inhibit has a cyclical dependency of sorts, as KVM
+ * needs to manually inject IRQs and thus detect interrupt windows if
+ * APICv is disabled/inhibited. To avoid thrashing if the IRQ window
+ * is being requested because APICv is already inhibited, toggle the
+ * actual inhibit (and take the lock for write) if and only if there's
+ * no other inhibit. KVM evaluates the IRQ window count when _any_
+ * inhibit changes, i.e. the IRQ window inhibit can be lazily updated
+ * on the next inhibit change (if one ever occurs).
+ */
+ down_read(&kvm->arch.apicv_update_lock);
+
+ if (inc)
+ toggle = atomic_inc_return(&kvm->arch.apicv_irq_window) == 1;
+ else
+ toggle = atomic_dec_return(&kvm->arch.apicv_irq_window) == 0;
+
+ toggle = toggle && !(kvm->arch.apicv_inhibit_reasons & ~BIT(APICV_INHIBIT_REASON_IRQWIN));
+
+ up_read(&kvm->arch.apicv_update_lock);
+
+ if (toggle)
+ kvm_set_or_clear_apicv_inhibit(kvm, APICV_INHIBIT_REASON_IRQWIN, inc);
+}
+EXPORT_SYMBOL_GPL(kvm_inc_or_dec_irq_window_inhibit);
+
static void vcpu_scan_ioapic(struct kvm_vcpu *vcpu)
{
if (!kvm_apic_present(vcpu))

base-commit: eb723766b1030a23c38adf2348b7c3d1409d11f0
--