Re: [PATCH v4] KVM: x86: ioapic: Optimize EOI handling to reduce unnecessary VM exits
From: Sean Christopherson
Date: Mon Mar 03 2025 - 12:59:58 EST
Several minor comments. No need to post v5, I'll do so today as a small series
with preparatory patches to refactor and deduplicate the userspace vs. in-kernel
logic.
On Mon, Mar 03, 2025, weizijie wrote:
> Configuring IOAPIC routed interrupts triggers KVM to rescan all vCPU's
> ioapic_handled_vectors which is used to control which vectors need to
> trigger EOI-induced VMEXITs. If any interrupt is already in service on
> some vCPU using some vector when the IOAPIC is being rescanned, the
> vector is set to vCPU's ioapic_handled_vectors. If the vector is then
> reused by other interrupts, each of them will cause a VMEXIT even it is
> unnecessary. W/o further IOAPIC rescan, the vector remains set, and this
> issue persists, impacting guest's interrupt performance.
>
> Both
>
> commit db2bdcbbbd32 ("KVM: x86: fix edge EOI and IOAPIC reconfig race")
>
> and
>
> commit 0fc5a36dd6b3 ("KVM: x86: ioapic: Fix level-triggered EOI and
> IOAPIC reconfigure race")
>
> mentioned this issue, but it was considered as "rare" thus was not
> addressed. However in real environment this issue can actually happen
> in a well-behaved guest.
>
> Simple Fix Proposal:
> A straightforward solution is to record highest in-service IRQ that
> is pending at the time of the last scan. Then, upon the next guest
> exit, do a full KVM_REQ_SCAN_IOAPIC. This ensures that a re-scan of
> the ioapic occurs only when the recorded vector is EOI'd, and
> subsequently, the extra bits in the eoi_exit_bitmap are cleared,
> avoiding unnecessary VM exits.
Write changelogs as "commands", i.e. state what changes are actually being made,
as opposed to passively describing a proposed/possible change.
> Co-developed-by: xuyun <xuyun_xy.xy@xxxxxxxxxxxxxxxxx>
> Signed-off-by: xuyun <xuyun_xy.xy@xxxxxxxxxxxxxxxxx>
> Signed-off-by: weizijie <zijie.wei@xxxxxxxxxxxxxxxxx>
> Reviewed-by: Kai Huang <kai.huang@xxxxxxxxx>
> ---
> arch/x86/include/asm/kvm_host.h | 1 +
> arch/x86/kvm/ioapic.c | 10 ++++++++--
> arch/x86/kvm/irq_comm.c | 9 +++++++--
> arch/x86/kvm/lapic.c | 13 +++++++++++++
> 4 files changed, 29 insertions(+), 4 deletions(-)
>
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 0b7af5902ff7..8c50e7b4a96f 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -1062,6 +1062,7 @@ struct kvm_vcpu_arch {
> #if IS_ENABLED(CONFIG_HYPERV)
> hpa_t hv_root_tdp;
> #endif
> + u8 last_pending_vector;
To be consistent with how KVM handles IRQs, this should probably be an "int" with
-1 as the "no pending EOI" value.
I also think we should go with a verbose name to try and precisely capture what
this field tracks, e.g. highest_stale_pending_ioapic_eoi. It's abusrdly long,
but with massaging and refactoring the line lengths are a non-issue, and I want
to avoid confusion with pending_ioapic_eoi and highest_isr_cache (and others).
> };
>
> struct kvm_lpage_info {
> diff --git a/arch/x86/kvm/ioapic.c b/arch/x86/kvm/ioapic.c
> index 995eb5054360..40252a800897 100644
> --- a/arch/x86/kvm/ioapic.c
> +++ b/arch/x86/kvm/ioapic.c
> @@ -297,10 +297,16 @@ void kvm_ioapic_scan_entry(struct kvm_vcpu *vcpu, ulong *ioapic_handled_vectors)
> u16 dm = kvm_lapic_irq_dest_mode(!!e->fields.dest_mode);
>
> if (kvm_apic_match_dest(vcpu, NULL, APIC_DEST_NOSHORT,
> - e->fields.dest_id, dm) ||
> - kvm_apic_pending_eoi(vcpu, e->fields.vector))
> + e->fields.dest_id, dm))
> __set_bit(e->fields.vector,
> ioapic_handled_vectors);
> + else if (kvm_apic_pending_eoi(vcpu, e->fields.vector)) {
> + __set_bit(e->fields.vector,
> + ioapic_handled_vectors);
> + vcpu->arch.last_pending_vector = e->fields.vector >
> + vcpu->arch.last_pending_vector ? e->fields.vector :
> + vcpu->arch.last_pending_vector;
Eh, no need to use a ternary operator, last_pending_vector only needs to be written
if it's changing.
> }
> }
> spin_unlock(&ioapic->lock);
> diff --git a/arch/x86/kvm/irq_comm.c b/arch/x86/kvm/irq_comm.c
> index 8136695f7b96..1d23c52576e1 100644
> --- a/arch/x86/kvm/irq_comm.c
> +++ b/arch/x86/kvm/irq_comm.c
> @@ -426,9 +426,14 @@ void kvm_scan_ioapic_routes(struct kvm_vcpu *vcpu,
>
> if (irq.trig_mode &&
> (kvm_apic_match_dest(vcpu, NULL, APIC_DEST_NOSHORT,
> - irq.dest_id, irq.dest_mode) ||
> - kvm_apic_pending_eoi(vcpu, irq.vector)))
> + irq.dest_id, irq.dest_mode)))
> __set_bit(irq.vector, ioapic_handled_vectors);
> + else if (kvm_apic_pending_eoi(vcpu, irq.vector)) {
> + __set_bit(irq.vector, ioapic_handled_vectors);
> + vcpu->arch.last_pending_vector = irq.vector >
> + vcpu->arch.last_pending_vector ? irq.vector :
> + vcpu->arch.last_pending_vector;
This is wrong. Well, maybe not "wrong" per se, but unnecessary. The trig_mode
check guards both the "new" and "old" routing cases, i.e. KVM's behavior is to
intercept EOIs for in-flight IRQs if and only if the IRQ is level-triggered.
This code really needs to be de-duplicated between the userspace and in-kernel
I/O APICs. It probably should have been de-duplicated by fceb3a36c29a ("KVM: x86:
ioapic: Fix level-triggered EOI and userspace I/OAPIC reconfigure race"), but it's
a much more pressing issue now.
> + }
> }
> }
> srcu_read_unlock(&kvm->irq_srcu, idx);
> diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
> index a009c94c26c2..7c540a0eb340 100644
> --- a/arch/x86/kvm/lapic.c
> +++ b/arch/x86/kvm/lapic.c
> @@ -1466,6 +1466,19 @@ static void kvm_ioapic_send_eoi(struct kvm_lapic *apic, int vector)
> if (!kvm_ioapic_handles_vector(apic, vector))
> return;
>
> + /*
> + * When there are instances where ioapic_handled_vectors is
> + * set due to pending interrupts, clean up the record and do
> + * a full KVM_REQ_SCAN_IOAPIC.
> + * This ensures the vector is cleared in the vCPU's ioapic_handled_vectors,
> + * if the vector is reused by non-IOAPIC interrupts, avoiding unnecessary
> + * EOI-induced VMEXITs for that vector.
> + */
> + if (apic->vcpu->arch.last_pending_vector == vector) {
> + apic->vcpu->arch.last_pending_vector = 0;
I think it makes sense to reset the field when KVM_REQ_SCAN_IOAPIC, mainly so
that it's more obviously correct, i.e. so that it's easier to see that the field
is reset immediately prior to scanning, along with the bitmap itself.
> + kvm_make_request(KVM_REQ_SCAN_IOAPIC, apic->vcpu);
> + }
> +
> /* Request a KVM exit to inform the userspace IOAPIC. */
> if (irqchip_split(apic->vcpu->kvm)) {
> apic->vcpu->arch.pending_ioapic_eoi = vector;
> --
> 2.43.5
>