Re: [PATCH v2 0/9] sched/kvm: Semantics-aware vCPU scheduling for oversubscribed KVM

From: Wanpeng Li

Date: Wed Apr 01 2026 - 06:05:02 EST


Hi Sean,

On Fri, 13 Mar 2026 at 09:13, Sean Christopherson <seanjc@xxxxxxxxxx> wrote:
>
> On Fri, Dec 19, 2025, Wanpeng Li wrote:
> > Part 2: KVM IPI-Aware Directed Yield (patches 6-9)
> >
> > Enhance kvm_vcpu_on_spin() with lightweight IPI tracking to improve
> > directed yield candidate selection. Track sender/receiver relationships
> > when IPIs are delivered and use this information to prioritize yield
> > targets.
> >
> > The tracking mechanism:
> >
> > - Hooks into kvm_irq_delivery_to_apic() to detect unicast fixed IPIs (the
> > common case for inter-processor synchronization). When exactly one
> > destination vCPU receives an IPI, record the sender->receiver relationship
> > with a monotonic timestamp.
> >
> > In high VM density scenarios, software-based IPI tracking through
> > interrupt delivery interception becomes particularly valuable. It
> > captures precise sender/receiver relationships that can be leveraged
> > for intelligent scheduling decisions, providing performance benefits
> > that complement or even exceed hardware-accelerated interrupt delivery
> > in overcommitted environments.
> >
> > - Uses lockless READ_ONCE/WRITE_ONCE accessors for minimal overhead. The
> > per-vCPU ipi_context structure is carefully designed to avoid cache line
> > bouncing.
> >
> > - Implements a short recency window (50ms default) to avoid stale IPI
> > information inflating boost priority on throughput-sensitive workloads.
> > Old IPI relationships are naturally aged out.
> >
> > - Clears IPI context on EOI with two-stage precision: unconditionally clear
> > the receiver's context (it processed the interrupt), but only clear the
> > sender's pending flag if the receiver matches and the IPI is recent. This
> > prevents unrelated EOIs from prematurely clearing valid IPI state.
>
> That all relies on lack of IPI and EOI virtualization, which seems very
> counter-productive given the way hardware is headed.

I think there is an important distinction here. APICv / posted
interrupts accelerate IPI *delivery*, but they do not help with the
host-side *scheduling decision* in kvm_vcpu_on_spin(). A posted
interrupt can land in a not-yet-scheduled vCPU's PIR, but that vCPU
still won't process it until it actually gets CPU time. IPI tracking
targets exactly this gap: deciding which vCPU to yield to right now.

In high VM density / overcommitted scenarios, APICv's advantage
narrows precisely because the bottleneck shifts from IPI delivery
latency to *scheduling latency* — the target vCPU may have its posted
interrupt sitting in PIR but cannot process it because it is competing
for physical CPU time with many other vCPUs. In that regime, making a
better yield-to decision on the host side has a more direct impact on
end-to-end IPI response time than faster hardware delivery to a vCPU
that isn't running.

So I would not characterize IPI tracking as a workaround for lack of
hardware virtualization support. It addresses an orthogonal problem —
host-side scheduling decisions — that hardware IPI acceleration does
not solve. The two are complementary: APICv makes delivery fast when
the target is running; IPI-aware directed yield makes scheduling
better when the target is not running.

>
> My reaction to all of this is that in the long run, we'd be far better off getting
> the guest to "cooperate" in the sense of communicating intent, status, etc. As

I agree that guest cooperation / PV scheduling hints could provide
even richer semantic information for these decisions in the long term.
The host-side IPI tracking approach has the practical advantage of
working with unmodified guests and mixed-OS deployments today, which
covers a large fraction of real-world overcommitted environments.

These two directions are not mutually exclusive: PV hints can coexist
with host-side heuristics, and the IPI tracking infrastructure could
serve as a useful fallback or baseline even after PV scheduling becomes
available, covering guests that do not support it.

Regards,
Wanpeng