Re: [PATCH v2 0/9] sched/kvm: Semantics-aware vCPU scheduling for oversubscribed KVM

From: Sean Christopherson

Date: Thu Apr 02 2026 - 19:43:28 EST


On Wed, Apr 01, 2026, Wanpeng Li wrote:
> Hi Sean,
> On Fri, 13 Mar 2026 at 09:13, Sean Christopherson <seanjc@xxxxxxxxxx> wrote:
> >
> > On Fri, Dec 19, 2025, Wanpeng Li wrote:
> > > Part 2: KVM IPI-Aware Directed Yield (patches 6-9)
> > >
> > > Enhance kvm_vcpu_on_spin() with lightweight IPI tracking to improve
> > > directed yield candidate selection. Track sender/receiver relationships
> > > when IPIs are delivered and use this information to prioritize yield
> > > targets.
> > >
> > > The tracking mechanism:
> > >
> > > - Hooks into kvm_irq_delivery_to_apic() to detect unicast fixed IPIs (the
> > > common case for inter-processor synchronization). When exactly one
> > > destination vCPU receives an IPI, record the sender->receiver relationship
> > > with a monotonic timestamp.
> > >
> > > In high-VM-density scenarios, software-based IPI tracking via
> > > interrupt-delivery interception becomes particularly valuable: it
> > > captures precise sender/receiver relationships that the host can use
> > > for scheduling decisions, and in overcommitted environments the
> > > resulting gains can complement or even exceed those of
> > > hardware-accelerated interrupt delivery.
> > >
> > > - Uses lockless READ_ONCE/WRITE_ONCE accessors for minimal overhead. The
> > > per-vCPU ipi_context structure is carefully designed to avoid cache line
> > > bouncing.
> > >
> > > - Implements a short recency window (50ms default) to avoid stale IPI
> > > information inflating boost priority on throughput-sensitive workloads.
> > > Old IPI relationships are naturally aged out.
> > >
> > > - Clears IPI context on EOI with two-stage precision: unconditionally clear
> > > the receiver's context (it processed the interrupt), but only clear the
> > > sender's pending flag if the receiver matches and the IPI is recent. This
> > > prevents unrelated EOIs from prematurely clearing valid IPI state.
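> > >
> > > The mechanism above can be sketched as a simplified userspace model
> > > (all names here, e.g. ipi_ctx and record_ipi(), are illustrative, not
> > > the actual patch code):

```c
/*
 * Simplified userspace sketch of the IPI-tracking scheme described above.
 * All names (ipi_ctx, record_ipi, ...) are illustrative, not the patch code.
 */
#include <stdbool.h>
#include <stdint.h>

#define NR_VCPUS       4
#define IPI_WINDOW_NS  (50ULL * 1000 * 1000)    /* 50ms recency window */

/* Stand-ins for the kernel's lockless accessors. */
#define READ_ONCE(x)     (*(volatile __typeof__(x) *)&(x))
#define WRITE_ONCE(x, v) (*(volatile __typeof__(x) *)&(x) = (v))

struct ipi_ctx {
	int      sender;    /* vCPU that sent the last unicast fixed IPI */
	uint64_t sent_ns;   /* monotonic timestamp of that IPI, 0 if none */
	bool     pending;   /* this vCPU sent an IPI and awaits the EOI */
};

static struct ipi_ctx ctx[NR_VCPUS];  /* one per vCPU, receiver-indexed */

/* Hook point: a unicast fixed IPI with exactly one destination vCPU. */
static void record_ipi(int sender, int receiver, uint64_t now_ns)
{
	WRITE_ONCE(ctx[receiver].sender, sender);
	WRITE_ONCE(ctx[receiver].sent_ns, now_ns);
	WRITE_ONCE(ctx[sender].pending, true);
}

/* Directed-yield check: boost only receivers of a *recent* IPI. */
static bool recent_ipi_target(int vcpu, uint64_t now_ns)
{
	uint64_t sent = READ_ONCE(ctx[vcpu].sent_ns);

	return sent && now_ns - sent < IPI_WINDOW_NS;
}

/*
 * Two-stage clear on EOI: unconditionally drop the receiver's context,
 * but clear the sender's pending flag only if the IPI was recent, so an
 * unrelated EOI cannot wipe valid state on the sender side.
 */
static void clear_on_eoi(int receiver, uint64_t now_ns)
{
	int  sender = READ_ONCE(ctx[receiver].sender);
	bool recent = recent_ipi_target(receiver, now_ns);

	WRITE_ONCE(ctx[receiver].sent_ns, 0);
	if (recent)
		WRITE_ONCE(ctx[sender].pending, false);
}
```

> > > E.g. after record_ipi(0, 1, t), recent_ipi_target(1, t + 1ms) holds and
> > > vCPU 0 is marked pending; an EOI from vCPU 1 inside the window drops the
> > > pending flag, while one past the window leaves the sender untouched.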
> >
> > That all relies on the lack of IPI and EOI virtualization, which seems very
> > counter-productive given the way hardware is headed.
>
> I think there is an important distinction here. APICv / posted
> interrupts accelerate IPI *delivery*, but they do not help with the
> host-side *scheduling decision* in kvm_vcpu_on_spin().

I know, but that doesn't change the reality of where hardware is headed (or rather,
already is).

> A posted interrupt can land in a not-yet-scheduled vCPU's PIR, but that vCPU
> still won't process it until it actually gets CPU time. IPI tracking targets
> exactly this gap: which vCPU should we yield to right now.
>
> In high VM density / overcommitted scenarios, APICv's advantage narrows
> precisely because the bottleneck shifts from IPI delivery latency to
> *scheduling latency*: the target vCPU may have its posted interrupt sitting
> in PIR but cannot process it because it is competing for physical CPU time
> with many other vCPUs. In that regime, making a better yield-to decision on
> the host side has a more direct impact on end-to-end IPI response time than
> faster hardware delivery to a vCPU that isn't running.
>
> So I would not characterize IPI tracking as a workaround for lack of hardware
> virtualization support. It addresses an orthogonal problem — host-side
> scheduling decisions — that hardware IPI acceleration does not solve. The two
> are complementary: APICv makes delivery fast when the target is running;
> IPI-aware directed yield makes scheduling better when the target is not
> running.

Except they aren't complementary in the sense that, as implemented, they are
mutually exclusive. The x86 changes here rely on tracking IPIs, and unless I'm
missing something in the series, that code falls apart when IPI virtualization
is enabled.
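
To make that concrete, a toy model (not kernel code) of the failure mode: the
tracking hook sits on the software emulation path, and hardware IPI
virtualization bypasses that path entirely, so the heuristic simply stops
seeing IPIs:

```c
/*
 * Toy model (not kernel code) of why interception-based IPI tracking goes
 * dark under hardware IPI virtualization: the emulation path hosting the
 * tracking hook is never executed, so no sender/receiver pairs are recorded
 * and directed yield falls back to blind candidate selection.
 */
#include <stdbool.h>

static int tracked_ipis;           /* what the yield heuristic gets to see */

/* Stand-in for the hook in the kvm_irq_delivery_to_apic() path. */
static void track_ipi_hook(void)
{
	tracked_ipis++;
}

static void guest_sends_ipi(bool hw_ipi_virt)
{
	if (hw_ipi_virt)
		return;        /* hardware posts the IPI directly: no VM-exit,
		                * no emulation path, the hook never fires */
	track_ipi_hook();      /* software delivery path: the hook fires */
}
```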