Re: [PATCH v2 0/9] sched/kvm: Semantics-aware vCPU scheduling for oversubscribed KVM

From: Wanpeng Li

Date: Fri Jun 12 2026 - 10:08:27 EST

Hi Richie,
On Wed, 13 May 2026 at 20:52, Richie Buturla <richie@xxxxxxxxxxxxx> wrote:
>
>
> On 17/04/2026 12:30, Richie Buturla wrote:
> >
> > On 08/04/2026 10:35, Richie Buturla wrote:
> >>
> >> On 01/04/2026 10:34, Wanpeng Li wrote:
> >>> Hi Christian,
> >>> On Thu, 26 Mar 2026 at 22:42, Christian Borntraeger
> >>> <borntraeger@xxxxxxxxxxxxx> wrote:
> >>>> Am 19.12.25 um 04:53 schrieb Wanpeng Li:
> >>>>> From: Wanpeng Li <wanpengli@xxxxxxxxxxx>
> >>>>>
> >>>>> This series addresses long-standing yield_to() inefficiencies in
> >>>>> virtualized environments through two complementary mechanisms: a vCPU
> >>>>> debooster in the scheduler and IPI-aware directed yield in KVM.
> >>>>>
> >>>>> Problem Statement
> >>>>> -----------------
> >>>>>
> >>>>> In overcommitted virtualization scenarios, vCPUs frequently spin on locks
> >>>>> held by other vCPUs that are not currently running. The kernel's
> >>>>> paravirtual spinlock support detects these situations and calls yield_to()
> >>>>> to boost the lock holder, allowing it to run and release the lock.
> >>>>>
> >>>>> However, the current implementation has two critical limitations:
> >>>>>
> >>>>> 1. Scheduler-side limitation:
> >>>>>
> >>>>> yield_to_task_fair() relies solely on set_next_buddy() to provide
> >>>>> preference to the target vCPU. This buddy mechanism only offers
> >>>>> immediate, transient preference. Once the buddy hint expires (typically
> >>>>> after one scheduling decision), the yielding vCPU may preempt the target
> >>>>> again, especially in nested cgroup hierarchies where vruntime domains
> >>>>> differ.
> >>>>>
> >>>>> This creates a ping-pong effect: the lock holder runs briefly, gets
> >>>>> preempted before completing critical sections, and the yielding vCPU
> >>>>> spins again, triggering another futile yield_to() cycle. The overhead
> >>>>> accumulates rapidly in workloads with high lock contention.
> >>>> Wanpeng,
> >>>>
> >>>> late but not forgotten.
> >>>>
> >>>> So Richie Buturla gave this a try on s390 with some variations but still
> >>>> without cgroup support (next step).
> >>>> The numbers look very promising (diag 9c is our yieldto hypercall). With
> >>>> super high overcommitment the benefit shrinks again, but results are still
> >>>> positive. We are probably running into other limits.
> >>>>
> >>>> 2:1 Overcommit Ratio:
> >>>> diag9c calls: 225,804,073 → 213,913,266 (-5.3%)
> >>>> Dbench thrpt (per-run mean): +1.3%
> >>>> Dbench thrpt (per-run median): +0.8%
> >>>> Dbench thrpt (total across runs): +1.3%
> >>>> Dbench thrpt (avg/VM): +1.3%
> >>>>
> >>>> 4:1:
> >>>> diag9c calls: 833,455,152 → 556,597,627 (-33.2%)
> >>>> Dbench thrpt (per-run mean): +7.2%
> >>>> Dbench thrpt (per-run median): +8.5%
> >>>> Dbench thrpt (total across runs): +7.2%
> >>>> Dbench thrpt (avg/VM): +7.2%
> >>>>
> >>>>
> >>>> 6:1:
> >>>> diag9c calls: 967,501,378 → 737,178,419 (-23.8%)
> >>>> Dbench thrpt (per-run mean): +5.1%
> >>>> Dbench thrpt (per-run median): +4.8%
> >>>> Dbench thrpt (total across runs): +5.1%
> >>>> Dbench thrpt (avg/VM): +5.1%
> >>>>
> >>>>
> >>>>
> >>>> 8:1:
> >>>> diag9c calls: 872,165,596 → 653,481,530 (-25.1%)
> >>>> Dbench thrpt (per-run mean): +11.5%
> >>>> Dbench thrpt (per-run median): +11.4%
> >>>> Dbench thrpt (total across runs): +11.5%
> >>>> Dbench thrpt (avg/VM): +11.5%
> >>>>
> >>>> 9:1:
> >>>> diag9c calls: 809,384,976 → 587,597,163 (-27.4%)
> >>>> Dbench thrpt (per-run mean): +4.5%
> >>>> Dbench thrpt (per-run median): +4.0%
> >>>> Dbench thrpt (total across runs): +4.5%
> >>>> Dbench thrpt (avg/VM): +4.5%
> >>>>
> >>>>
> >>>> 10:1:
> >>>> diag9c calls: 711,772,971 → 477,448,374 (-32.9%)
> >>>> Dbench thrpt (per-run mean): +3.6%
> >>>> Dbench thrpt (per-run median): +1.6%
> >>>> Dbench thrpt (total across runs): +3.6%
> >>>> Dbench thrpt (avg/VM): +3.6%
> >>> Thanks Christian, and thanks to Richie for running this on s390. :)
> >>>
> >>> This is very valuable independent data. A few things stand out to me:
> >>>
> >>> - The consistent reduction in diag9c calls across all overcommit
> >>> ratios (up to -33.2% at 4:1) confirms that the directed yield
> >>> improvements are effective at reducing unnecessary yield-to
> >>> hypercalls, not just on x86 but across architectures.
> >>> - The fact that these results are without cgroup support is actually
> >>> informative: it tells us the core yield improvement carries its weight
> >>> on its own, which helps me scope the next revision more tightly.
> >>> - The diminishing-but-still-positive returns at very high overcommit
> >>> (9:1, 10:1) match what I see on x86 as well — other bottlenecks start
> >>> dominating but the mechanism does not regress.
> >>>
> >>> Btw, which kernel version were these results collected on?
> >>>
> >>> Regards,
> >>> Wanpeng
> >>>
> >> Hi Wanpeng,
> >>
> >> I collected these results on a 6.19 kernel - which should also
> >> include the existing fixes for yielding and forfeiting vruntime on
> >> yield that K Prateek mentioned.
> >>
> > Hi Wanpeng. I'm trying out cgroup runs with libvirt but the results
> > seem to vary when I reproduce and need to look into this again so we
> > should not try to base any decisions on the numbers.
> >
> > I'll also rerun on the kernel version you are using (The 6.19-rc1).
> Hi Wanpeng,
>
> I spent some more time benchmarking the scheduler-side changes on s390
> and I think I can now narrow down where the benefit shows up and where
> it does not.

Thanks a lot for spending more time on this and for breaking the
results down by placement. This is exactly the kind of data that
helps separate the scheduler effect from general overcommit noise.

> For context, my test runs have libvirt vms running dbench with the
> number of clients equal to the number of vCPUs, and the workload runs on
> tmpfs so that this is primarily measuring scheduler behavior.
> As far as I can tell, the yield/deboost benefit is constrained to cases
> where the relevant vCPUs are competing on the same runqueue. That makes
> placement the key variable.

Agreed. The scheduler-side part is fundamentally a same-runqueue
optimization. It does not try to pull the target from another rq; it
makes the nominated entity win the local pick once the waiter and the
lock-holder/target are already competing on the same rq.

That also explains why the 1:1 pinning and the matched vCPU:pCPU case
are neutral. With no intra-pool contention there is no local
competition for the scheduler-side hint to resolve, so the path should
be inert rather than provide a measurable gain.
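
To make the same-rq property concrete, here is a toy Python model. It is
only a sketch of the behaviour described above, not the kernel pick path
or the patch itself; Task, RunQueue and nominate_local() are hypothetical
names, and "lowest vruntime wins" is a deliberate simplification of the
real pick.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Task:
    name: str
    vruntime: int  # toy units; lower value is picked first


@dataclass
class RunQueue:
    tasks: List[Task] = field(default_factory=list)
    next_buddy: Optional[Task] = None  # hypothetical local hint

    def pick(self) -> Task:
        # Honour the hint only if the nominated task is queued here.
        if self.next_buddy is not None and self.next_buddy in self.tasks:
            return self.next_buddy
        return min(self.tasks, key=lambda t: t.vruntime)


def nominate_local(waiter_rq: RunQueue, target_rq: RunQueue, target: Task) -> bool:
    """Set the hint only when waiter and target share a runqueue.

    Nothing is pulled or migrated when the runqueues differ.
    """
    if waiter_rq is target_rq:
        waiter_rq.next_buddy = target
        return True
    return False


waiter = Task("vcpu0-waiter", 100)
holder = Task("vcpu1-holder", 500)
other = Task("vcpu2", 200)

# Pooled case: waiter and holder compete on one rq, so the hint matters.
shared = RunQueue([waiter, holder, other])
pick_before = shared.pick().name
applied_shared = nominate_local(shared, shared, holder)
pick_after = shared.pick().name

# Pinned case: holder sits on another rq, so the local pick is unchanged.
rq_a = RunQueue([waiter, other])
rq_b = RunQueue([holder])
applied_split = nominate_local(rq_a, rq_b, holder)
pick_split = rq_a.pick().name

print(pick_before, pick_after, applied_split, pick_split)
```

In the pooled case the pick flips from the waiter to the holder; in the
split case the nomination is a no-op, which is the "inert" behaviour I
would expect under 1:1 pinning.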

>
> In particular:
>
> 1. With explicit 1:1 vCPU:pCPU pinning, I do not see a meaningful benefit.
> For 3 VMs with 16 vCPUs each pinned to 16 pCPUs, the results were:
>
> diag9c calls: 61,384,968 -> 62,994,594 (+2.6%)
> Dbench throughput mean: -0.5%
> Dbench throughput median: -0.3%
>
> That is basically noise from my point of view. This matches the
> expectation that if the lock waiter and lock holder are not sharing an
> rq, the scheduler-side boost/deboost path has little or nothing to act on.
>
> 2. When vCPUs are pooled onto a smaller pCPU set, I can reproduce a benefit.

This is the target case. The diag9c drop is particularly useful: the
~55-68% reduction shows that the mechanism is cutting yield-to churn,
and the few-percent dbench improvement is consistent with that
reduction showing up as useful work once multiple vCPUs of the same VM
contend within the same CPU pool.
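
As a quick check on that range, the reductions can be recomputed from the
diag9c counts in your three pooled runs (quoted just below). The helper
name pct_change() is mine and purely illustrative; the counts are copied
from your mail.

```python
def pct_change(before: int, after: int) -> float:
    """Relative change from before to after, in percent."""
    return (after - before) / before * 100.0


# (diag9c calls without the series, diag9c calls with the series)
pooled_runs = {
    "2 VMs x 16 vCPUs, 8 pCPUs per VM": (62_893_856, 20_033_920),
    "3 VMs x 16 vCPUs, 5 pCPUs per VM": (107_915_379, 35_393_080),
    "5 VMs x 16 vCPUs, 3 pCPUs per VM": (130_986_144, 58_153_006),
}

for label, (before, after) in pooled_runs.items():
    print(f"{label}: {pct_change(before, after):+.1f}%")
```

That gives -68.1%, -67.2% and -55.6%, matching the figures you reported.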

> For 2 VMs with 16 vCPUs each placed on an 8 pCPU pool per VM, I saw:
>
> diag9c calls: 62,893,856 -> 20,033,920 (-68.1%)
> Dbench throughput mean: +4.2%
> Dbench throughput median: +4.0%
>
> For 3 VMs with 16 vCPUs each placed on a 5 pCPU pool per VM, I saw:
>
> diag9c calls: 107,915,379 -> 35,393,080 (-67.2%)
> Dbench throughput mean: +4.4%
> Dbench throughput median: +4.4%
>
> I also saw the same pattern with heavier pooling. For 5 VMs with 16
> vCPUs each placed on a 3 pCPU pool per VM, the results were:
>
> diag9c calls: 130,986,144 -> 58,153,006 (-55.6%)
> Dbench throughput mean: +3.4%
> Dbench throughput median: +3.6%
>
> These are the configurations where I consistently see an improvement in
> reduction of diag9c calls (again our yieldto hypercall) and some
> throughput improvement. This works because the VM is actually
> overcommitted onto its allowed pCPU set, so multiple vCPUs from the same
> VM can contend on the same rq and exercise the mechanism.
>
> 3. If there is no intra-VM overcommit, the effect disappears again.

Right, and that is the expected flip side of the same case: 5 vCPUs on
5 pCPUs has no intra-pool contention, so the path stays inert for the
same reason as strict pinning. The interesting variable is genuinely
the pooling/overcommit onto a smaller pCPU set, not simply "more VMs
on the host", and I think your framing captures that precisely.
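
One rough way to express that variable is the per-VM ratio of vCPUs to
allowed pCPUs. The sketch below applies it to the five configurations in
your mail. Treat the "ratio > 1" test as a first-order heuristic I am
assuming for illustration (by pigeonhole, some rq must then hold two
vCPUs of the same VM when all are runnable), not as something the patch
computes; the function names are hypothetical.

```python
def intra_pool_ratio(vcpus: int, pcpus: int) -> float:
    """vCPUs of one VM divided by the pCPUs that VM may run on."""
    return vcpus / pcpus


def same_rq_contention_forced(vcpus: int, pcpus: int) -> bool:
    """True when at least two vCPUs of the VM must share a runqueue."""
    return vcpus > pcpus


# (label, vCPUs per VM, pCPUs available to that VM)
configs = [
    ("1:1 pinned, 16 on 16", 16, 16),
    ("pooled, 16 on 8", 16, 8),
    ("pooled, 16 on 5", 16, 5),
    ("pooled, 16 on 3", 16, 3),
    ("matched, 5 on 5", 5, 5),
]

for label, vcpus, pcpus in configs:
    ratio = intra_pool_ratio(vcpus, pcpus)
    forced = same_rq_contention_forced(vcpus, pcpus)
    print(f"{label}: ratio {ratio:.2f}, same-rq contention forced: {forced}")
```

Only the three pooled configurations come out above 1.0, and those are
exactly the ones where you measured a diag9c reduction.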

For v3, I reworked the scheduler side substantially. It no longer
applies a vruntime penalty to the yielder; instead it credits bounded
EEVDF lag to the nominated next-buddy and rate-limits only the forced
local reschedule. That removes the LCA/debounce/penalty-tracking
machinery from v2, but the placement property you observed remains the
same: it is useful under same-rq vCPU pooling/overcommit, and neutral
when there is no such local contention. I will make that scope
explicit in the v3 cover letter.
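
To illustrate the shape of that idea (and only the shape), here is a toy
Python sketch of "bounded credit plus rate-limited reschedule". It is not
the v3 code: the class, the field names, the units and all constants are
hypothetical, and lag is modelled as a plain integer where a larger value
means the entity is owed more service.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Entity:
    name: str
    lag: int = 0  # toy units; positive means the entity is owed service


class LagCreditor:
    """Toy model: credit a nominated buddy, never penalise the yielder."""

    def __init__(self, credit: int, lag_cap: int, min_resched_gap: int):
        self.credit = credit                    # hypothetical credit per yield
        self.lag_cap = lag_cap                  # hypothetical upper bound on lag
        self.min_resched_gap = min_resched_gap  # hypothetical rate limit
        self.last_resched: Optional[int] = None

    def credit_buddy(self, buddy: Entity) -> None:
        # Clamp so repeated yields cannot build up unbounded preference.
        # Lag that is already above the cap is left untouched.
        capped = min(buddy.lag + self.credit, self.lag_cap)
        buddy.lag = max(buddy.lag, capped)

    def may_force_resched(self, now: int) -> bool:
        # Rate-limit only the forced local reschedule, not the credit.
        if self.last_resched is not None and now - self.last_resched < self.min_resched_gap:
            return False
        self.last_resched = now
        return True


holder = Entity("lock-holder")
creditor = LagCreditor(credit=3, lag_cap=5, min_resched_gap=10)

lags = []
for _ in range(3):
    creditor.credit_buddy(holder)
    lags.append(holder.lag)

resched = [creditor.may_force_resched(t) for t in (0, 4, 10)]
print(lags, resched)
```

The point of the bound is that three back-to-back yields leave the holder
at the cap rather than at three times the credit, and the rate limit
suppresses the reschedule at t=4 while still allowing it at t=0 and t=10.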

> For 3 VMs with 5 vCPUs on a 5 pCPU pool per VM, the results were:
>
> diag9c calls: 696,548 -> 718,219 (+3.1%)
> Dbench throughput mean: -0.8%
> Dbench throughput median: -0.7%
>
> Again, no meaningful benefit.
>
> So my final takeaway is that on s390 I can only demonstrate a benefit
> when the test setup intentionally causes multiple vCPUs of a VM to share
> runqueues. Plain pinning does not show an effect, and a matched
> vCPU:pCPU configuration such as 5 vCPUs on 5 pCPUs does not either. The
> interesting case is specifically vCPU pooling / overcommit onto a
> smaller pCPU set, not just "more VMs on the host".
>
> I suppose this mechanism does help once the waiter/holder pair can
> actually meet on the same rq. If something similar could somehow target
> useful cross-runqueue cases as well, that would seem like a natural way
> to stretch this benefit further.

Yes, agreed. Extending this to useful cross-runqueue cases would
likely need load-balancing or migration decisions, rather than just
directed yield, and that comes with a separate set of fairness and
cache-affinity trade-offs. I would rather keep that as follow-up work
than fold it into this series.

Thanks again for the careful, well-isolated testing.

Wanpeng