Re: [PATCH v2 0/9] sched/kvm: Semantics-aware vCPU scheduling for oversubscribed KVM
From: Wanpeng Li
Date: Wed Apr 01 2026 - 05:47:21 EST
Hi Christian,
On Thu, 26 Mar 2026 at 22:42, Christian Borntraeger
<borntraeger@xxxxxxxxxxxxx> wrote:
>
> Am 19.12.25 um 04:53 schrieb Wanpeng Li:
> > From: Wanpeng Li <wanpengli@xxxxxxxxxxx>
> >
> > This series addresses long-standing yield_to() inefficiencies in
> > virtualized environments through two complementary mechanisms: a vCPU
> > debooster in the scheduler and IPI-aware directed yield in KVM.
> >
> > Problem Statement
> > -----------------
> >
> > In overcommitted virtualization scenarios, vCPUs frequently spin on locks
> > held by other vCPUs that are not currently running. The kernel's
> > paravirtual spinlock support detects these situations and calls yield_to()
> > to boost the lock holder, allowing it to run and release the lock.
> >
> > However, the current implementation has two critical limitations:
> >
> > 1. Scheduler-side limitation:
> >
> > yield_to_task_fair() relies solely on set_next_buddy() to provide
> > preference to the target vCPU. This buddy mechanism only offers
> > immediate, transient preference. Once the buddy hint expires (typically
> > after one scheduling decision), the yielding vCPU may preempt the target
> > again, especially in nested cgroup hierarchies where vruntime domains
> > differ.
> >
> > This creates a ping-pong effect: the lock holder runs briefly, gets
> > preempted before completing critical sections, and the yielding vCPU
> > spins again, triggering another futile yield_to() cycle. The overhead
> > accumulates rapidly in workloads with high lock contention.
>
> Wanpeng,
>
> late but not forgotten.
>
> So Richie Buturla gave this a try on s390 with some variations but still
> without cgroup support (next step).
> The numbers look very promising (diag 9c is our yieldto hypercall). With
> super high overcommitment the benefit shrinks again, but results are still
> positive. We are probably running into other limits.
>
> 2:1 Overcommit Ratio:
> diag9c calls: 225,804,073 → 213,913,266 (-5.3%)
> Dbench thrpt (per-run mean): +1.3%
> Dbench thrpt (per-run median): +0.8%
> Dbench thrpt (total across runs): +1.3%
> Dbench thrpt (avg/VM): +1.3%
>
> 4:1:
> diag9c calls: 833,455,152 → 556,597,627 (-33.2%)
> Dbench thrpt (per-run mean): +7.2%
> Dbench thrpt (per-run median): +8.5%
> Dbench thrpt (total across runs): +7.2%
> Dbench thrpt (avg/VM): +7.2%
>
> 6:1:
> diag9c calls: 967,501,378 → 737,178,419 (-23.8%)
> Dbench thrpt (per-run mean): +5.1%
> Dbench thrpt (per-run median): +4.8%
> Dbench thrpt (total across runs): +5.1%
> Dbench thrpt (avg/VM): +5.1%
>
> 8:1:
> diag9c calls: 872,165,596 → 653,481,530 (-25.1%)
> Dbench thrpt (per-run mean): +11.5%
> Dbench thrpt (per-run median): +11.4%
> Dbench thrpt (total across runs): +11.5%
> Dbench thrpt (avg/VM): +11.5%
>
> 9:1:
> diag9c calls: 809,384,976 → 587,597,163 (-27.4%)
> Dbench thrpt (per-run mean): +4.5%
> Dbench thrpt (per-run median): +4.0%
> Dbench thrpt (total across runs): +4.5%
> Dbench thrpt (avg/VM): +4.5%
>
> 10:1:
> diag9c calls: 711,772,971 → 477,448,374 (-32.9%)
> Dbench thrpt (per-run mean): +3.6%
> Dbench thrpt (per-run median): +1.6%
> Dbench thrpt (total across runs): +3.6%
> Dbench thrpt (avg/VM): +3.6%
Thanks Christian, and thanks to Richie for running this on s390. :)
This is very valuable independent data. A few things stand out to me:
- The consistent reduction in diag9c calls across all overcommit
ratios (up to -33.2% at 4:1) confirms that the directed yield
improvements are effective at reducing unnecessary yield-to
hypercalls, not just on x86 but across architectures.
- The fact that these results were collected without cgroup support is
actually informative: it tells us the core directed-yield improvement
pulls its own weight, which helps me scope the cgroup-aware work in
the next revision more tightly.
- The diminishing-but-still-positive returns at very high overcommit
(9:1, 10:1) match what I see on x86 as well: other bottlenecks start
to dominate, but the mechanism does not regress.
Btw, which kernel version were these results collected on?
Regards,
Wanpeng