Re: [PATCH v2 0/9] sched/kvm: Semantics-aware vCPU scheduling for oversubscribed KVM

From: Richie Buturla

Date: Fri Apr 17 2026 - 07:35:24 EST



On 08/04/2026 10:35, Richie Buturla wrote:

On 01/04/2026 10:34, Wanpeng Li wrote:
Hi Christian,
On Thu, 26 Mar 2026 at 22:42, Christian Borntraeger
<borntraeger@xxxxxxxxxxxxx> wrote:
On 19.12.25 at 04:53, Wanpeng Li wrote:
From: Wanpeng Li <wanpengli@xxxxxxxxxxx>

This series addresses long-standing yield_to() inefficiencies in
virtualized environments through two complementary mechanisms: a vCPU
debooster in the scheduler and IPI-aware directed yield in KVM.

Problem Statement
-----------------

In overcommitted virtualization scenarios, vCPUs frequently spin on locks
held by other vCPUs that are not currently running. The kernel's
paravirtual spinlock support detects these situations and calls yield_to()
to boost the lock holder, allowing it to run and release the lock.

However, the current implementation has two critical limitations:

1. Scheduler-side limitation:

     yield_to_task_fair() relies solely on set_next_buddy() to provide
     preference to the target vCPU. This buddy mechanism only offers
     immediate, transient preference. Once the buddy hint expires (typically
     after one scheduling decision), the yielding vCPU may preempt the target
     again, especially in nested cgroup hierarchies where vruntime domains
     differ.

     This creates a ping-pong effect: the lock holder runs briefly, gets
     preempted before completing critical sections, and the yielding vCPU
     spins again, triggering another futile yield_to() cycle. The overhead
     accumulates rapidly in workloads with high lock contention.
Wanpeng,

late but not forgotten.

So Richie Buturla gave this a try on s390 with some variations, but
still without cgroup support (next step).
The numbers look very promising (diag 9c is our yield-to hypercall).
With super-high overcommitment the benefit shrinks again, but the
results are still positive. We are probably running into other limits.

2:1 Overcommit Ratio:
diag9c calls:                       225,804,073 → 213,913,266 (-5.3%)
Dbench thrpt (per-run mean):        +1.3%
Dbench thrpt (per-run median):      +0.8%
Dbench thrpt (total across runs):   +1.3%
Dbench thrpt (avg/VM):              +1.3%

4:1 Overcommit Ratio:
diag9c calls:                       833,455,152 → 556,597,627 (-33.2%)
Dbench thrpt (per-run mean):        +7.2%
Dbench thrpt (per-run median):      +8.5%
Dbench thrpt (total across runs):   +7.2%
Dbench thrpt (avg/VM):              +7.2%

6:1 Overcommit Ratio:
diag9c calls:                       967,501,378 → 737,178,419 (-23.8%)
Dbench thrpt (per-run mean):        +5.1%
Dbench thrpt (per-run median):      +4.8%
Dbench thrpt (total across runs):   +5.1%
Dbench thrpt (avg/VM):              +5.1%

8:1 Overcommit Ratio:
diag9c calls:                       872,165,596 → 653,481,530 (-25.1%)
Dbench thrpt (per-run mean):        +11.5%
Dbench thrpt (per-run median):      +11.4%
Dbench thrpt (total across runs):   +11.5%
Dbench thrpt (avg/VM):              +11.5%

9:1 Overcommit Ratio:
diag9c calls:                       809,384,976 → 587,597,163 (-27.4%)
Dbench thrpt (per-run mean):        +4.5%
Dbench thrpt (per-run median):      +4.0%
Dbench thrpt (total across runs):   +4.5%
Dbench thrpt (avg/VM):              +4.5%

10:1 Overcommit Ratio:
diag9c calls:                       711,772,971 → 477,448,374 (-32.9%)
Dbench thrpt (per-run mean):        +3.6%
Dbench thrpt (per-run median):      +1.6%
Dbench thrpt (total across runs):   +3.6%
Dbench thrpt (avg/VM):              +3.6%
Thanks Christian, and thanks to Richie for running this on s390. :)

This is very valuable independent data. A few things stand out to me:

- The consistent reduction in diag9c calls across all overcommit
  ratios (up to -33.2% at 4:1) confirms that the directed-yield
  improvements are effective at cutting unnecessary yield-to
  hypercalls, not just on x86 but across architectures.
- The fact that these results were collected without cgroup support
  is itself informative: it tells us the core yield improvement
  stands on its own, which helps me scope the next revision more
  tightly.
- The diminishing-but-still-positive returns at very high overcommit
  (9:1, 10:1) match what I see on x86 as well: other bottlenecks
  start dominating, but the mechanism does not regress.

Btw, which kernel version were these results collected on?

Regards,
Wanpeng

Hi Wanpeng,

I collected these results on a 6.19 kernel, which should also include
the existing fixes for yielding and for forfeiting vruntime on yield
that K Prateek mentioned.

I'm trying out cgroup runs with libvirt, but the results vary when I
reproduce them and I need to look into this again, so we should not
base any decisions on those numbers yet.

I'll also rerun on the kernel version you are using (6.19-rc1).