Re: [PATCH 00/10] sched/kvm: Semantics-aware vCPU scheduling for oversubscribed KVM
From: Christian Borntraeger
Date: Mon Nov 10 2025 - 07:06:51 EST
On 10.11.25 04:32, Wanpeng Li wrote:
From: Wanpeng Li <wanpengli@xxxxxxxxxxx>
This series addresses long-standing yield_to() inefficiencies in
virtualized environments through two complementary mechanisms: a vCPU
debooster in the scheduler and IPI-aware directed yield in KVM.
Problem Statement
-----------------
In overcommitted virtualization scenarios, vCPUs frequently spin on locks
held by other vCPUs that are not currently running. The kernel's
paravirtual spinlock support detects these situations and calls yield_to()
to boost the lock holder, allowing it to run and release the lock.
However, the current implementation has two critical limitations:
1. Scheduler-side limitation:
yield_to_task_fair() relies solely on set_next_buddy() to provide
preference to the target vCPU. This buddy mechanism only offers
immediate, transient preference. Once the buddy hint expires (typically
after one scheduling decision), the yielding vCPU may preempt the target
again, especially in nested cgroup hierarchies where vruntime domains
differ.
This creates a ping-pong effect: the lock holder runs briefly, gets
preempted before completing critical sections, and the yielding vCPU
spins again, triggering another futile yield_to() cycle. The overhead
accumulates rapidly in workloads with high lock contention.
I can certainly confirm that on s390 we do see that yield_to does not always
work as expected. Our spinlock code is lock-holder aware, so our KVM always
yields correctly, but often enough the hint is ignored or bounced back as you
describe. So I am certainly interested in that part.
I need to look more closely into the other part.