Re: [PATCH 00/10] sched/kvm: Semantics-aware vCPU scheduling for oversubscribed KVM
From: Christian Borntraeger
Date: Tue Nov 18 2025 - 03:18:22 EST
On 12.11.25 at 06:01, Wanpeng Li wrote:
Hi Christian,
On Mon, 10 Nov 2025 at 20:02, Christian Borntraeger
<borntraeger@xxxxxxxxxxxxx> wrote:
On 10.11.25 at 04:32, Wanpeng Li wrote:
From: Wanpeng Li <wanpengli@xxxxxxxxxxx>
This series addresses long-standing yield_to() inefficiencies in
virtualized environments through two complementary mechanisms: a vCPU
debooster in the scheduler and IPI-aware directed yield in KVM.
Problem Statement
-----------------
In overcommitted virtualization scenarios, vCPUs frequently spin on locks
held by other vCPUs that are not currently running. The kernel's
paravirtual spinlock support detects these situations and calls yield_to()
to boost the lock holder, allowing it to run and release the lock.
However, the current implementation has two critical limitations:
1. Scheduler-side limitation:
yield_to_task_fair() relies solely on set_next_buddy() to provide
preference to the target vCPU. This buddy mechanism only offers
immediate, transient preference. Once the buddy hint expires (typically
after one scheduling decision), the yielding vCPU may preempt the target
again, especially in nested cgroup hierarchies where vruntime domains
differ.
This creates a ping-pong effect: the lock holder runs briefly, gets
preempted before completing critical sections, and the yielding vCPU
spins again, triggering another futile yield_to() cycle. The overhead
accumulates rapidly in workloads with high lock contention.
I can certainly confirm that on s390 we do see that yield_to does not always
work as expected. Our spinlock code is lock-holder aware, so our KVM always yields
correctly, but often enough the hint is ignored or bounced back as you describe.
So I am certainly interested in that part.
I need to look more closely into the other part.
Thanks for the confirmation and interest! It's valuable to hear that
s390 observes similar yield_to() behavior where the hint gets ignored
or bounced back despite correct lock holder identification.
Since your spinlock code is already lock-holder-aware and KVM yields
to the correct target, the scheduler-side improvements (patches 1-5)
should directly address the ping-pong issue you're seeing. The
vruntime penalties are designed to sustain the preference beyond the
transient buddy hint, which should reduce the bouncing effect.
So we will play a bit with the first patches and check for performance improvements.
I am curious: I did a quick unit test with 2 CPUs ping-ponging on a counter, and
I do see more yield hypercalls than the counter value with that test case (as before).
Something like 40060000 yields instead of the 4000000 a perfect ping pong would give.
If I comment out your rate-limit code I hit exactly the 4000000.
Can you maybe outline a bit why the rate limit is important and needed?