Re: [PATCH 00/10] sched/kvm: Semantics-aware vCPU scheduling for oversubscribed KVM

From: K Prateek Nayak

Date: Tue Nov 11 2025 - 01:28:43 EST


Hello Wanpeng,

I haven't looked at the entire series or the penalty-calculation math,
but I have a few questions after reading the cover letter.

On 11/10/2025 9:02 AM, Wanpeng Li wrote:
> From: Wanpeng Li <wanpengli@xxxxxxxxxxx>
>
> This series addresses long-standing yield_to() inefficiencies in
> virtualized environments through two complementary mechanisms: a vCPU
> debooster in the scheduler and IPI-aware directed yield in KVM.
>
> Problem Statement
> -----------------
>
> In overcommitted virtualization scenarios, vCPUs frequently spin on locks
> held by other vCPUs that are not currently running. The kernel's
> paravirtual spinlock support detects these situations and calls yield_to()
> to boost the lock holder, allowing it to run and release the lock.
>
> However, the current implementation has two critical limitations:
>
> 1. Scheduler-side limitation:
>
> yield_to_task_fair() relies solely on set_next_buddy() to provide
> preference to the target vCPU. This buddy mechanism only offers
> immediate, transient preference. Once the buddy hint expires (typically
> after one scheduling decision), the yielding vCPU may preempt the target
> again, especially in nested cgroup hierarchies where vruntime domains
> differ.

So what you are saying is that there are configurations out there where
vCPUs of the same guest are put in different cgroups? Why? Does the use
case warrant enabling the cpu controller for the subtree? Are you
running with the "NEXT_BUDDY" sched feat enabled?

If they are in the same cgroup, the recent optimizations/fixes to
yield_task_fair() in queue:sched/core should help remedy some of the
problems you might be seeing.

For multiple cgroups, perhaps you can extend yield_task_fair() to do:

( Only build and boot tested on top of
git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git sched/core
at commit f82a0f91493f "sched/deadline: Minor cleanup in
select_task_rq_dl()" )

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b4617d631549..87560f5a18b3 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8962,10 +8962,28 @@ static void yield_task_fair(struct rq *rq)
 	 * which yields immediately again; without the condition the vruntime
 	 * ends up quickly running away.
 	 */
-	if (entity_eligible(cfs_rq, se)) {
+	do {
+		cfs_rq = cfs_rq_of(se);
+
+		/*
+		 * Another entity will be selected at next pick.
+		 * Single entity on cfs_rq can never be ineligible.
+		 */
+		if (!entity_eligible(cfs_rq, se))
+			break;
+
 		se->vruntime = se->deadline;
 		se->deadline += calc_delta_fair(se->slice, se);
-	}
+
+		/*
+		 * If we have more than one runnable task queued below
+		 * this cfs_rq, the next pick will likely go for a
+		 * different entity now that we have advanced the
+		 * vruntime and the deadline of the running entity.
+		 */
+		if (cfs_rq->h_nr_runnable > 1)
+			break;
+	} while ((se = parent_entity(se)));
 }
 
 static bool yield_to_task_fair(struct rq *rq, struct task_struct *p)
---

With that, I'm fairly sure we won't pick the hierarchy that did a
yield_to(), unless there is a large discrepancy in the weights and
advancing se->vruntime to se->deadline once isn't enough to make the
entity ineligible, so it has to be done multiple times (at which point
that cgroup hierarchy needs to be studied).

As for the problem that the NEXT_BUDDY hint is used only once, you can
perhaps reintroduce LAST_BUDDY, which does a set_next_buddy() for the
"prev" task during schedule?
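For reference, the old pre-EEVDF LAST_BUDDY logic looked roughly like
the sketch below; this is from memory and not against any current tree,
so treat the helper placement as approximate:

```c
/*
 * Rough sketch of a LAST_BUDDY-style hint: remember the entity being
 * switched away from and mildly prefer it at the next pick (the old
 * code kept this in cfs_rq->last and consulted it in pick_next_entity()).
 */
static void set_last_buddy(struct sched_entity *se)
{
	for_each_sched_entity(se)
		cfs_rq_of(se)->last = se;
}

/* Called for "prev" when it is put back during schedule(). */
static void hint_last_buddy(struct sched_entity *prev)
{
	if (sched_feat(LAST_BUDDY) && prev->on_rq)
		set_last_buddy(prev);
}
```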

>
> This creates a ping-pong effect: the lock holder runs briefly, gets
> preempted before completing critical sections, and the yielding vCPU
> spins again, triggering another futile yield_to() cycle. The overhead
> accumulates rapidly in workloads with high lock contention.
>
> 2. KVM-side limitation:
>
> kvm_vcpu_on_spin() attempts to identify which vCPU to yield to through
> directed yield candidate selection. However, it lacks awareness of IPI
> communication patterns. When a vCPU sends an IPI and spins waiting for
> a response (common in inter-processor synchronization), the current
> heuristics often fail to identify the IPI receiver as the yield target.

Can't that be solved on the KVM end? Also, shouldn't Patch 6 be at the
top of the series with a "Fixes:" tag?
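If it can, one direction (purely a sketch; "last_ipi_target" is an
invented per-vCPU field, not existing KVM code) would be to record the
destination vCPU when the guest sends an IPI and then consult it during
directed-yield candidate selection:

```c
/*
 * Hypothetical sketch only: record the IPI receiver at send time and
 * prefer it as the kvm_vcpu_on_spin() target. "last_ipi_target" does
 * not exist in KVM today; candidate filtering is heavily simplified.
 */
static struct kvm_vcpu *dy_ipi_candidate(struct kvm_vcpu *me)
{
	struct kvm_vcpu *target = READ_ONCE(me->last_ipi_target);

	/* Only worth boosting a receiver that was preempted while runnable. */
	if (target && target != me && READ_ONCE(target->preempted))
		return target;

	return NULL;
}
```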

>
> Instead, the code may boost an unrelated vCPU based on coarse-grained
> preemption state, missing opportunities to accelerate actual IPI
> response handling. This is particularly problematic when the IPI receiver
> is runnable but not scheduled, as lock-holder-detection logic doesn't
> capture the IPI dependency relationship.

Are you saying the yield_to() is called with an incorrect target vCPU?

>
> Combined, these issues cause excessive lock hold times, cache thrashing,
> and degraded throughput in overcommitted environments, particularly
> affecting workloads with fine-grained synchronization patterns.
>
--
Thanks and Regards,
Prateek