Re: [PATCH v3 00/10] sched/fair, KVM: Semantics-aware directed yield for oversubscribed KVM

From: K Prateek Nayak

Date: Fri Jun 12 2026 - 01:17:25 EST


Hello Wanpeng,

On 6/12/2026 7:03 AM, Wanpeng Li wrote:
> Part 1: Scheduler EEVDF lag credit (patches 1-5)
>
> Rather than penalizing the yielding vCPU, credit the nominated target so
> pick_eevdf() honors the buddy hint.
>
> The mechanism is EEVDF-native and cgroup-hierarchy-aware:
>
> - Credit bounded EEVDF lag to the nominated next-buddy so pick_eevdf()'s
> PICK_BUDDY branch returns it. Walk the same ancestor chain that
> set_next_buddy() nominated and credit each not-yet-eligible level, so the
> hint is not dropped at the first ineligible group entity.

I believe Peter is planning to flatten the pick by v7.3, so I would
suggest testing against the flattened pick series [1], which is
available in the sched/flat branch of Peter's tree [2].

That should get rid of the need to traverse the hierarchy and
should solve one part of your problem of yielding to vCPUs across
different cgroups.

[1] https://lore.kernel.org/lkml/20260605105513.354837583@xxxxxxxxxxxxx/
[2] https://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git/log/?h=sched/flat

>
> - Credit to a small positive-vlag margin, not merely the vlag = 0
> eligibility boundary, so the target stays eligible across several
> scheduling decisions rather than a single pick. The margin scales with
> runqueue depth and is clamped to entity_lag()'s legal positive-lag bound,
> preserving EEVDF fairness.
>
> - Handle both the off-tree current entity (shifted in place, carrying any
> vprot window)


Is it even possible to yield to an off-tree entity? The core bits in
syscalls.c already bail out for:

	if (task_on_cpu(p_rq, p) || !task_is_running(p))
		return;

and the early bits in yield_to_task_fair() bail out for "!se->on_rq",
which makes me wonder when we would ever have p->se as cfs_rq->curr
while holding both the p->pi_lock and the rq_lock.

The task must be preempted but still on the rq for yield_to() to work,
no?
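
To make the question concrete, here is a toy userspace model of the two
bail-outs quoted above. It is a sketch, not kernel code: struct toy_task
and all toy_* names are hypothetical, and it assumes that a task's own
entity is cfs_rq->curr only while the task is on a CPU. It says nothing
about group entities higher up the hierarchy.

```c
#include <stdbool.h>

/*
 * Hypothetical stand-in for the task state the two checks look at.
 * This is not the kernel's task_struct.
 */
struct toy_task {
	bool on_cpu;	/* models task_on_cpu(p_rq, p) */
	bool running;	/* models task_is_running(p) */
	bool on_rq;	/* models p->se.on_rq */
};

/* Models the core bail-out: target on a CPU or not runnable. */
static inline bool toy_core_allows(const struct toy_task *p)
{
	return !(p->on_cpu || !p->running);
}

/* Models the early "!se->on_rq" bail-out in yield_to_task_fair(). */
static inline bool toy_fair_allows(const struct toy_task *p)
{
	return p->on_rq;
}

/* True only if the target survives both checks. */
static inline bool toy_reaches_credit_path(const struct toy_task *p)
{
	return toy_core_allows(p) && toy_fair_allows(p);
}

/*
 * Assumption of this model only: the task's own entity is the
 * off-tree current entity exactly when the task is on a CPU.
 */
static inline bool toy_is_curr(const struct toy_task *p)
{
	return p->on_cpu && p->on_rq;
}
```

Under that assumption, no state satisfies both toy_is_curr() and
toy_reaches_credit_path(): only a preempted, still-queued target gets
past the two checks.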

> and a queued (on-tree) entity (repositioned via the
> canonical place_entity()-paired requeue used by requeue_delayed_entity(),
> keeping sum_w_vruntime consistent with entity_key()).
>
> - Force a local reschedule at the end of the credit path: cancel
> RUN_TO_PARITY slice protection along the yielder's sched_entity chain and
> resched_curr() the local CPU. Only this forced preemption is rate
> limited (once per 6ms per rq) to avoid excessive forced preemption on
> PLE-heavy guests; the lag credit itself runs on every directed yield.
>
> The mechanism is gated by SCHED_FEAT(YIELD_TO_LAG_CREDIT) (default on).
> With the feature off, yield_to_task_fair() keeps the existing forfeit-only
> behavior.

[...]

> The gains stem from three factors:
>
> 1. Lock holders receive sustained CPU time to complete critical sections,
> reducing lock hold duration and cascading contention.
>
> 2. IPI receivers are scheduled promptly when senders spin, reducing IPI
> response latency and wasted spin cycles.

Looking at kvm_smp_send_call_func_ipi() in arch/x86/kernel/kvm.c, there
can be multiple destination vCPUs for the IPI. Why does it make sense
for the sender to yield almost all of its time to the first vCPU in the
mask then?

And do all IPIs have to spin? Can't they be async too?
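
As a rough illustration of the multi-destination concern, the sketch
below models an IPI destination mask as a 64-bit word and counts how
many destinations get no boost if the sender yields only to the first
set bit. The toy_* helpers are hypothetical and the "yield to the first
vCPU" policy is taken from the question above, not verified against the
series.

```c
#include <stdint.h>

/* Lowest-numbered vCPU in the mask, or -1 if the mask is empty. */
static inline int toy_first_dest(uint64_t mask)
{
	for (int cpu = 0; cpu < 64; cpu++)
		if (mask & (UINT64_C(1) << cpu))
			return cpu;
	return -1;
}

/* Number of destination vCPUs (set bits) in the mask. */
static inline int toy_nr_dests(uint64_t mask)
{
	int n = 0;

	/* Each iteration clears the lowest set bit. */
	for (; mask; mask &= mask - 1)
		n++;
	return n;
}

/* Destinations left without a boost when only the first one is yielded to. */
static inline int toy_unboosted_dests(uint64_t mask)
{
	int n = toy_nr_dests(mask);

	return n ? n - 1 : 0;
}
```

For a mask with vCPUs 2, 3 and 5 set, toy_first_dest() picks vCPU 2 and
toy_unboosted_dests() reports two destinations that the sender may still
end up waiting on.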

>
> 3. Reduced context switching between lock waiters and holders improves
> cache utilization.
>

[...]

>
> Patch Organization
> ------------------
>
> Patches 1-5: Scheduler EEVDF lag credit
>
> Patch 1: Add the eevdf_credit_entity_vlag() primitive and the
> YIELD_TO_LAG_CREDIT feature. Handles the off-tree current
> entity and has no functional effect on its own.
>
> Patch 2: Credit to a persistent, queue-depth-scaled positive-vlag
> margin, clamped to entity_lag()'s legal bound.
>
> Patch 3: Extend the primitive to a queued (on-tree) entity via the
> canonical place_entity()-paired requeue.
>
> Patch 4: Wire the credit walk into yield_to_task_fair(), crediting each
> level of the nominated ancestor chain.
>
> Patch 5: Force a local reschedule (cancel RUN_TO_PARITY slice protection
> and resched_curr()) so the credited buddy can be selected.
> Activation patch; rate-limits only the forced preemption.

I don't know if it is just me, but this structure made the series
insanely difficult to review: functions are introduced unused and their
callers only show up in Patch 4, so it is hard to understand how it all
works before then.

All of this will require rework with the flattened pick, but I would
suggest adding the simple lag movement bits first and adding the
eevdf_persistent_margin() magic later on top.

--
Thanks and Regards,
Prateek