Re: [PATCH v2 00/17] sched/paravirt: Introduce cpu_preferred_mask and steal-driven vCPU backoff

From: Hillf Danton

Date: Wed Apr 08 2026 - 06:24:22 EST


On Wed, 8 Apr 2026 00:49:33 +0530 Shrikanth Hegde wrote:
> In virtualized environments there is often vCPU overcommit, i.e. the sum
> of CPUs across all guests (virtual CPUs, aka vCPUs) exceeds the number of
> underlying physical CPUs (managed by the host, aka pCPUs).
>
> When many guests ask for CPU at the same time, the host/hypervisor cannot
> satisfy that demand and has to preempt one vCPU to run another. If the
> guests coordinate and ask for less CPU overall, that reduces the number
> of runnable vCPU threads on the host, and vCPU preemption goes down.
>
> Steal time is an indication of the underlying contention. If the guests
> reduce their vCPU demand proportionally to it, the desired outcome is
> achieved.
>
> An added advantage is reduced lock-holder preemption. A vCPU may be
> holding a spinlock and still get preempted. Such cases become rarer
> since there is less vCPU preemption, and the lock holder runs to
> completion since it has disabled preemption in the guest. A workload
> could run with time-slice extension to reduce lock-holder preemption for
> userspace locks, and this series could help reduce lock-holder
> preemption even for kernel-space locks caused by vCPU preemption.
>
> Currently there is no infrastructure in the scheduler to move tasks away
> from some CPUs without breaking userspace affinities. CPU hotplug or
> isolated cpusets can move tasks off some CPUs at runtime, but if a task
> is affined to specific CPUs, taking those CPUs away resets its affinity
> list. That breaks the user-set affinities, and since the move is driven
> by the scheduler rather than by the user, that is not acceptable. Hence
> the need for new, preferably lightweight, infrastructure.
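
If I read the affinity handling right, the point is that the preferred
mask is consulted at CPU-selection time rather than written into the
task's affinity. A minimal userspace model of that (Python; the names,
the 8-CPU layout, and the empty-intersection fallback rule are my
illustrative assumptions, not code from the series):

```python
# Userspace model (not kernel code) of why a separate preferred mask
# preserves user affinities where hotplug/cpuset would reset them.

cpu_online_mask = set(range(8))
cpu_preferred_mask = set(cpu_online_mask)   # starts as all online CPUs

def task_effective_cpus(task_affinity):
    """CPUs the scheduler would actually pick from: the user-set
    affinity intersected with the preferred mask. The user-set
    affinity itself is never modified."""
    eff = task_affinity & cpu_preferred_mask
    # If the intersection is empty, fall back to the user affinity
    # so a task is never left with nowhere to run.
    return eff if eff else set(task_affinity)

user_affinity = {2, 3}                      # e.g. set via sched_setaffinity()
cpu_preferred_mask -= {3}                   # monitor drops CPU 3
print(task_effective_cpus(user_affinity))   # {2}; user_affinity stays {2, 3}
```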
>
> Core idea is:
> - Maintain a set of CPUs which can be used by the workload, denoted as
> cpu_preferred_mask.
> - Periodically compute the steal time. If the steal time is above/below
> the thresholds, reduce/increase the preferred CPUs accordingly.
> - If a CPU is marked as non-preferred, push the task running on it off
> when possible.
> - Use this CPU state in wakeup and load balancing to ensure tasks run
> within preferred CPUs.
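
The periodic part of the core idea above can be sketched like so (a
userspace Python model, not the kernel implementation; the SMT width,
thresholds, and pick-the-highest-core victim policy are all illustrative
assumptions of mine):

```python
# Rough model of one monitor period: compare steal% against thresholds
# and grow/shrink the preferred set by one core at a time.

SMT = 2                      # CPUs per core, assuming an SMT-2 system
NR_CPUS = 8
HIGH, LOW = 20.0, 10.0       # steal% thresholds (illustrative values)

preferred = set(range(NR_CPUS))

def core_of(cpu):
    return cpu // SMT

def steal_tick(steal_pct):
    """One monitor period: adjust the preferred set by one core."""
    global preferred
    if steal_pct > HIGH and len(preferred) > SMT:
        # Drop the highest-numbered preferred core; tasks still on its
        # CPUs would then be pushed off in sched_tick.
        victim = max(core_of(c) for c in preferred)
        preferred -= {c for c in range(NR_CPUS) if core_of(c) == victim}
    elif steal_pct < LOW and len(preferred) < NR_CPUS:
        # Contention is low: bring one dropped core back.
        missing = sorted(set(range(NR_CPUS)) - preferred)
        preferred |= set(missing[:SMT])
```

With the values above, two periods of 25% steal shrink the mask from 8 to
4 CPUs, and one quiet period grows it back to 6.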
>
> For the host kernel there is no steal time, so its preferred CPUs never
> change; the series therefore affects only guest kernels.
>
Changes are added to the guest in order to detect whether a pCPU is
overloaded, and if that is the case (I mean, it is a layer violation),
why not ask the pCPU governor, the hypervisor, to monitor the load on the
pCPUs and migrate vCPUs back and forth as necessary?

> The current series implements a simple steal time monitor, which
> reduces/increases the number of preferred cores by 1 depending on the
> steal time. It also implements a very simple method to avoid
> oscillations. If there is a need for more complex mechanisms, doing them
> via steal time governors may be an idea. One needs to enable the
> STEAL_MONITOR feature to see the steal time values being processed and
> the preferred CPUs being set accordingly. On most systems, where there
> is no steal time, the preferred CPUs will be the same as the online CPUs.
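
The "mark the direction of steal values" patch presumably implements the
oscillation avoidance mentioned above. One simple scheme of that shape
(my guess, not necessarily what the series does: act immediately when
continuing in the same direction, but require a reversal to be requested
in two consecutive periods):

```python
# Hypothetical anti-oscillation state machine for the steal monitor.
# decide() returns -1 (shrink preferred CPUs), +1 (grow), or 0 (no-op).

class StealMonitor:
    def __init__(self, high, low):
        self.high, self.low = high, low
        self.last_dir = 0          # -1 shrink, +1 grow, 0 none yet
        self.pending = 0           # direction waiting for confirmation

    def decide(self, steal):
        want = -1 if steal > self.high else (+1 if steal < self.low else 0)
        if want == 0:
            self.pending = 0
            return 0
        if want == self.last_dir or self.last_dir == 0:
            # Same direction as before (or first decision): act at once.
            self.last_dir, self.pending = want, 0
            return want
        if self.pending == want:
            # Reversal confirmed by a second consecutive request.
            self.last_dir, self.pending = want, 0
            return want
        # First request to reverse direction: hold off one period.
        self.pending = want
        return 0
```

So a single quiet reading right after a run of high steal does not flip
the mask back; the low reading has to persist for two periods first.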
>
> I will attach the irqbalance patch which detects changes in this mask
> and re-adjusts the irq affinities. The series doesn't address the
> irqbalance=n case, on the assumption that many distros enable irqbalance
> by default.
>
> Discussion at LPC 2025:
> https://www.youtube.com/watch?v=sZKpHVUUy1g
>
> *** Please provide your suggestions and comments ***
>
> =====================================================================
> Patch Layout:
> PATCH 01: Remove stale schedstats. Independent of the series.
> PATCH 02-04: Introduce cpu_preferred_mask.
> PATCH 05-09: Make scheduler aware of this mask.
> PATCH 10: Push the current task in sched_tick if cpu is non-preferred.
> PATCH 11: Add a new schedstat.
> PATCH 12: Add a new sched feature: STEAL_MONITOR
> PATCH 13-17: Periodically calculate steal time and take appropriate
> action.
>
> ======================================================================
> Performance Numbers:
> baseline: tip/master at 8a5f70eb7e4f (Merge branch into tip/master: 'x86/tdx')
>
> On PowerPC: PowerVM hypervisor:
> +++++++++
> Daytrader
> +++++++++
> It is a database workload which simulates live stock trading.
> There are two VMs; the same workload is run in both VMs at the same
> time. VM1 is bigger than VM2.
>
> Note: VM1 sees 20% steal time, and VM2 sees 10% steal time with
> baseline.
>
>
> (with series: STEAL_MONITOR=y and Default debug steal_mon values)
> On VM1:
> baseline with_series
> Throughput 1x 1.3x
> On VM2:
> baseline with_series
> Throughput 1x 1.1x
>
>
> (with series: STEAL_MONITOR=y and Period 100, High 200, Low 100)
> On VM1:
> baseline with_series
> Throughput: 1x 1.45x
> On VM2:
> baseline with_series
> Throughput: 1x 1.13x
>
> Verdict: Shows good improvement with the default values, and even
> better results when the debug knobs are tuned.
>
> +++++++++
> Hackbench
> +++++++++
> (with series: STEAL_MONITOR=y and Period 100, High 200, Low 100)
> (runtime in seconds; lower is better)
> On VM1:
> baseline with_series
> 10 groups 10.3 8.5
> 30 groups 40.8 25.5
> 60 groups 77.2 47.8
>
> on VM2:
> baseline with_series
> 10 groups 8.4 7.5
> 30 groups 25.3 19.8
> 60 groups 41.7 36.3
>
> Verdict: With tuned values, shows very good improvement.
>
> ==========================================================================
> Since v1:
> - A new name - Preferred CPUs and cpu_preferred_mask
> I had initially used the name "Usable CPUs", but this seemed better.
> I considered pv_preferred too, but dropped it as it could be too long.
>
> - Arch-independent code. Everything happens in the scheduler. Steal time
> is a generic construct, and this helps avoid each architecture doing
> more or less the same thing. Dropped the powerpc code.
>
> - Removed hacks around wakeups. Made it part of available_idle_cpu,
> which takes care of many of the wakeup decisions; same for the RT code.
>
> - Implemented a work function to calculate the steal times and enforce
> the policy decisions. This ensures sched_tick doesn't suffer any major
> latency.
>
> - Steal time computation is gated by the sched feature STEAL_MONITOR to
> avoid any overhead on systems which don't have vCPU overcommit. The
> feature is disabled by default.
>
> - CPU_CAPACITY=1 was not considered since one would need the state of
> all CPUs which have this special value, and computing that in a hot path
> is not ideal.
>
> - Using cpuset was not considered since it is quite tricky, given there
> are different versions and cgroups is natively user-driven.
>
> v1: https://lore.kernel.org/all/20251119124449.1149616-1-sshegde@xxxxxxxxxxxxx/#t
> earlier versions: https://lore.kernel.org/all/236f4925-dd3c-41ef-be04-47708c9ce129@xxxxxxxxxxxxx/
>
> TODO:
> - Splicing of CPUs across NUMA nodes when CPUs aren't split equally.
> - irq affinity when irqbalance=n. Not sure if this is worth it.
> - Avoid running any unbound housekeeping work on non-preferred CPUs,
> such as in find_new_ilb. Tried it, but it showed a slight regression in
> the no-noise case, so it was dropped.
> - This currently works only for kernels built with CONFIG_SCHED_SMT. I
> didn't want to sprinkle too many ifdefs there, and I am not sure any
> system needs this feature with !SMT. If so, let me know. Seeing those
> ifdefs makes me wonder: maybe we could clean up CONFIG_SCHED_SMT with
> cpumask_of(cpu) in the !SMT case?
> - Performance numbers in KVM on x86 and s390.
>
> Sorry for sending it this late. This is the series meant for discussion
> at OSPM 2026.
>
>
> Shrikanth Hegde (17):
> sched/debug: Remove unused schedstats
> sched/docs: Document cpu_preferred_mask and Preferred CPU concept
> cpumask: Introduce cpu_preferred_mask
> sysfs: Add preferred CPU file
> sched/core: allow only preferred CPUs in is_cpu_allowed
> sched/fair: Select preferred CPU at wakeup when possible
> sched/fair: load balance only among preferred CPUs
> sched/rt: Select a preferred CPU for wakeup and pulling rt task
> sched/core: Keep tick on non-preferred CPUs until tasks are out
> sched/core: Push current task from non preferred CPU
> sched/debug: Add migration stats due to non preferred CPUs
> sched/feature: Add STEAL_MONITOR feature
> sched/core: Introduce a simple steal monitor
> sched/core: Compute steal values at regular intervals
> sched/core: Handle steal values and mark CPUs as preferred
> sched/core: Mark the direction of steal values to avoid oscillations
> sched/debug: Add debug knobs for steal monitor
>
> .../ABI/testing/sysfs-devices-system-cpu | 11 +
> Documentation/scheduler/sched-arch.rst | 48 ++++
> Documentation/scheduler/sched-debug.rst | 27 +++
> drivers/base/cpu.c | 12 +
> include/linux/cpumask.h | 22 ++
> include/linux/sched.h | 4 +-
> kernel/cpu.c | 6 +
> kernel/sched/core.c | 219 +++++++++++++++++-
> kernel/sched/cpupri.c | 4 +
> kernel/sched/debug.c | 10 +-
> kernel/sched/fair.c | 8 +-
> kernel/sched/features.h | 3 +
> kernel/sched/rt.c | 4 +
> kernel/sched/sched.h | 41 ++++
> 14 files changed, 409 insertions(+), 10 deletions(-)
>
> --
> 2.47.3
>
>