[PATCH v3 00/20] sched: Introduce cpu_preferred_mask and steal-driven vCPU backoff

From: Shrikanth Hegde

Date: Thu May 14 2026 - 11:26:15 EST


This version follows the OSPM26 discussion[1]. There was a good
discussion around this problem, and there was feedback on some of the
implementation bits. Some of it has been tried or implemented, and a
few items have been deferred.

*** Review and feedback are much appreciated!! ***

[1]: https://youtu.be/adxUKFPlOp0

Briefly, the core idea is:
- Maintain a set of CPUs which can be used by the workload. It is denoted
as cpu_preferred_mask.
- Periodically compute the steal time. If the steal time is above the high
threshold or below the low threshold, reduce or increase the preferred
CPUs respectively.
- If a CPU is marked as non-preferred, push the task running on it if
possible.
- Use this CPU state in wakeup and load balance to ensure tasks run
within preferred CPUs.
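
To make the loop above concrete, here is a minimal userspace sketch, not
the code from the series. Everything in it is an assumption made for
illustration: the 64-bit masks standing in for cpumasks, the names
STEAL_HIGH_PCT, STEAL_LOW_PCT, steal_tick() and pick_cpu(), the threshold
values, and the policy of taking out the highest-numbered preferred CPU
first. The fallback to online CPUs in pick_cpu() is also an assumption
about what "when possible" means for a pinned task.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical thresholds, in percent of the sampling window. */
#define STEAL_HIGH_PCT 20
#define STEAL_LOW_PCT   5

struct cpu_state {
	uint64_t online;    /* bit n set: CPU n is online */
	uint64_t preferred; /* must stay a subset of online */
};

/* Index of the lowest set bit, or -1 if the mask is empty. */
static int lowest_cpu(uint64_t mask)
{
	for (int i = 0; i < 64; i++)
		if (mask & (UINT64_C(1) << i))
			return i;
	return -1;
}

/* Index of the highest set bit, or -1 if the mask is empty. */
static int highest_cpu(uint64_t mask)
{
	for (int i = 63; i >= 0; i--)
		if (mask & (UINT64_C(1) << i))
			return i;
	return -1;
}

/* One periodic sample: shrink on high steal, grow on low steal. */
static void steal_tick(struct cpu_state *s, unsigned int steal_pct)
{
	if (steal_pct > STEAL_HIGH_PCT) {
		/* Only shrink while more than one CPU is preferred. */
		if (s->preferred & (s->preferred - 1))
			s->preferred &= ~(UINT64_C(1) << highest_cpu(s->preferred));
	} else if (steal_pct < STEAL_LOW_PCT) {
		int cpu = lowest_cpu(s->online & ~s->preferred);

		if (cpu >= 0)
			s->preferred |= UINT64_C(1) << cpu;
	}
	/* Between the two thresholds the mask is left alone. */
	assert((s->preferred & ~s->online) == 0);
}

/*
 * Wakeup-time placement: use a preferred CPU from the task's affinity
 * if there is one, otherwise fall back to any allowed online CPU so
 * that user affinities are not broken.
 */
static int pick_cpu(const struct cpu_state *s, uint64_t allowed)
{
	uint64_t candidates = allowed & s->preferred;

	if (!candidates)
		candidates = allowed & s->online;
	return lowest_cpu(candidates);
}
```

With four online CPUs, one sample above the high threshold takes CPU 3
out, a sample between the thresholds changes nothing, and a sample below
the low threshold brings CPU 3 back. A task pinned only to CPU 3 is still
placed on CPU 3 while it is non-preferred.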

For more details on the idea, problem statement and performance numbers,
please refer to the cover letter of v2[2] and the OSPM talk[1].

==========================================================================
Note: This series expects the dependent series mentioned below to be
applied on top of the base (tip/master).
base: 4d034938b6b1 ("Merge branch into tip/master: 'x86/tdx'")
Dependent series: https://lore.kernel.org/all/20260513133934.380347-1-sshegde@xxxxxxxxxxxxx/#t

==========================================================================
Changes since v2[2]:

- Introduce a new config CONFIG_PREFERRED_CPU and make the user select
the config for this feature. This was suggested by Yury Norov.
This removes the dependency on PARAVIRT, which should make s390
folks happy.

- With CONFIG_PREFERRED_CPU=n, the preferred state is the same as the
online state.

- With CONFIG_PREFERRED_CPU=y, maintain a design construct such that
preferred is always a subset of online.

- Create a debugfs folder called steal_monitor in sched. Move away from
sched_feat since it has no easy way to run additional code when
enabling/disabling. This is essential when one disables the feature,
since preferred then has to be reset to match online.

- With feature=off, the preferred state is the same as the online state.
The feature is still based on a static key to avoid any runtime overhead.

- Prevent the ifdeffery from spreading to many files. It is now confined
mainly to */sched.h, cpumask.h and debug.c. Some ifdeffery has been kept
to avoid code bloat and to avoid introducing debug files when config=n.

- Using the active mask instead of a preferred mask (one of the ideas
suggested). This was tried. When steal time is high, a CPU marked as
not active isn't available to workloads which are pinned to it.
That would break user affinities.
The active mask is also heavily used and well known. So decided
not to use it.

- Support the feature for CONFIG_SCHED_SMT=y. Note that some may have
interpreted my comment as being about supporting SMT or not. It was
actually about CONFIG_SCHED_SMT=n (which is rare, btw). It was due to
the ifdeffery around cpu_smt_mask, which was not pretty.
With the effort of removing the ifdeffery around it [3], this series
supports CONFIG_SCHED_SMT=n too.

- Introduce arch-specific handling for inc/dec of preferred CPUs. This was
an ask from s390, as it may have a good hint from HW on which specific
CPUs to take out. I am hoping the current hooks work for s390. Please
let me know if they work or not.

- Added comments around the O(N^2) complexity in rare cases for
select_fallback_rq. (Yury Norov)

- irqbalance=n was considered not important. It was quite hard to
send interrupts on non-preferred CPUs as well. There was a patch sent[4]
as a reply to the previous version which covers irqbalance=y.

- Performance numbers from v2 (x86, powerpc, s390) showed nice
improvements in some cases without any major regression. Numbers are
expected to be similar for this series.
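
The construct described in the items above (preferred is a subset of
online, and equals online when the feature is off) can be sketched as a
small userspace model. This is an illustration under assumptions, not the
kernel code: struct pref_model and all function names are hypothetical, a
plain bool stands in for the static key, and treating a newly onlined CPU
as preferred is my assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct pref_model {
	uint64_t online;
	uint64_t preferred;
	bool monitor_enabled; /* stands in for the static key */
};

static void check_construct(const struct pref_model *m)
{
	/* preferred must always be a subset of online ... */
	assert((m->preferred & ~m->online) == 0);
	/* ... and identical to online while the feature is off. */
	if (!m->monitor_enabled)
		assert(m->preferred == m->online);
}

/* Hotplug path: an offlined CPU can never remain preferred. */
static void set_cpu_online(struct pref_model *m, int cpu, bool online)
{
	uint64_t bit = UINT64_C(1) << cpu;

	if (online) {
		m->online |= bit;
		m->preferred |= bit; /* assumption: starts out preferred */
	} else {
		m->online &= ~bit;
		m->preferred &= ~bit;
	}
	check_construct(m);
}

/* Steal monitor path: a no-op while the feature is disabled. */
static void set_cpu_preferred(struct pref_model *m, int cpu, bool preferred)
{
	uint64_t bit = UINT64_C(1) << cpu;

	if (!m->monitor_enabled)
		return;
	if (preferred)
		m->preferred |= bit & m->online; /* offline CPUs are ignored */
	else
		m->preferred &= ~bit;
	check_construct(m);
}

/*
 * Enable/disable path: this is the "additional code" that has to run
 * on disable, resetting preferred back to online.
 */
static void set_monitor_enabled(struct pref_model *m, bool enable)
{
	m->monitor_enabled = enable;
	if (!enable)
		m->preferred = m->online;
	check_construct(m);
}
```

The point of the model is the disable path: flipping the flag alone would
leave stale non-preferred CPUs behind, which is why a plain feature bit
without a callback is not enough.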

==========================================================================
TODO/OPEN Questions:

- SCHED_EXT is still pending. I tried adding a few checks in
scx_idle_test_and_clear_cpu and pick_idle_cpu_in_node, and pushing the
sched_ext task in the tick. But it still hasn't worked with scx_simple.
I will try to figure it out, but I may need help since
I am yet to wade into the deeper waters of sched_ext.

- Use a PELT-like signal to smooth the steal time. This may help
avoid oscillations. The current one works to a certain extent.

- NUMA splicing when dec/inc preferred CPUs. Left as is for now, since the
simple method works quite well. NUMA splicing is going to be heavy.
Is it really necessary? Are there common topologies with weird CPU
distributions across NUMA?

- Consider not changing state of isolcpus, since one usually pins the
workload on them anyways. Not typical use case though.

- Corner cases when there are multiple VMs and each may have only one
core. Are those cases worth taking a look at?

- Add cpumask_check at appropriate places.

- Currently it works if all the guests enable the feature. If not, one
guest may take advantage of another. Does that need to be fixed? Since
this has to be enabled by admins, is that still a valid concern?
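
For the smoothing item in the list above, one possible starting point is
a plain integer EWMA. This is only a sketch and not PELT itself: the names
STEAL_SCALE, STEAL_EWMA_SHIFT and steal_ewma(), the fixed-point unit and
the 1/4 weight are all assumptions chosen for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define STEAL_SCALE      10000 /* hypothetical unit: 100.00% == 10000 */
#define STEAL_EWMA_SHIFT 2     /* hypothetical: a new sample weighs 1/4 */

/*
 * avg_new = avg + (sample - avg) / 2^STEAL_EWMA_SHIFT
 *
 * The integer division truncates toward zero, so a difference smaller
 * than 2^STEAL_EWMA_SHIFT leaves the average unchanged.
 */
static uint32_t steal_ewma(uint32_t avg, uint32_t sample)
{
	int32_t diff;

	assert(avg <= STEAL_SCALE && sample <= STEAL_SCALE);
	diff = (int32_t)sample - (int32_t)avg;
	return (uint32_t)((int32_t)avg + diff / (1 << STEAL_EWMA_SHIFT));
}
```

Starting from an average of 0, a single 40% sample moves the average to
only 10%, so a one-off spike would not cross a threshold placed above
that, while sustained steal still pulls the average up over a few samples.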

[2] v2: https://lore.kernel.org/all/20260407191950.643549-1-sshegde@xxxxxxxxxxxxx/#t
[3]: https://lore.kernel.org/all/20260506110052.9974-1-sshegde@xxxxxxxxxxxxx/#t
[4]: https://lore.kernel.org/all/8beafb01-f891-4b13-8eae-c6f3face7001@xxxxxxxxxxxxx/


PS: There were several suggestions in the OSPM discussion. Some have been
incorporated; those intentionally deferred, such as sched_ext, are
mentioned above; the rest might have been overlooked.

Please let me know if any specific suggestion should be prioritized
or reconsidered. Please review.

Shrikanth Hegde (20):
sched/debug: Remove unused schedstats
sched/docs: Document cpu_preferred_mask and Preferred CPU concept
kconfig: Provide PREFERRED_CPU option
cpumask: Introduce cpu_preferred_mask
sysfs: Add preferred CPU file
sched/core: allow only preferred CPUs in is_cpu_allowed
sched/fair: Select preferred CPU at wakeup when possible
sched/fair: load balance only among preferred CPUs
sched/rt: Select a preferred CPU for wakeup and pulling rt task
sched/core: Keep tick on non-preferred CPUs until tasks are out
sched/core: Push current task from non preferred CPU
sched/debug: Add migration stats due to non preferred CPUs
sched/debug: Create debugfs folder steal_monitor
sched/debug: Provide debugfs to enable/disable steal monitor
sched/core: Introduce a simple steal monitor
sched/core: Compute steal values at regular intervals
sched/core: Introduce default arch handling code for inc/dec preferred
CPUs
sched/core: Handle steal values and mark CPUs as preferred
sched/core: Mark the direction of steal values to avoid oscillations
sched/debug: Add debug knobs for steal monitor

.../ABI/testing/sysfs-devices-system-cpu | 11 +
Documentation/scheduler/sched-arch.rst | 49 ++++
Documentation/scheduler/sched-debug.rst | 32 +++
drivers/base/cpu.c | 8 +
include/linux/cpumask.h | 21 +-
include/linux/sched.h | 21 +-
kernel/Kconfig.preempt | 13 +
kernel/cpu.c | 16 ++
kernel/sched/core.c | 255 +++++++++++++++++-
kernel/sched/cpupri.c | 1 +
kernel/sched/debug.c | 51 +++-
kernel/sched/fair.c | 6 +-
kernel/sched/rt.c | 4 +
kernel/sched/sched.h | 27 ++
14 files changed, 505 insertions(+), 10 deletions(-)

--
2.47.3