[PATCH v4 0/6] sched: Fix cluster scheduling in the presence of asymmetric capacity

From: Ricardo Neri

Date: Mon Jun 08 2026 - 08:49:07 EST

Hi,

This is v4 of the series. The most important change in this version is a
pre-work patch to fix a bug that surfaced after the SMT-aware asymmetric
CPU capacity patchset from Andrea and Prateek [1] was applied. This led me
to do more testing. Please read the changelog for details.

Cluster scheduling aims to maximize performance by spreading load across
clusters of CPUs that share mid-level resources [2]. It works well on
uniform systems, but it breaks down on topologies with big and small
cores arranged in clusters. As a result, it fails on several generations
of Intel processors, both shipped and upcoming.

Consider the topology below of big (B) cores and clusters of small (s)
cores.

    ------   ------
    | B  |   | B  |   -----------------   -----------------
    |    |   |    |   | s | s | s | s |   | s | s | s | s |
    ------   ------   -----------------   -----------------
    | L2 |   | L2 |   |      L2       |   |      L2       |
    -------------------------------------------------------
    |                         L3                          |
    -------------------------------------------------------

On a partially busy system (one with idle CPUs; busy CPUs have one task
each), scheduling for asymmetric capacity ensures that misfit tasks land on
the big CPUs. The remaining tasks, misfit or not, run on the small CPUs.
When CONFIG_SCHED_CLUSTER is enabled, these remaining tasks are supposed to
be evenly spread among the small-CPU clusters. Today, this does not
happen.

Several issues in the load balancer prevent a small CPU in one cluster
from pulling tasks from another:

 a) update_sd_pick_busiest() may select a fully_busy group with higher
    per-CPU capacity as the busiest, preventing a subsequent fully_busy
    group of equal capacity from being correctly selected.
 b) Misfit-load statistics are used to identify tasks that would benefit
    from migrating to bigger CPUs. Accounting misfit load is pointless if
    the destination CPU is equally small, and it also blocks balancing
    between clusters.
 c) Due to b), groups that are truly has_spare or fully_busy get
    misclassified as misfit_task. update_sd_pick_busiest() then skips
    them, since a small destination CPU cannot help with misfit tasks.
 d) Once a busiest group has been identified, sched_balance_find_src_rq()
    will refuse to migrate tasks to CPUs of equal capacity, even when
    doing so is precisely what is required to balance small-CPU clusters.
 e) The SD_PREFER_SIBLING flag is missing from scheduling domains with
    asymmetric capacity, preventing the balancer from equalizing load
    across sibling small-core clusters.

Together, these issues prevent cluster-level balancing on systems with
asymmetric CPU capacity.
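
To make b) and c) concrete, here is a minimal user-space sketch. It is not
code from this series: the toy_* names, the three-value enum, and the
gate_misfit switch are hypothetical and only model the behavior described
above. capacity_greater() is meant to mirror the macro of the same name in
kernel/sched/fair.c; how Patch 4 actually gates the accounting is explained
in that patch.

```c
#include <stdbool.h>

/* Assumed to match the ~5% margin macro in kernel/sched/fair.c. */
#define capacity_greater(cap1, cap2)	((cap1) * 1024 > (cap2) * 1078)

/* Hypothetical, reduced stand-in for the scheduler's group types. */
enum toy_group_type { TOY_HAS_SPARE, TOY_FULLY_BUSY, TOY_MISFIT_TASK };

/* Hypothetical summary of one scheduling group. */
struct toy_group {
	unsigned int nr_cpus;
	unsigned int nr_running;
	unsigned long cpu_capacity;	/* per-CPU capacity of the group */
	unsigned long misfit_load;	/* nonzero if some task does not fit */
};

/*
 * Classify a group as seen from a destination CPU of capacity dst_cap.
 * With gate_misfit == false, misfit load always counts (issue b).
 * With gate_misfit == true, it counts only if the destination is
 * noticeably bigger than the CPUs of the group.
 */
static enum toy_group_type toy_classify(const struct toy_group *g,
					unsigned long dst_cap,
					bool gate_misfit)
{
	bool dst_can_help = capacity_greater(dst_cap, g->cpu_capacity);

	if (g->misfit_load && (!gate_misfit || dst_can_help))
		return TOY_MISFIT_TASK;
	if (g->nr_running < g->nr_cpus)
		return TOY_HAS_SPARE;
	return TOY_FULLY_BUSY;
}

/*
 * A misfit_task group is skipped when the destination CPU is not
 * noticeably bigger (issue c). Other group types remain candidates.
 */
static bool toy_is_candidate(const struct toy_group *g,
			     unsigned long dst_cap, bool gate_misfit)
{
	if (toy_classify(g, dst_cap, gate_misfit) == TOY_MISFIT_TASK)
		return capacity_greater(dst_cap, g->cpu_capacity);
	return true;
}
```

In this model, a fully loaded small-CPU cluster that holds a misfit task is
never a candidate for an equally small destination CPU unless the misfit
accounting is gated. A big destination CPU still sees the group as
misfit_task and can pull from it.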

This series addresses each problem and restores the intended behavior.
Details, rationale, and code changes are explained in each patch.

I tested these patches on Alder Lake, which has both SMT Pcores and
clusters of Ecores, with SMT both disabled and enabled. I also tested on
Lunar Lake and Panther Lake, which have an Ecore cluster not connected to
the L3 cache. I repeated the same experiments with CONFIG_SCHED_CLUSTER
disabled. In all cases, the load balancer behaves as expected.

Link: https://lore.kernel.org/all/20260509180955.1840064-1-arighi@xxxxxxxxxx/ [1]
Link: https://lore.kernel.org/r/20210924085104.44806-1-21cnbao@xxxxxxxxx/ [2]

Changes in v4:
 - Patch 1 (pre-work): Fixed a bug that would block load balancing on SMT
   cores with more than one busy sibling.
 - Patch 2 (pre-work): Fixed a bug that would needlessly update
   sg_overloaded.
 - Patch 5: Reworked logic using a local variable for improved
   readability.
 - Added Reviewed-by tags from Chen Yu, Tim, and Vincent. Thanks!
 - Link to v3: https://lore.kernel.org/r/20260514-rneri-fix-cas-clusters-v3-0-0037869554bd@xxxxxxxxxxxxxxx

Changes in v3:
 - Patch 3: Reverted the inverted runtime capacity check. The inverted
   form resulted in migrations to CPUs of slightly lower capacity. Guarded
   the check for architectural capacity with the sched_cluster_active
   static key.
 - Patch 4: Expanded the patch description to explain the behavior of
   overloaded groups and low-capacity clusters with spare capacity.
 - Added Reviewed-by tags from Christian. Thanks!
 - Link to v2: https://lore.kernel.org/r/20260429-rneri-fix-cas-clusters-v2-0-cd787de35cc6@xxxxxxxxxxxxxxx

Changes in v2:
 - Patch 1: Rewrote patch description for clarity. Added a note
   clarifying that SD_ASYM_CPUCAPACITY and SMT are mutually
   exclusive. (Tim)
 - Patch 2: Fixed a bug where the capacity check inadvertently broke
   the mutual exclusion of the sched_reduced_capacity() path. Keep
   marking the root domain as overloaded when misfit tasks are present
   to allow bigger CPUs to help via newly idle balance. (sashiko)
   Fixed the description to state that capacity_greater() looks for
   differences of ~5% or more, not 20%. (Christian)
 - Patch 3: Use arch_scale_cpu_capacity() instead of capacity_of() to
   ignore runtime capacity variability. Inverted the capacity check.
   (Christian)
 - Patch 4: Reworded the patch description for clarity.
 - Link to v1: https://lore.kernel.org/r/20260330-rneri-fix-cas-clusters-v1-0-1e465b6fecb2@xxxxxxxxxxxxxxx/
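
For readers who want to check the ~5% versus 20% distinction from the v2
notes, below is a small user-space sketch of the two margins. The macro
bodies are reproduced from memory of kernel/sched/fair.c in mainline and
may differ at the base commit of this series, so treat them as an
assumption. noticeably_bigger() is a hypothetical wrapper added only for
illustration.

```c
#include <stdbool.h>

/*
 * Assumed definitions, as found in kernel/sched/fair.c:
 *  - fits_capacity(): 'cap' fits in 'max' with a ~20% margin.
 *  - capacity_greater(): 'cap1' exceeds 'cap2' by ~5% or more.
 */
#define fits_capacity(cap, max)		((cap) * 1280 < (max) * 1024)
#define capacity_greater(cap1, cap2)	((cap1) * 1024 > (cap2) * 1078)

/* Hypothetical helper: can a destination CPU be considered bigger? */
static bool noticeably_bigger(unsigned long dst_cap, unsigned long src_cap)
{
	return capacity_greater(dst_cap, src_cap);
}
```

With these definitions, a CPU of capacity 1024 is noticeably bigger than
one of capacity 972 but not than one of capacity 973, and two CPUs of
identical capacity never satisfy capacity_greater() in either direction.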

---
Ricardo Neri (6):
      sched/fair: Do not skip CPUs of similar capacity with busy SMT siblings
      sched/fair: Also gate overloaded status update for SD_ASYM_CPUCAPACITY
      sched/fair: Check CPU capacity before comparing group types during load balance
      sched/fair: Skip misfit load accounting when the destination CPU cannot help
      sched/fair: Allow load balancing between CPUs of identical capacity
      sched/topology: Do not clear SD_PREFER_SIBLING in domains with clusters

 include/linux/sched/sd_flags.h |  3 ++-
 kernel/sched/fair.c            | 57 +++++++++++++++++++++++++++++++-----------
 kernel/sched/topology.c        | 14 +++++++++--
 3 files changed, 56 insertions(+), 18 deletions(-)
---
base-commit: 83313bb25a6ace43b0cb5bde881213e6cfb3b046
change-id: 20250620-rneri-fix-cas-clusters-bb4287d1e152

Best regards,
--
Ricardo Neri <ricardo.neri-calderon@xxxxxxxxxxxxxxx>