Re: [PATCH v4 0/6] sched: Fix cluster scheduling in the presence of asymmetric capacity

From: Ricardo Neri

Date: Tue Jun 09 2026 - 23:20:31 EST

On Tue, Jun 09, 2026 at 09:09:25PM +0100, Christian Loehle wrote:
> On 6/9/26 04:19, Ricardo Neri wrote:
> > On Mon, Jun 08, 2026 at 06:37:41PM +0100, Christian Loehle wrote:
> >> On 6/8/26 13:57, Ricardo Neri wrote:
> >>> Hi,
> >>>
> >>> This is v4 of the series. The most important change in this version is a
> >>> pre-work patch to fix a bug that surfaced after the SMT-aware asymmetric
> >>> CPU capacity patchset from Andrea and Prateek [1] was applied. This led me
> >>> to do more testing. Please read the changelog for details.
> >>>
> >>> Cluster scheduling aims to maximize performance by spreading load across
> >>> clusters of CPUs that share mid-level resources [2]. It works well on
> >>> uniform systems, but it breaks down on topologies with big and small
> >>> cores arranged in clusters. As a result, it fails on several generations
> >>> of Intel processors, both already shipped and upcoming.
> >>>
> >>> Consider the topology below of big (B) cores and clusters of small (s)
> >>> cores.
> >>>     ------   ------
> >>>     | B  |   | B  |   -----------------   -----------------
> >>>     |    |   |    |   | s | s | s | s |   | s | s | s | s |
> >>>     ------   ------   -----------------   -----------------
> >>>     | L2 |   | L2 |   |      L2       |   |      L2       |
> >>>     -------------------------------------------------------
> >>>     |                         L3                          |
> >>>     -------------------------------------------------------
> >>>
> >>> On a partially busy system (one with idle CPUs; busy CPUs have one task
> >>> each), scheduling for asymmetric capacity ensures that misfit tasks land on
> >>> the big CPUs. The remaining tasks, misfit or not, run on the small CPUs.
> >>> When CONFIG_SCHED_CLUSTER is enabled, these remaining tasks are supposed to
> >>> be evenly spread among the small-CPU clusters. Today, this does not
> >>> happen.
> >>>
> >>> Several issues in the load balancer prevent a small CPU in one cluster
> >>> from pulling tasks from another:
> >>>
> >>> a) update_sd_pick_busiest() may select a fully_busy group with higher
> >>>    per-CPU capacity as the busiest, preventing a subsequent fully_busy
> >>>    group of equal capacity from being correctly selected.
> >>> b) Misfit-load statistics are used to identify tasks that would benefit
> >>>    from migrating to bigger CPUs. Accounting misfit load is pointless if
> >>>    the destination CPU is equally small, and it also blocks balancing
> >>>    between clusters.
> >>> c) Due to b), groups that are truly has_spare or fully_busy get
> >>>    misclassified as misfit_task. update_sd_pick_busiest() then skips
> >>>    them, since a small destination CPU cannot help with misfit tasks.
> >>> d) Once a busiest group has been identified, sched_balance_find_src_rq()
> >>>    will refuse to migrate tasks to CPUs of equal capacity, even when
> >>>    doing so is precisely what is required to balance small-CPU clusters.
> >>> e) The SD_PREFER_SIBLING flag is missing from scheduling domains with
> >>>    asymmetric capacity, preventing the balancer from equalizing load
> >>>    across sibling small-core clusters.
> >>>
> >>> Together, these issues prevent cluster-level balancing on systems with
> >>> asymmetric CPU capacity.
> >>>
> >>> This series addresses each problem and restores the intended behavior.
> >>> Details, rationale, and code changes are explained in each patch.
> >>>
> >>> I tested these patches on Alder Lake, which has both SMT Pcores and
> >>> clusters of Ecores. I tested with SMT both disabled and enabled. I also
> >>> tested on Lunar Lake and Panther Lake, which have an Ecore cluster not
> >>> connected to the L3 cache. I repeated the same experiment with
> >>> CONFIG_SCHED_CLUSTER disabled. The load balancer behaves as expected.
> >>>
> >>> Link: https://lore.kernel.org/all/20260509180955.1840064-1-arighi@xxxxxxxxxx/ [1]
> >>> Link: https://lore.kernel.org/r/20210924085104.44806-1-21cnbao@xxxxxxxxx/ [2]
> >>>
> >>> Changes in v4:
> >>> - Patch 1 (pre-work): Fixed a bug that would block load balancing on SMT
> >>>   cores with more than one busy sibling.
> >>> - Patch 2 (pre-work): Fixed a bug that would needlessly update
> >>>   sg_overloaded.
> >>> - Patch 5: Reworked logic using a local variable for improved
> >>>   readability.
> >>> - Added Reviewed-by tags from Chen Yu, Tim, and Vincent. Thanks!
> >>> - Link to v3: https://lore.kernel.org/r/20260514-rneri-fix-cas-clusters-v3-0-0037869554bd@xxxxxxxxxxxxxxx
> >>>
> >>> Changes in v3:
> >>> - Patch 3: Reverted the inverted runtime capacity check. The inverted
> >>>   form resulted in migrations to CPUs of slightly lower capacity. Guarded
> >>>   the check for architectural capacity with the sched_cluster_active
> >>>   static key.
> >>> - Patch 4: Expanded the patch description to explain the behavior of
> >>>   overloaded groups and low-capacity clusters with spare capacity.
> >>> - Added Reviewed-by tags from Christian. Thanks!
> >>> - Link to v2: https://lore.kernel.org/r/20260429-rneri-fix-cas-clusters-v2-0-cd787de35cc6@xxxxxxxxxxxxxxx
> >>>
> >>> Changes in v2:
> >>> - Patch 1: Rewrote patch description for clarity. Added a note
> >>>   clarifying that SD_ASYM_CPUCAPACITY and SMT are mutually
> >>>   exclusive. (Tim)
> >>> - Patch 2: Fixed a bug where the capacity check inadvertently broke
> >>>   the mutual exclusion of the sched_reduced_capacity() path. Keep
> >>>   marking the root domain as overloaded when misfit tasks are present
> >>>   to allow bigger CPUs to help via newly idle balance. (sashiko)
> >>>   Fixed the description to state that capacity_greater() looks for
> >>>   differences of ~5% or more, not 20%. (Christian)
> >>> - Patch 3: Use arch_scale_cpu_capacity() instead of capacity_of() to
> >>>   ignore runtime capacity variability. Inverted the capacity check.
> >>>   (Christian)
> >>> - Patch 4: Reworded the patch description for clarity.
> >>> - Link to v1: https://lore.kernel.org/r/20260330-rneri-fix-cas-clusters-v1-0-1e465b6fecb2@xxxxxxxxxxxxxxx/
> >>>
> >>> ---
> >>> Ricardo Neri (6):
> >>> sched/fair: Do not skip CPUs of similar capacity with busy SMT siblings
> >>> sched/fair: Also gate overloaded status update for SD_ASYM_CPUCAPACITY
> >>> sched/fair: Check CPU capacity before comparing group types during load balance
> >>> sched/fair: Skip misfit load accounting when the destination CPU cannot help
> >>> sched/fair: Allow load balancing between CPUs of identical capacity
> >>> sched/topology: Do not clear SD_PREFER_SIBLING in domains with clusters
> >>>
> >>> include/linux/sched/sd_flags.h | 3 ++-
> >>> kernel/sched/fair.c | 57 +++++++++++++++++++++++++++++++-----------
> >>> kernel/sched/topology.c | 14 +++++++++--
> >>> 3 files changed, 56 insertions(+), 18 deletions(-)
> >>> ---
> >>> base-commit: 83313bb25a6ace43b0cb5bde881213e6cfb3b046
> >>> change-id: 20250620-rneri-fix-cas-clusters-bb4287d1e152
> >>>
> >>> Best regards,
> >>
> >> Since I don't really have an arm64 machine that hits the described case just
> >> right, I tested the series on a synthetic arm64 qemu topology with two
> >> equal-capacity little clusters and one 1024 cluster.
> >>
> >> The guest was booted with QEMU virt, 8 CPUs and a custom dtb. The resulting
> >> topology is:
> >> cluster0: CPUs 0-1, cpu_capacity=446
> >> cluster1: CPUs 2-3, cpu_capacity=446
> >> cluster2: CPUs 4-7, cpu_capacity=1024
> >> The dtb describes the clusters with cpu-map. The test kernel was built with
> >> CONFIG_SCHED_CLUSTER enabled.
> >>
> >> I used an rt-app workload with 8 (nr_cpus) SCHED_OTHER tasks.
> >> Each task used the same two phases:
> >>   "pinned": {
> >>     "loop": 100,
> >>     "run": 99000,
> >>     "timer": { "ref": "unique", "period": 100000 },
> >>     "cpus": [0, 1, 4, 5, 6, 7]
> >>   },
> >>   "open": {
> >>     "loop": 100,
> >>     "run": 99000,
> >>     "timer": { "ref": "unique", "period": 100000 },
> >>     "cpus": [0, 1, 2, 3, 4, 5, 6, 7]
> >>   }
> >>
> >> The intent is to first force the workload onto cluster0 plus the big cluster,
> >> leaving the second little cluster unused. Then the affinity mask is opened to
> >> all CPUs. If load balancing across equal-capacity clusters works, CPUs 2-3
> >> should receive a meaningful share of the work (instead of only occasional
> >> migrations).
> >>
> >> I counted rt-app sched_switch events per cluster in the open phase. The pass
> >> condition was that cluster1_little receives at least 20% of open-phase rt-app
> >> sched_switch events.
> >>
> >> Results over three runs (for the open phases):
> >>
> >> mainline:
> >> run0: cluster0 5.7%, cluster1 5.7%, big 88.6% FAIL
> >> run1: cluster0 5.5%, cluster1 6.2%, big 88.3% FAIL
> >> run2: cluster0 4.3%, cluster1 4.7%, big 91.0% FAIL
> >>
> >> with this series:
> >> run0: cluster0 38.6%, cluster1 31.4%, big 30.1% PASS
> >> run1: cluster0 33.2%, cluster1 60.6%, big 6.3% PASS
> >> run2: cluster0 33.3%, cluster1 60.6%, big 6.1% PASS
> >>
> >> (The pinned phase behaved as expected in all runs: there were no rt-app
> >> sched_switch samples on CPUs 2-3 before the affinity mask was opened.)
> >>
> >> For the series (except perhaps patch 1/6, which needs a different setup):
> >> Tested-by: Christian Loehle <christian.loehle@xxxxxxx>
> >
> > Many thanks for your tests! I have two questions: Do you see similar
> > results if you spawn fewer tasks than nr_cpus? Perhaps with 6 tasks? They
> > should continue to be evenly distributed on same-capacity clusters.
>
> with 6 tasks:
>
> pinned phase first timestamp: 1850.75
> open phase first timestamp: 1879.02
>
> before_open sched_switch samples: 79
> cpu0: 30
> cpu1: 8
> cpu2: 0
> cpu3: 0
> cpu4: 9
> cpu5: 6
> cpu6: 7
> cpu7: 19
> cluster0_little: 38 (48.1%)
> cluster1_little: 0 (0.0%)
> cluster2_big: 41 (51.9%)
>
> after_open sched_switch samples: 899
> cpu0: 235
> cpu1: 176
> cpu2: 225
> cpu3: 218
> cpu4: 9
> cpu5: 18
> cpu6: 8
> cpu7: 10
> cluster0_little: 411 (45.7%)
> cluster1_little: 443 (49.3%)
> cluster2_big: 45 (5.0%)
>
> after_open sched_migrate_task destination clusters:
> cluster0_little: 380
> cluster1_little: 408
> cluster2_big: 6
>
>
> runtime evaluation:
> before_open runtime: 169.337475s
> cluster0_little: 57.503009s (34.0%)
> cluster1_little: 0.000000s (0.0%)
> cluster2_big: 111.834466s (66.0%)
>
> after_open runtime: 191.293461s
> cluster0_little: 27.926735s (14.6%)
> cluster1_little: 33.729412s (17.6%)
> cluster2_big: 129.637314s (67.8%)
>
> after_open little-to-big rt-app migrations: 2
> 1910.977751 rtapp04-4 cpu0 -> cpu7
> 1911.011555 rtapp03-3 cpu2 -> cpu5
> big-cluster rtapp_task:end count: 6
> first: 1910.975772 last: 1913.225910
>
>
> FWIW I also mirrored the pinned phase (so pinned to cluster1 first):
>
> switch-count evaluation:
> pinned phase first timestamp: 1469.11
> open phase first timestamp: 1498.6
>
> before_open sched_switch samples: 61
> cpu0: 0
> cpu1: 0
> cpu2: 6
> cpu3: 13
> cpu4: 24
> cpu5: 6
> cpu6: 6
> cpu7: 6
> cluster0_little: 0 (0.0%)
> cluster1_little: 19 (31.1%)
> cluster2_big: 42 (68.9%)
>
> after_open sched_switch samples: 883
> cpu0: 221
> cpu1: 178
> cpu2: 226
> cpu3: 219
> cpu4: 10
> cpu5: 7
> cpu6: 11
> cpu7: 11
> cluster0_little: 399 (45.2%)
> cluster1_little: 445 (50.4%)
> cluster2_big: 39 (4.4%)
>
> after_open sched_migrate_task destination clusters:
> cluster0_little: 370
> cluster1_little: 403
> cluster2_big: 8
>
> runtime evaluation:
> before_open runtime: 171.855925s
> cluster0_little: 0.000000s (0.0%)
> cluster1_little: 58.754673s (34.2%)
> cluster2_big: 113.101252s (65.8%)
>
> after_open runtime: 189.467864s
> cluster0_little: 27.279064s (14.4%)
> cluster1_little: 31.392607s (16.6%)
> cluster2_big: 130.796193s (69.0%)
>
> after_open little-to-big rt-app migrations: 3
> 1529.002652 rtapp04-4 cpu0 -> cpu6
> 1529.002902 rtapp02-2 cpu3 -> cpu6
> 1529.778636 rtapp02-2 cpu0 -> cpu7
> big-cluster rtapp_task:end count: 6
> first: 1529.000465 last: 1533.242563

Thanks for the experiment and the details! The patchset works as expected
AFAICS.
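
For anyone reproducing the numbers, the per-cluster shares quoted above can be
recomputed from the per-CPU sched_switch counts. Below is a minimal sketch in
Python. It is not the script Christian used: the helper name cluster_shares()
and the CLUSTERS map are hypothetical, while the CPU-to-cluster layout and the
20% pass threshold come from his description of the qemu setup.

```python
# Hypothetical helper. The CPU-to-cluster map follows the qemu topology
# described in the thread (CPUs 0-1 and 2-3 little, CPUs 4-7 big).
CLUSTERS = {
    "cluster0_little": (0, 1),
    "cluster1_little": (2, 3),
    "cluster2_big": (4, 5, 6, 7),
}


def cluster_shares(per_cpu):
    """Return {cluster: (count, percent)} from a list of per-CPU counts."""
    total = sum(per_cpu)
    shares = {}
    for name, cpus in CLUSTERS.items():
        count = sum(per_cpu[c] for c in cpus)
        # Guard against an empty trace so the helper never divides by zero.
        pct = 100.0 * count / total if total else 0.0
        shares[name] = (count, round(pct, 1))
    return shares


# after_open sched_switch samples from the 6-task run (cpu0..cpu7).
after_open = [235, 176, 225, 218, 9, 18, 8, 10]
shares = cluster_shares(after_open)
for name, (count, pct) in shares.items():
    print(f"{name}: {count} ({pct}%)")

# Pass condition from the 8-task test: cluster1_little gets at least 20%.
passed = shares["cluster1_little"][1] >= 20.0
print("PASS" if passed else "FAIL")
```

With the counts above this gives 411 (45.7%), 443 (49.3%) and 45 (5.0%),
matching the reported after_open totals.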

>
>
> >
> > Also, are these high-utilization tasks? If yes, the high-capacity cluster
> > should be fully utilized before any tasks overflow to the lower-capacity
> > clusters.
>
> Yes, sorry I should've mentioned, the above rt-app tasks will use 99% of the
> capacity of a 1024 CPU, so the expected behavior is that all CPUs are used.
> Once the 1024 CPUs finish (as they should finish in ~half the time), tasks
> of any little cluster will be upmigrated to the 1024.
> I did quickly check if that is the case, which it was,

Great! Then it seems I didn't break anything.

> but that part is
> definitely more than wonky on qemu (as the capacities are just in the dtb,
> the CPUs are of course vCPUs which behave very noisily and with no correlation
> to the capacity value).

Indeed! :)
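
As a back-of-the-envelope check on the "~half the time" estimate: if the time
to finish a fixed amount of work is assumed to scale inversely with
cpu_capacity (an idealization that, as noted, vCPUs under qemu do not honor),
the ratio between the 446 and 1024 CPUs can be sketched as follows. The
function name relative_runtime() is hypothetical.

```python
def relative_runtime(capacity, ref_capacity=1024):
    """Time to finish a fixed amount of work, relative to a ref_capacity CPU.

    Assumption: throughput is proportional to cpu_capacity.
    """
    return ref_capacity / capacity


# cpu_capacity of the little clusters in the qemu topology from the thread.
little = relative_runtime(446)
print(f"little/big runtime ratio: {little:.2f}")
print(f"big finishes in {1 / little:.0%} of the little runtime")
```

Under that assumption a little CPU needs roughly 2.3 times as long, so the
1024 CPUs would finish in a bit under half the time.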