Re: [PATCH 0/4] sched/fair: SMT-aware asymmetric CPU capacity
From: Dietmar Eggemann
Date: Mon Mar 30 2026 - 18:32:45 EST
Hi Andrea,
On 26.03.26 16:02, Andrea Righi wrote:
[...]
> This patch set has been tested on the new NVIDIA Vera Rubin platform, where
> SMT is enabled and the firmware exposes small frequency variations (+/-~5%)
> as differences in CPU capacity, resulting in SD_ASYM_CPUCAPACITY being set.
>
> Without these patches, performance can drop up to ~2x with CPU-intensive
> workloads, because the SD_ASYM_CPUCAPACITY idle selection policy does not
> account for busy SMT siblings.
>
> Alternative approaches have been evaluated, such as equalizing CPU
> capacities, either by exposing uniform values via firmware (ACPI/CPPC) or
> normalizing them in the kernel by grouping CPUs within a small capacity
> window (+-5%) [1][2], or enabling asympacking [3].
>
> However, adding SMT awareness to SD_ASYM_CPUCAPACITY has shown better
> results so far. Improving this policy also seems worthwhile in general, as
> other platforms in the future may enable SMT with asymmetric CPU
> topologies.
I still wonder whether we really need select_idle_capacity() (plus the
smt part) for asymmetric CPU capacity systems where the CPU capacity
differences are < 5% of SCHED_CAPACITY_SCALE.
The known example would be the NVIDIA Grace (!smt) server with its
slightly different perf_caps.highest_perf values.
We ran DCPerf MediaWiki on this system with:
(1) ASYM_CPUCAPACITY (default)
(2) NO ASYM_CPUCAPACITY
We also ran it on a comparable ARM64 server (!smt):
(1) ASYM_CPUCAPACITY
(2) NO ASYM_CPUCAPACITY (default)
Both systems have 72 CPUs, run v6.8 and have a single MC sched domain
with the LLC spanning all 72 CPUs. During the tests there were ~750
tasks, among them the workload-related ones:
#hhvmworker 147
#mariadbd 204
#memcached 11
#nginx 8
#wrk 144
#ProxygenWorker 1
load_balance:
not_idle 3x more on (2)
idle 2x more on (2)
newly_idle 2-10x more on (2)
wakeup:
move_affine 2-3x more on (1)
ttwu_local 1.5-2x more on (2)
We also instrumented all the bailout conditions in select_idle_sibling()
(sis()) -> select_idle_cpu() and select_idle_capacity() (sic()).
In (2) almost all wakeups end up in select_idle_cpu() returning -1,
because 'sd->shared->nr_idle_scan' under SIS_UTIL is 0. So sis() in (2)
almost always returns target (this_cpu or prev_cpu). sic() doesn't do
this.
What I haven't done is to try (1) with SIS_UTIL or (2) with NO_SIS_UTIL.
I wonder whether this is the underlying reason for the benefit of (1)
over (2) we see here with smt now?
So IMHO, before adding smt support to (1) for these small CPPC-based CPU
capacity differences, we should make sure that the same can't be
achieved by disabling SIS_UTIL or by softening it a bit.
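For reference, SIS_UTIL can be toggled at runtime via the scheduler
features file in debugfs (assuming CONFIG_SCHED_DEBUG and debugfs
mounted at /sys/kernel/debug):

```shell
# Disable SIS_UTIL for the NO_SIS_UTIL runs
echo NO_SIS_UTIL > /sys/kernel/debug/sched/features

# Re-enable it afterwards
echo SIS_UTIL > /sys/kernel/debug/sched/features

# Check the current state
grep -Eo 'NO_SIS_UTIL|SIS_UTIL' /sys/kernel/debug/sched/features
```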
So, does (2) with NO_SIS_UTIL perform worse than (1) with your
smt-related add-ons in sic()?