Re: [PATCH v5 0/5] sched/fair: SMT-aware asymmetric CPU capacity
From: Dietmar Eggemann
Date: Tue May 05 2026 - 16:42:50 EST
On 28.04.26 16:41, Andrea Righi wrote:
[...]
> - DCPerf MediaWiki (all CPUs):
>
> +---------------------------------+--------+--------+--------+--------+
> | Configuration | rps | p50 | p95 | p99 |
> +---------------------------------+--------+--------+--------+--------+
> | ASYM (mainline) + SIS_UTIL | 7994 | 0.052 | 0.223 | 0.246 |
> | ASYM (mainline) + NO_SIS_UTIL | 7993 | 0.052 | 0.221 | 0.245 |
> | | | | | |
> | NO ASYM + SIS_UTIL | 8113 | 0.067 | 0.184 | 0.225 |
> | NO ASYM + NO_SIS_UTIL | 8093 | 0.068 | 0.184 | 0.223 |
> | | | | | |
> | ASYM + SMT + SIS_UTIL | 8129 | 0.076 | 0.149 | 0.188 |
> | ASYM + SMT + NO_SIS_UTIL | 8138 | 0.076 | 0.148 | 0.186 |
> +---------------------------------+--------+--------+--------+--------+
>
> In the MediaWiki case SMT awareness is less impactful, because for the majority
> of the run all CPUs are used, but it still seems to provide some benefits at
> reducing tail latency.
>
> Tests have also been conducted on NVIDIA Grace (which does not support SMT) to
> ensure that SIS_UTIL support in select_idle_capacity() does not introduce
> regressions and results show slight improvements under the same workloads.

Somewhat unrelated to this SMT extension, but I always wanted to know why,
even with !SMT (e.g. Grace), we see better values w/ ASYM.

DCPerf MediaWiki: Grace 72 CPUs, ~800 tasks (last test run):

+---------------------------------+--------+--------+--------+--------+
| Configuration | rps | p50 | p95 | p99 |
+---------------------------------+--------+--------+--------+--------+
| v6.8 NO ASYM | 4470 | 0.026 | 0.040 | 0.046 |
| v6.8 ASYM | 4636 | 0.022 | 0.037 | 0.043 |
+---------------------------------+--------+--------+--------+--------+

Values from run_details.json: Wrk RPS, Nginx P50 {, P90, P95, P99} time.

I always got 4%-5% higher rps and slightly better latencies w/ ASYM.
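
For reference, the relative deltas for the last run in the table above can
be computed with a small sketch like this. The numbers are copied from the
table; the names RUNS and relative_change_pct() are made up for
illustration and are not part of DCPerf or run_details.json.

```python
# Values copied from the "last test run" table (Grace, 72 CPUs).
RUNS = {
    "v6.8 NO ASYM": {"rps": 4470, "p50": 0.026, "p95": 0.040, "p99": 0.046},
    "v6.8 ASYM":    {"rps": 4636, "p50": 0.022, "p95": 0.037, "p99": 0.043},
}


def relative_change_pct(base, new):
    """Return the per-metric change of `new` relative to `base`, in percent."""
    return {k: (new[k] - base[k]) / base[k] * 100.0 for k in base}


deltas = relative_change_pct(RUNS["v6.8 NO ASYM"], RUNS["v6.8 ASYM"])

# Positive is better for rps, negative is better for the latencies.
for metric, pct in deltas.items():
    print(f"{metric}: {pct:+.1f}%")
```

For this particular run the rps gain comes out at about 3.7%, with all
three latency percentiles lower under ASYM.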

Possible explanation:

NO_ASYM

 * More local wakeups
 * sis()->select_idle_cpu() runs pretty fast into SIS_UTIL !nr_idle_scan
   -> falls back to pick this_cpu or prev_cpu
 * Causes more runqueue contention -> more load balancing
 * More short idle periods + migrations

ASYM

 * More remote wakeups
 * select_idle_capacity() always scans sd_asym
 * Less balancing needed; CPUs go idle less often but for longer
 * Better placement -> less contention -> higher rps

AFAICS, in this high-load scenario, ASYM avoids the !nr_idle_scan
bailout, spreading tasks more effectively and so reducing contention and
balancing overhead.
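
To make the hypothesis concrete, here is a toy model of the two placement
behaviours. It is only a sketch in Python, not kernel code: the function
names, the 8-CPU layout and the "runqueue depth" bookkeeping are all
assumptions for illustration. It also ignores the idle checks on
target/prev that happen before any scan and the capacity-fit test in
select_idle_capacity().

```python
def place_with_scan_budget(idle, target, nr_idle_scan):
    """Toy NO_ASYM path: scan at most nr_idle_scan CPUs after target.

    A budget of 0 models the SIS_UTIL !nr_idle_scan bailout: no scan at
    all, the task stays on target (standing in for this_cpu/prev_cpu).
    """
    if nr_idle_scan == 0:
        return target
    n = len(idle)
    for off in range(1, min(nr_idle_scan, n - 1) + 1):
        cpu = (target + off) % n
        if idle[cpu]:
            return cpu
    return target


def place_full_scan(idle, target):
    """Toy ASYM path: always walk the whole domain looking for an idle CPU."""
    n = len(idle)
    for off in range(n):
        cpu = (target + off) % n
        if idle[cpu]:
            return cpu
    return target


def wake_tasks(nr_cpus, busy_cpus, nr_wakeups, target, place):
    """Wake nr_wakeups tasks with the same target; return runqueue depths."""
    depth = [1 if cpu in busy_cpus else 0 for cpu in range(nr_cpus)]
    for _ in range(nr_wakeups):
        idle = [d == 0 for d in depth]
        depth[place(idle, target)] += 1
    return depth


# 8 CPUs, CPUs 0-3 busy, CPUs 4-7 idle, 4 wakeups all targeting CPU 0.
bailout = wake_tasks(8, {0, 1, 2, 3}, 4, 0,
                     lambda idle, t: place_with_scan_budget(idle, t, 0))
full = wake_tasks(8, {0, 1, 2, 3}, 4, 0, place_full_scan)

print("bailout  :", bailout, "max depth", max(bailout))
print("full scan:", full, "max depth", max(full))
```

In this model the bailout path stacks every wakeup on CPU 0 and leaves
CPUs 4-7 idle until something else moves the tasks, while the full scan
fills the idle CPUs directly. Whether that is what actually happens on
Grace is exactly what would need checking.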
Do you have a chance to check this on mainline on your Grace machine?
[...]