[RFC PATCH 0/8] sched: Extend cache-aware scheduling into topology-aware scheduling

From: Jianyong Wu

Date: Wed Jun 24 2026 - 23:09:17 EST

The current cache-aware scheduling implementation adopts an
LLC-centric task aggregation model. While effective for workloads
that fit within a single LLC domain, this design is fundamentally
limited by a fixed aggregation scope that cannot scale across
scheduling domains.

This leads to a single structural limitation: the lack of
topology-scalable task aggregation. When workload size exceeds
the capacity of an LLC domain, the scheduler cannot extend
aggregation to higher-level domains, and locality cannot be
preserved effectively. At the same time, higher-level topology
information such as NUMA domains cannot be consistently utilized
for placement decisions.

This patch set addresses this limitation by extending
cache-aware scheduling into topology-aware task aggregation.
The aggregation scope becomes hierarchical and can dynamically
expand or contract across scheduling domains based on workload
demand.

Task aggregation starts at the MC or LLC domain under light load,
expands to NUMA and higher-level domains as load increases, and
contracts again when load decreases.
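
As a rough illustration of the expand/contract behaviour described above,
the sketch below picks the narrowest domain level that can hold a given
demand. It is a toy model, not the logic in this series: the level names,
the use of a runnable-task count as the demand metric, and the
`tolerance_pct` knob are all assumptions made for the example. The CPU
counts match the test machine described later in this letter.

```python
# Hypothetical model: domain levels from narrowest to widest, with the
# number of logical CPUs each one spans (illustrative values only).
LEVELS = [("LLC", 16), ("NUMA", 32), ("SOCKET", 192), ("SYSTEM", 384)]


def pick_scope(runnable_tasks, levels=LEVELS, tolerance_pct=100):
    """Return the narrowest level whose CPU budget covers the demand.

    tolerance_pct is a made-up knob: 100 allows one task per CPU, a
    lower value widens the scope earlier.
    """
    for name, cpus in levels:
        if runnable_tasks * 100 <= cpus * tolerance_pct:
            return name
    # Demand exceeds every level: fall back to the widest scope.
    return levels[-1][0]


for demand in (8, 16, 17, 40, 200):
    print(demand, "->", pick_scope(demand))
```

Because the scope is recomputed from the current demand, the same function
models both directions: a growing demand moves the result outward, and a
shrinking demand moves it back toward the LLC level.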

This design improves locality across different workload sizes
and system topologies.

The interaction with NUMA balancing is also improved by clearly
separating responsibilities: cache-aware scheduling handles task
placement and migration, while NUMA balancing handles memory
placement. This allows both mechanisms to align toward the same
NUMA domain, reducing remote memory access.

This approach is particularly beneficial on systems with deep
CPU topology hierarchies and relatively small LLC domains, where
a fixed LLC-centric aggregation model is insufficient to
maintain locality under higher load. For example, modern server
systems with multiple NUMA nodes and relatively small
per-domain cache capacities often require cross-domain
scheduling to sustain locality at scale.

The performance data below was collected on a Hygon x86 server with
the following topology:

* 2 sockets
* 6 NUMA nodes per socket
* 2 LLC domains per NUMA node
* 8 cores per LLC domain
* 2 SMT threads per core
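
For reference, the number of logical CPUs at each level follows directly
from the list above. The short calculation below assumes all cores are
online with SMT enabled, and it measures "capacity" in logical CPUs, which
is an interpretation made for this example.

```python
# Topology from the list above, ordered from innermost to outermost.
# Each value is how many units of the previous level one unit contains.
FANOUT = [
    ("core", 2),    # SMT threads per core
    ("llc", 8),     # cores per LLC domain
    ("numa", 2),    # LLC domains per NUMA node
    ("socket", 6),  # NUMA nodes per socket
    ("system", 2),  # sockets
]


def cpus_per_level(fanout):
    """Return {level: logical CPUs contained in one unit of that level}."""
    sizes = {}
    total = 1
    for level, count in fanout:
        total *= count
        sizes[level] = total
    return sizes


SIZES = cpus_per_level(FANOUT)
for level, cpus in SIZES.items():
    print(f"{level:7s} {cpus:4d} CPUs")
```

Under these assumptions an LLC domain spans 16 logical CPUs, a NUMA node
32, a socket 192, and the whole system 384.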

The baseline kernel is 4b99990cdf95, which includes the cache-aware
scheduling feature.

Unless otherwise noted, all tests were performed with
`/sys/kernel/debug/sched/llc_balancing/aggr_tolerance` set to 90.

[hackbench]
NUMA Balancing is disabled.
(lower is better, normalized to baseline)

test cmd: hackbench -T -p -f $f -g $g -l 100000

case              load       baseline   patched   improvement
=====================================================================
threads-pipe-2    1-groups   1.00       0.978      +2.2%
threads-pipe-2    2-groups   1.00       1.037      -3.7%
threads-pipe-4    1-groups   1.00       1.054      -5.4%
threads-pipe-4    2-groups   1.00       1.229     -22.9%
threads-pipe-8    1-groups   1.00       1.106     -10.6%
threads-pipe-8    2-groups   1.00       0.528     +47.2%
threads-pipe-16   1-groups   1.00       0.503     +49.7%
threads-pipe-16   2-groups   1.00       0.562     +43.8%
threads-pipe-32   1-groups   1.00       0.627     +37.3%
threads-pipe-32   2-groups   1.00       0.615     +38.5%
threads-pipe-48   1-groups   1.00       0.684     +31.6%
threads-pipe-48   2-groups   1.00       0.776     +22.4%

For the pipe-4 2-group and pipe-8 1-group workloads, the baseline kernel
aggregates most of the 16 threads within a single LLC domain, while the
patched kernel expands aggregation to the NUMA level. Since the workload
still fits within an LLC domain, the baseline benefits from stronger cache
locality, leading to an expected performance regression with the patched
kernel (22.9% and 10.6%, respectively). Notably, with overaggr_pct set to
50, the observed behavior of the baseline kernel is somewhat unexpected
and may warrant further investigation.

Once the number of hackbench threads exceeds the capacity of a single
LLC domain, the fixed LLC-centric aggregation model becomes less
effective. In contrast, the patched kernel can dynamically expand task
aggregation to higher scheduling domains, resulting in substantial
performance gains over the baseline.
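
To relate the hackbench cases to domain capacity, the sketch below computes
the task count for each case and the smallest domain that could hold it at
one task per logical CPU. It assumes that "threads-pipe-N" corresponds to
`-f N`, that each hackbench group runs N senders plus N receivers, and that
an LLC domain and a NUMA node span 16 and 32 logical CPUs on the test
machine. The helper names are made up for this example.

```python
LLC_CPUS = 16   # 8 cores x 2 SMT threads (assumed all online)
NUMA_CPUS = 32  # 2 LLC domains per NUMA node


def hackbench_tasks(fds, groups):
    """Each hackbench group runs `fds` senders plus `fds` receivers."""
    return 2 * fds * groups


def smallest_fit(tasks):
    """Smallest domain holding `tasks` at one task per logical CPU."""
    if tasks <= LLC_CPUS:
        return "LLC"
    if tasks <= NUMA_CPUS:
        return "NUMA node"
    return "beyond one NUMA node"


for fds in (2, 4, 8, 16, 32, 48):
    for groups in (1, 2):
        tasks = hackbench_tasks(fds, groups)
        print(f"pipe-{fds:<2d} {groups}-groups: "
              f"{tasks:3d} tasks, {smallest_fit(tasks)}")
```

Under these assumptions the first case that no longer fits within one LLC
domain is pipe-8 with 2 groups (32 tasks), which is also the first row of
the table showing a large gain.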

[schbench]
NUMA Balancing is disabled.
p99 wakeup latency (lower is better, normalized to baseline)

threads       baseline   patched   improvement
================================================
2-threads     1.00       0.900     +10.0%
4-threads     1.00       1.000      +0.0%
8-threads     1.00       0.968      +3.2%
16-threads    1.00       0.877     +12.3%
32-threads    1.00       0.794     +20.6%
64-threads    1.00       0.852     +14.8%
128-threads   1.00       0.954      +4.6%

Once the number of threads exceeds the capacity of a single LLC domain,
the patched kernel consistently delivers performance improvements, with
no performance regressions observed.

[MySQL]
point_select test with NUMA balancing enabled
(higher is better, normalized to baseline):

thread num   baseline   patched       improvement
======================================================
4            1.00       1.70620013     +70.62%
8            1.00       1.201839311    +20.18%
16           1.00       1.087489969     +8.75%
32           1.00       1.150214081    +15.02%
64           1.00       1.194663894    +19.47%
128          1.00       0.95585509      -4.41%
256          1.00       1.027373011     +2.74%

delete test with NUMA balancing enabled
(higher is better, normalized to baseline):

thread num   baseline   patched       improvement
=======================================================
4            1.00       1.186089537    +18.61%
8            1.00       1.288780932    +28.88%
16           1.00       1.078755447     +7.88%
32           1.00       1.473220484    +47.32%
64           1.00       4.601490272   +360.15%
128          1.00       2.360467168   +136.05%
256          1.00       1.059600923     +5.96%
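
The improvement columns in this letter use two conventions. The hackbench
and schbench scores are lower-is-better, while the MySQL scores appear to
be higher-is-better (an inference from the sign of the improvement column,
not something stated explicitly). A minimal sketch of the conversion, with
a hypothetical helper name:

```python
def improvement_pct(patched, lower_is_better):
    """Percent improvement of a baseline-normalized score (baseline = 1.00).

    Positive means the patched kernel is better under either convention.
    """
    delta = (1.0 - patched) if lower_is_better else (patched - 1.0)
    return round(delta * 100.0, 2)


# hackbench threads-pipe-8 2-groups (lower is better)
print(improvement_pct(0.528, lower_is_better=True))
# hackbench threads-pipe-4 2-groups (lower is better, a regression)
print(improvement_pct(1.229, lower_is_better=True))
# MySQL delete at 64 threads (assumed higher is better)
print(improvement_pct(4.601490272, lower_is_better=False))
```

One consequence of the two conventions is that the percentages are not
directly comparable across benchmarks: a lower-is-better improvement can
never exceed 100%, whereas a higher-is-better one is unbounded.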

In the MySQL workload, the baseline kernel may make conflicting
placement decisions between cache-aware scheduling and NUMA balancing.
NUMA balancing can select a preferred node that differs from the one
implied by cache-aware scheduling, disrupting task aggregation even
when the workload would otherwise fit within a single LLC domain. This
explains the performance gains observed even at low thread counts
such as 4 and 8 threads.

For workloads whose thread count exceeds the capacity of a single LLC
domain, the patched kernel continues to deliver performance
improvements by expanding task aggregation to higher scheduling domains
while maintaining NUMA affinity. As the workload grows further and the
aggregation scope reaches its effective limit, the performance gains
eventually plateau.

The delete workload is write-intensive, making it especially
sensitive to cross-domain cache-coherence overhead. At 64 threads,
cache-aware scheduling in the baseline kernel scatters tasks
broadly. Each write then triggers cacheline invalidations that
propagate across NUMA domains, and this coherence traffic dominates
execution time. In contrast, the patched kernel aggregates tasks
onto fewer NUMA nodes, which removes most of the cross-domain
invalidation traffic and delivers a disproportionate speedup.

Testing on additional platforms, including Intel and AMD, will be
conducted later.

Jianyong Wu (8):
  sched/topo: Add some llc related helpers
  sched/fair: Introduce helpers for cross-domain migration decisions
  sched/fair: Introduce rq affinity gain calculation for migration
    selection
  sched/fair: Pick optimal src rq/group using affinity promotion metric
  sched/fair: Drop prefer_sibling restriction for llc_balance
  sched/fair: Judge migration eligibility via NUMA-wide
  sched: Let sched cache take precedence over NUMA balancing
  sched/debug: Print task preferred LLC for scheduler debugging

 include/linux/topology.h |   5 +
 kernel/sched/debug.c     |  28 +++-
 kernel/sched/fair.c      | 326 ++++++++++++++++++++++++++++++---------
 kernel/sched/sched.h     |   1 +
 kernel/sched/topology.c  |  58 +++++++
 5 files changed, 345 insertions(+), 73 deletions(-)

--
2.34.1