[Patch v4 00/22] Cache aware scheduling
From: Tim Chen
Date: Wed Apr 01 2026 - 17:46:57 EST
This patch series introduces infrastructure for cache-aware load
balancing, with the goal of co-locating tasks that share data within
the same Last Level Cache (LLC) domain. By improving cache locality,
the scheduler can reduce cache bouncing and cache misses, ultimately
improving data access efficiency. The design builds on the initial
prototype from Peter [1].
This initial implementation treats threads within the same process
as entities that are likely to share data. During load balancing, the
scheduler attempts to aggregate such threads onto the same LLC domain
whenever possible.
Most of the feedback received on v3 has been addressed. Some aspects
could be enhanced later after the basic cache-aware portion has landed:
There were discussions around grouping tasks using mechanisms other
than process membership. While we agree that more flexible grouping
is desirable, this series intentionally focuses on establishing basic
process-based grouping first, with alternative grouping mechanisms to
be explored in a follow-on series.
There was also discussion in v3 about performing cache-aware scheduling
in the task wakeup path. According to previous test results, performing
task aggregation in the wakeup path introduced task migration bouncing,
primarily because the wakeup path does not have up-to-date LLC load
information. That led to over-aggregation that needed to be corrected
later by load balancing. The load balancing path was therefore chosen
as the conservative place to perform task aggregation; the task wakeup
path will be investigated as a future enhancement.
Furthermore, there were also requests to make cache-aware scheduling
benefit systems with small LLCs. Peter suggested using an llc-mask
instead of a single llc value for preferences [2]. This could also be
implemented as a future enhancement.
The cache aware load balancing logic remains largely unchanged. The
significant changes in v4 are:
1. LLC ID management: LLC IDs are now allocated from a bitmap rather
than maintained as a static value.
2. Introduce a new patch [2/22] to limit the CPU scan span with
preferred NUMA node when NUMA balancing is enabled.
3. Tweaks to the load balance failure accounting: keeping a load
imbalance at low load, or not pulling a task from its preferred LLC,
is no longer counted as a balance failure.
Other changes are described in each patch.
Test results:
The patch series was applied and tested on v7.0-rc3.
Git tree can be found here:
https://github.com/timcchen1298/linux/tree/cache_aware_v4
The first test platform is a 2-socket Intel Sapphire Rapids with 30
cores per socket. DRAM interleaving is enabled in the BIOS, so it
essentially has one NUMA node with two last level caches. There are 60
CPUs associated with each last level cache.
The second test platform is an AMD Genoa. There are 4 nodes with 32 CPUs
per node. Each node has 2 CCXs, and each CCX has 16 CPUs.
hackbench/schbench/netperf/stream/stress-ng/chacha20 were launched
on these two platforms.
[TL;DR]
Sapphire Rapids:
hackbench shows significant improvement when the number of
active threads is below the capacity of an LLC.
schbench shows limited wakeup latency improvement.
ChaCha20-xiangshan (a RISC-V simulator) shows good throughput
improvement. No obvious difference was observed in the Hmean
of netperf/stream/stress-ng.
Genoa:
Significant improvement is observed in hackbench when
the number of active threads is lower than the number
of CPUs within one LLC. On v2, Aaron reported improvements
in hackbench/redis when the system is underloaded.
ChaCha20-xiangshan shows a large throughput improvement.
Phoronix tested v1 and reported good improvements in 30+
cases [3]. No obvious difference was observed in the Hmean
of netperf/stream/stress-ng.
Detail:
To conserve space, data that shows little difference from
the baseline is not presented.
Sapphire Rapids:
[hackbench pipe]
================
case load baseline(std%) compare%( std%)
threads-pipe-10 1-groups 1.00 ( 1.22) +26.09 ( 1.10)
threads-pipe-10 2-groups 1.00 ( 4.90) +22.88 ( 0.18)
threads-pipe-10 4-groups 1.00 ( 2.07) +9.00 ( 3.49)
threads-pipe-10 8-groups 1.00 ( 8.13) +3.45 ( 3.62)
threads-pipe-16 1-groups 1.00 ( 2.11) +26.30 ( 0.08)
threads-pipe-16 2-groups 1.00 ( 15.13) -1.77 ( 11.89)
threads-pipe-16 4-groups 1.00 ( 4.37) +0.58 ( 7.99)
threads-pipe-16 8-groups 1.00 ( 2.88) +2.71 ( 3.50)
threads-pipe-2 1-groups 1.00 ( 9.40) +22.07 ( 0.71)
threads-pipe-2 2-groups 1.00 ( 9.99) +18.01 ( 0.95)
threads-pipe-2 4-groups 1.00 ( 3.98) +24.66 ( 0.96)
threads-pipe-2 8-groups 1.00 ( 7.00) +21.83 ( 0.23)
threads-pipe-20 1-groups 1.00 ( 1.03) +28.84 ( 0.21)
threads-pipe-20 2-groups 1.00 ( 4.42) +31.90 ( 3.15)
threads-pipe-20 4-groups 1.00 ( 9.97) +4.56 ( 1.69)
threads-pipe-20 8-groups 1.00 ( 1.87) +1.25 ( 0.74)
threads-pipe-4 1-groups 1.00 ( 4.48) +25.67 ( 0.78)
threads-pipe-4 2-groups 1.00 ( 9.14) +4.91 ( 2.08)
threads-pipe-4 4-groups 1.00 ( 7.68) +19.36 ( 1.53)
threads-pipe-4 8-groups 1.00 ( 10.79) +7.20 ( 12.20)
threads-pipe-8 1-groups 1.00 ( 4.69) +21.93 ( 0.03)
threads-pipe-8 2-groups 1.00 ( 1.16) +25.29 ( 0.65)
threads-pipe-8 4-groups 1.00 ( 2.23) -1.27 ( 3.62)
threads-pipe-8 8-groups 1.00 ( 4.65) -3.08 ( 2.75)
Note: The default number of fds in hackbench is changed from 20 to various
values to ensure that threads fit within a single LLC, especially on AMD
systems. Take "threads-pipe-8, 2-groups" as an example: the number of fds
is 8, and 2 groups are created.
[schbench]
The 99th percentile wakeup latency shows some improvement when the
system is underloaded, while there is not much difference as system
utilization increases.
99th Wakeup Latencies Base (mean std) Compare (mean std) Change
--------------------------------------------------------------------------------
thread=2 9.00(0.00) 9.00(1.73) 0.00%
thread=4 7.33(0.58) 6.33(0.58) +13.64%
thread=8 9.00(0.00) 7.67(1.15) +14.78%
thread=16 8.67(0.58) 8.67(1.53) 0.00%
thread=32 9.00(0.00) 7.00(0.00) +22.22%
thread=64 9.33(0.58) 9.67(0.58) -3.64%
thread=128 12.00(0.00) 12.00(0.00) 0.00%
[chacha20]
baseline:
Host time spent: 67861ms
cache aware scheduling enabled:
Host time spent: 54441ms
Time reduced by about 20%
Genoa:
[hackbench pipe]
The default number of fds is 20, which exceeds the number of CPUs
in an LLC, so the number of fds is adjusted to 2, 4, 6, 8 and 20
respectively. Excluding results with large run-to-run variance,
a 10% ~ 50% improvement is observed when the system is underloaded:
[hackbench pipe]
================
case load baseline(std%) compare%( std%)
threads-pipe-2 1-groups 1.00 ( 2.89) +47.33 ( 1.20)
threads-pipe-2 2-groups 1.00 ( 3.88) +39.82 ( 0.61)
threads-pipe-2 4-groups 1.00 ( 8.76) +5.57 ( 13.10)
threads-pipe-20 1-groups 1.00 ( 4.61) +11.72 ( 1.06)
threads-pipe-20 2-groups 1.00 ( 6.18) +14.55 ( 1.47)
threads-pipe-20 4-groups 1.00 ( 2.99) +10.16 ( 4.49)
threads-pipe-4 1-groups 1.00 ( 4.23) +43.70 ( 2.14)
threads-pipe-4 2-groups 1.00 ( 3.68) +8.45 ( 4.04)
threads-pipe-4 4-groups 1.00 ( 17.72) +2.42 ( 1.14)
threads-pipe-6 1-groups 1.00 ( 3.10) +7.74 ( 3.83)
threads-pipe-6 2-groups 1.00 ( 3.42) +14.26 ( 4.53)
threads-pipe-6 4-groups 1.00 ( 10.34) +10.94 ( 7.12)
threads-pipe-8 1-groups 1.00 ( 4.21) +9.06 ( 4.43)
threads-pipe-8 2-groups 1.00 ( 1.88) +3.74 ( 0.58)
threads-pipe-8 4-groups 1.00 ( 2.78) +23.96 ( 1.18)
[chacha20]
baseline:
Host time spent: 54762ms
cache aware scheduling enabled:
Host time spent: 28295ms
Time reduced by 48%
[1] https://lore.kernel.org/all/cover.1760206683.git.tim.c.chen@xxxxxxxxxxxxxxx/
[2] https://lore.kernel.org/all/20260219165221.GM1395266@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/
[3] https://www.phoronix.com/review/cache-aware-scheduling-amd-turin
Change history:
**v4 Changes:**
1. Use a bitmap-based dynamic allocation mechanism for LLC IDs.
2. Introduce a new patch [2/22] to limit the CPU scan depth with
preferred NUMA node.
3. Keeping a load imbalance at low load, or not pulling a task from its
preferred LLC, is no longer counted as a balance failure.
4. Other changes from v3 are detailed in each patch's change log.
**v3 Changes:**
v3 link: https://lore.kernel.org/all/cover.1770760558.git.tim.c.chen@xxxxxxxxxxxxxxx/
1. Cache-aware scheduling is skipped after repeated load balance
failures (up to cache_nice_tries). This avoids repeatedly attempting
cache-aware migrations when no movable tasks prefer the destination
LLC.
2. The busiest runqueue is no longer sorted to select tasks that prefer
the destination LLC. This sorting was costly, and equivalent
behavior can be achieved by skipping tasks that do not prefer the
destination LLC during cache-aware migrations.
3. Accounting of the number of tasks preferring each LLC is now kept in
the lowest-level sched domain per CPU. This simplifies handling of
LLC resizing and changes in the number of LLC domains.
4. Other changes from v2 are detailed in each patch's change log.
**v2 Changes:**
v2 link: https://lore.kernel.org/all/cover.1764801860.git.tim.c.chen@xxxxxxxxxxxxxxx/
1. Align NUMA balancing and cache affinity by
prioritizing NUMA balancing when their decisions differ.
2. Dynamically resize per-LLC statistics structures based on the LLC
size.
3. Switch to a contiguous LLC-ID space so these IDs can be used
directly as array indices for LLC statistics.
4. Add clarification comments.
5. Add 3 debug patches (not meant for merging).
6. Other changes to address feedback from review of the v1 patch set
(see individual patch change log).
**v1**
v1 link: https://lore.kernel.org/all/cover.1760206683.git.tim.c.chen@xxxxxxxxxxxxxxx/
Chen Yu (10):
sched/cache: Limit the scan number of CPUs when calculating task
occupancy
sched/cache: Record per LLC utilization to guide cache aware
scheduling decisions
sched/cache: Introduce helper functions to enforce LLC migration
policy
sched/cache: Disable cache aware scheduling for processes with high
thread counts
sched/cache: Avoid cache-aware scheduling for memory-heavy processes
sched/cache: Enable cache aware scheduling for multi LLCs NUMA node
sched/cache: Allow the user space to turn on and off cache aware
scheduling
sched/cache: Add user control to adjust the aggressiveness of
cache-aware scheduling
-- DO NOT APPLY!!! -- sched/cache/debug: Display the per LLC occupancy
for each process via proc fs
-- DO NOT APPLY!!! -- sched/cache/debug: Add ftrace to track the load
balance statistics
Peter Zijlstra (Intel) (1):
sched/cache: Introduce infrastructure for cache-aware load balancing
Tim Chen (11):
sched/cache: Make LLC id continuous
sched/cache: Assign preferred LLC ID to processes
sched/cache: Track LLC-preferred tasks per runqueue
sched/cache: Introduce per CPU's tasks LLC preference counter
sched/cache: Calculate the percpu sd task LLC preference
sched/cache: Count tasks preferring destination LLC in a sched group
sched/cache: Check local_group only once in update_sg_lb_stats()
sched/cache: Prioritize tasks preferring destination LLC during
balancing
sched/cache: Add migrate_llc_task migration type for cache-aware
balancing
sched/cache: Handle moving single tasks to/from their preferred LLC
sched/cache: Respect LLC preference in task migration and detach
fs/proc/base.c | 31 +
include/linux/cacheinfo.h | 21 +-
include/linux/mm_types.h | 43 ++
include/linux/sched.h | 32 +
include/linux/sched/topology.h | 17 +
include/trace/events/sched.h | 140 ++++
init/Kconfig | 11 +
init/init_task.c | 3 +
kernel/fork.c | 6 +
kernel/sched/core.c | 13 +
kernel/sched/debug.c | 58 +-
kernel/sched/fair.c | 1180 +++++++++++++++++++++++++++++++-
kernel/sched/sched.h | 50 ++
kernel/sched/topology.c | 234 ++++++-
14 files changed, 1810 insertions(+), 29 deletions(-)
--
2.32.0