[PATCH v4 0/2] Cache aware scheduling: Reduce the overhead of task_cache_work

From: Luo Gengkun

Date: Thu Jun 18 2026 - 02:16:39 EST

Hi everyone,
The cache-aware scheduling patches have now been merged into mainline.
This series reduces the overhead of task_cache_work() by minimizing the
number of CPUs it scans, which yields significant performance gains in
multi-instance scenarios such as Redis. To facilitate testing, a debug
patch (patch 2, not intended to be applied) is included. Below are the
benchmark results based on hackbench:

Test steps:
echo NO_SC_VISIT > /sys/kernel/debug/sched/features
echo NO_SC_NODE > /sys/kernel/debug/sched/features
echo 0 > /sys/kernel/debug/sched/llc_balancing/enabled
./launch.sh hackbench baseline

echo NO_SC_VISIT > /sys/kernel/debug/sched/features
echo SC_NODE > /sys/kernel/debug/sched/features
echo 1 > /sys/kernel/debug/sched/llc_balancing/enabled
./launch.sh hackbench schedcache

echo SC_VISIT > /sys/kernel/debug/sched/features
echo NO_SC_NODE > /sys/kernel/debug/sched/features
echo 1 > /sys/kernel/debug/sched/llc_balancing/enabled
./launch.sh hackbench schedcache_visit
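
For convenience, the three configurations above can be wrapped in a small
helper. This is only a sketch: SCHED_DEBUG, apply_config and run_case are
hypothetical names introduced here for illustration, and launch.sh is the
benchmark wrapper already used in the steps above.

```shell
# Assumed location of the scheduler debugfs directory. Overridable so the
# helper can be dry-run against a scratch directory.
SCHED_DEBUG="${SCHED_DEBUG:-/sys/kernel/debug/sched}"

# apply_config VISIT_FEATURE NODE_FEATURE LLC_ENABLED
# Writes the two sched features and the llc_balancing switch, in the same
# order as the manual steps.
apply_config() {
    echo "$1" > "$SCHED_DEBUG/features" &&
    echo "$2" > "$SCHED_DEBUG/features" &&
    echo "$3" > "$SCHED_DEBUG/llc_balancing/enabled"
}

# run_case NAME VISIT_FEATURE NODE_FEATURE LLC_ENABLED
# Applies one configuration, then runs hackbench under the given name.
run_case() {
    case_name=$1
    shift
    apply_config "$@" && ./launch.sh hackbench "$case_name"
}

# The three configurations from the test steps:
#   run_case baseline         NO_SC_VISIT NO_SC_NODE 0
#   run_case schedcache       NO_SC_VISIT SC_NODE    1
#   run_case schedcache_visit SC_VISIT    NO_SC_NODE 1
```

Nothing is executed when the block is sourced; the run_case lines are left
as comments so each run can be started deliberately.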

Test results:
./launch.sh compare hackbench baseline schedcache
=========================================
Hackbench Comparison: baseline vs schedcache
=========================================
MODE       GROUPS  FDS | baseline(std)      | schedcache(std)    | DIFF(%)    | VERDICT
---------- ------ -----+--------------------+--------------------+------------+-----------
threads         1   10 | 113.200 (4.22%)    |  67.300 (1.32%)    |  40.55%    | IMPROVED
threads         1    2 |  16.555 (4.11%)    |  11.020 (1.66%)    |  33.43%    | IMPROVED
threads         1   20 | 250.774 (1.26%)    | 265.026 (5.44%)    |  -5.68%    | REGRESSED
threads         1    4 |  42.117 (1.44%)    |  27.758 (1.64%)    |  34.09%    | IMPROVED
threads         1    6 |  65.140 (4.31%)    |  39.182 (1.38%)    |  39.85%    | IMPROVED
threads         1    8 |  84.286 (1.29%)    |  53.721 (1.58%)    |  36.26%    | IMPROVED
threads         2   10 | 122.592 (0.44%)    | 113.365 (4.93%)    |   7.53%    | IMPROVED
threads         2    2 |  17.702 (4.09%)    |  10.473 (0.42%)    |  40.84%    | IMPROVED
threads         2   20 | 336.457 (1.77%)    | 314.108 (1.51%)    |   6.64%    | IMPROVED
threads         2    4 |  43.989 (0.88%)    |  27.067 (3.38%)    |  38.47%    | IMPROVED
threads         2    6 |  69.322 (0.85%)    |  41.707 (4.19%)    |  39.84%    | IMPROVED
threads         2    8 | 103.767 (1.81%)    |  58.518 (3.00%)    |  43.61%    | IMPROVED
threads         4   10 | 148.882 (3.56%)    | 149.449 (1.06%)    |  -0.38%    | REGRESSED
threads         4    2 |  18.909 (2.96%)    |  11.063 (2.08%)    |  41.49%    | IMPROVED
threads         4   20 | 724.943 (2.14%)    | 631.222 (3.92%)    |  12.93%    | IMPROVED
threads         4    4 |  48.191 (1.91%)    |  27.352 (5.35%)    |  43.24%    | IMPROVED
threads         4    6 |  79.725 (3.84%)    |  78.732 (4.10%)    |   1.25%    | IMPROVED
threads         4    8 | 108.768 (1.36%)    | 105.928 (1.65%)    |   2.61%    | IMPROVED

=========================================
Hackbench Comparison: schedcache vs schedcache_visit
=========================================
MODE       GROUPS  FDS | schedcache(std)    |schedcache_visit(std) | DIFF(%) | VERDICT
---------- ------ -----+--------------------+----------------------+---------+-----------
threads         1   10 |  67.300 (1.32%)    |  67.014 (0.96%)      |   0.42% | IMPROVED
threads         1    2 |  11.020 (1.66%)    |  10.557 (1.46%)      |   4.20% | IMPROVED
threads         1   20 | 265.026 (5.44%)    | 212.366 (16.32%)     |  19.87% | IMPROVED
threads         1    4 |  27.758 (1.64%)    |  25.711 (1.32%)      |   7.37% | IMPROVED
threads         1    6 |  39.182 (1.38%)    |  38.914 (0.34%)      |   0.68% | IMPROVED
threads         1    8 |  53.721 (1.58%)    |  52.889 (0.27%)      |   1.55% | IMPROVED
threads         2   10 | 121.203 (6.99%)    | 124.254 (1.38%)      |  -2.52% | REGRESSED
threads         2    2 |  10.473 (0.42%)    |  11.206 (5.91%)      |  -7.00% | REGRESSED
threads         2   20 | 314.108 (1.51%)    | 301.754 (1.95%)      |   3.93% | IMPROVED
threads         2    4 |  27.067 (3.38%)    |  28.028 (2.01%)      |  -3.55% | REGRESSED
threads         2    6 |  41.707 (4.19%)    |  42.149 (3.35%)      |  -1.06% | REGRESSED
threads         2    8 |  58.518 (3.00%)    |  57.133 (4.39%)      |   2.37% | IMPROVED
threads         4   10 | 149.449 (1.06%)    | 141.407 (0.08%)      |   5.38% | IMPROVED
threads         4    2 |  11.063 (2.08%)    |  11.360 (5.85%)      |  -2.68% | REGRESSED
threads         4   20 | 631.222 (3.92%)    | 622.780 (2.49%)      |   1.34% | IMPROVED
threads         4    4 |  27.352 (5.35%)    |  27.947 (5.37%)      |  -2.18% | REGRESSED
threads         4    6 |  78.732 (4.10%)    |  73.911 (0.70%)      |   6.12% | IMPROVED
threads         4    8 | 105.928 (1.65%)    | 107.535 (3.29%)      |  -1.52% | REGRESSED
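
For readers checking the numbers: the DIFF(%) column appears consistent
with (old - new) / old * 100, where a positive value means the second
configuration took less time. The sketch below reproduces that arithmetic.
It is an assumption inferred from the rows, not taken from launch.sh;
diff_pct and verdict are hypothetical helper names, and the handling of an
exact tie is a guess.

```shell
# diff_pct OLD NEW: percentage reduction of NEW relative to OLD.
# Positive means NEW is lower (faster) than OLD.
diff_pct() {
    LC_ALL=C awk -v old="$1" -v new="$2" \
        'BEGIN { printf "%.2f\n", (old - new) / old * 100 }'
}

# verdict OLD NEW: label in the style of the VERDICT column.
# Assumption: anything that is not strictly lower counts as REGRESSED.
verdict() {
    LC_ALL=C awk -v old="$1" -v new="$2" \
        'BEGIN { print ((new + 0) < (old + 0)) ? "IMPROVED" : "REGRESSED" }'
}

diff_pct 113.200 67.300    # threads, 1 group, 10 fds: prints 40.55
verdict 250.774 265.026    # threads, 1 group, 20 fds: prints REGRESSED
```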

---
Change history
**v4 Changes:**
1. Rebase onto master.
2. epoch_timeout is introduced to evict expired CPUs instead of relying on
epoch, because epoch is refreshed periodically by invocations of
fraction_mm_sched().
3. Move the increment of nr_running before fraction_mm_sched().
4. Remove the redundant 'work->next' reset at the end of task_cache_work().
5. Add a debug patch that reports the number of CPUs scanned, to
demonstrate the benefit of this optimization.

Link to v3: https://lore.kernel.org/all/20260423085414.1389749-1-luogengkun2@xxxxxxxxxx/

**v3 Changes:**
1. Remove the static key and enable this feature by default.
2. Reuse llc_epoch_affinity_timeout instead of introducing
llc_epoch_visited_timeout.
3. Move the calculation of rq->cpu_epoch - pcpu_sched->epoch into
fraction_mm_sched() to avoid race between task_cache_work() and
__update_mm_sched().
4. Reset work->next at the end of task_cache_work() to prevent concurrent
executions by multiple threads within the same process.

Link to v2: https://lore.kernel.org/all/20260414150745.225416-1-luogengkun2@xxxxxxxxxx/

**v2 Changes:**
1. Added a pre-check before set/clear visited_cpus to avoid C2C overhead.
2. Optimized llc_epoch_visited_timeout by using a static key to minimize overhead.
---

Link to v1: https://lore.kernel.org/all/f2488085-4b52-491d-84be-d30d43954381@xxxxxxxxxx/
---

Luo Gengkun (2):
sched/cache: Reduce the overhead of task_cache_work by only scanning the
visited CPUs
-- DO NOT APPLY!!! -- sched/cache/debug: Add trace event and sched
feature to track scan cost

include/linux/sched.h | 2 ++
include/trace/events/sched.h | 21 ++++++++++++++++
kernel/sched/fair.c | 46 +++++++++++++++++++++++++++++-------
kernel/sched/features.h | 2 ++
4 files changed, 62 insertions(+), 9 deletions(-)

--
2.34.1