Re: [PATCH v4 1/2] sched/cache: Reduce the overhead of task_cache_work by only scan the visited cpus

From: Luo Gengkun

Date: Thu Jun 25 2026 - 08:46:51 EST

On 2026/6/20 14:27, Chen, Yu C wrote:

On 6/19/2026 2:54 PM, K Prateek Nayak wrote:

Hello Luo,

On 6/18/2026 12:12 PM, Luo Gengkun wrote:

The overhead of task_cache_work() is high, especially in multi-NUMA
systems. Currently, task_cache_work() tries to find the pref_llc by
scanning all CPUs in the system. However, most of these scans are
meaningless, such as those for CPUs that have never been visited or were
accessed a long time ago.

To address this problem, introduce visited_cpus to track the visited CPUs
and evict them once they have not been accessed for a duration exceeding
llc_epoch_affinity_timeout. With this patch, get_scan_cpumasks() is no
longer needed and is therefore removed.
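
To make the idea concrete, here is a minimal user-space sketch of the tracking-and-eviction scheme described above. It is not the patch code: struct mm_visit_stat, mark_visited(), scan_visited(), NR_CPUS and EPOCH_TIMEOUT are hypothetical names, a single 64-bit word stands in for a cpumask, and the timeout value is an arbitrary stand-in for llc_epoch_affinity_timeout.

```c
#include <assert.h>
#include <stdint.h>

#define NR_CPUS        64   /* hypothetical: small enough to fit one 64-bit mask */
#define EPOCH_TIMEOUT  5    /* hypothetical stand-in for llc_epoch_affinity_timeout */

struct mm_visit_stat {
    uint64_t visited;               /* bit n set: CPU n ran this mm recently */
    uint64_t last_epoch[NR_CPUS];   /* epoch of the most recent run on each CPU */
};

/* Record that the mm ran on @cpu during @epoch. */
static void mark_visited(struct mm_visit_stat *st, int cpu, uint64_t epoch)
{
    assert(cpu >= 0 && cpu < NR_CPUS);
    st->visited |= UINT64_C(1) << cpu;
    st->last_epoch[cpu] = epoch;
}

/*
 * Walk only the visited CPUs at epoch @now. CPUs idle for longer than
 * EPOCH_TIMEOUT are evicted from the mask. Returns the number of CPUs
 * that remain and would actually be scanned.
 */
static int scan_visited(struct mm_visit_stat *st, uint64_t now)
{
    int scanned = 0;

    for (int cpu = 0; cpu < NR_CPUS; cpu++) {
        uint64_t bit = UINT64_C(1) << cpu;

        if (!(st->visited & bit))
            continue;               /* never visited, or already evicted */

        assert(now >= st->last_epoch[cpu]);
        if (now - st->last_epoch[cpu] > EPOCH_TIMEOUT) {
            st->visited &= ~bit;    /* stale: drop it from future scans */
            continue;
        }
        scanned++;
    }
    return scanned;
}
```

A caller would invoke mark_visited() whenever the task runs on a CPU and scan_visited() from the periodic work, so the scan cost follows the number of recently used CPUs rather than the size of the machine.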

Please include performance numbers here.

The hackbench data is shown below:

echo NO_SC_VISIT > /sys/kernel/debug/sched/features
echo SC_NODE > /sys/kernel/debug/sched/features
hackbench-186777 [257] ..... 73726.150430: sched_cache_scan: comm=hackbench pid=186777 scan=384
hackbench-186776 [067] ..... 73726.160434: sched_cache_scan: comm=hackbench pid=186776 scan=384
hackbench-186771 [064] ..... 73726.170435: sched_cache_scan: comm=hackbench pid=186771 scan=384
hackbench-186771 [064] ..... 73726.180429: sched_cache_scan: comm=hackbench pid=186771 scan=384

echo SC_VISIT > /sys/kernel/debug/sched/features
echo NO_SC_NODE > /sys/kernel/debug/sched/features
hackbench-149213 [254] ..... 69412.500073: sched_cache_scan: comm=hackbench pid=149213 scan=8
hackbench-149213 [254] ..... 69412.510073: sched_cache_scan: comm=hackbench pid=149213 scan=8
hackbench-149213 [254] ..... 69412.520076: sched_cache_scan: comm=hackbench pid=149213 scan=8
hackbench-149212 [063] ..... 69412.530077: sched_cache_scan: comm=hackbench pid=149212 scan=8
hackbench-149213 [254] ..... 69412.540074: sched_cache_scan: comm=hackbench pid=149213 scan=8
hackbench-149217 [250] ..... 69412.550073: sched_cache_scan: comm=hackbench pid=149217 scan=8

[root@localhost tracing]# taskset -pc 149217
pid 149217's current affinity list: 0-383

The data above demonstrates that this patch can effectively reduce the number of CPUs
that need to be scanned.

I'm not convinced by the amount of benefits this brings. For hackbench,
the only stable improvement I can see from the cover letter is:

threads 1 4 | 27.758 (1.64%) | 25.711 (1.32%) | 7.37% | IMPROVED

For the thread-mode testing of hackbench, the data provided in the cover letter
aims to demonstrate that this patch does not introduce any performance
degradation. Since there are only a few processes involved in thread-mode, the
overhead incurred by task_cache_work remains negligible. This patch will show an
improvement when multiple processes exist on the system.

Since you mention Redis may benefit from this, do you actually have
numbers for Redis, or any other real world workload on your system?

My previous evaluations were conducted on a Kunpeng platform. The results
are shown below (A total of 384 Redis instances are deployed here):

valkey-benchmark rps | baseline        | schedcache                | schedcache_visit
                     | avg latency(ms) | avg latency(ms) | DIFF(%) | avg latency(ms) | DIFF(%)
---------------------+-----------------+-----------------+---------+-----------------+--------
400000               | 0.37            | 0.496           | -34.1%  | 0.299           | +19.18%

However, over the past few days, I have also carried out the testing on an AMD machine.
The test machine is equipped with dual AMD EPYC 9654 processors, and with SMT
enabled, the system provides a total of 384 CPUs.

The lscpu output is as follows:

Caches (sum of all):
L1d: 6 MiB (192 instances)
L1i: 6 MiB (192 instances)
L2: 192 MiB (192 instances)
L3: 768 MiB (24 instances)
NUMA:
NUMA node(s): 4
NUMA node0 CPU(s): 0-47,192-239
NUMA node1 CPU(s): 48-95,240-287
NUMA node2 CPU(s): 96-143,288-335
NUMA node3 CPU(s): 144-191,336-383

The Redis version used is 7.0.15, and the evaluation was performed via valkey-benchmark.
The benchmarks were executed on the same machine with a strict NUMA topology binding:
redis-server instances were bound to NUMA nodes 0 and 1, while the valkey-benchmark
processes were bound to NUMA nodes 2 and 3. A total of 192 Redis instances were deployed.

[root@localhost redis]# ps -ef | grep redis-server | wc -l
193

The tracing data is pasted as follows:

redis-server-55831 [262] ..... 67888.162975: sched_cache_scan: comm=redis-server pid=55831 scan=9
redis-server-56691 [012] ..... 67888.162975: sched_cache_scan: comm=redis-server pid=56691 scan=14
redis-server-56651 [022] ..... 67888.162975: sched_cache_scan: comm=redis-server pid=56651 scan=8
valkey-benchmar-145367 [200] ..... 67888.162976: sched_cache_scan: comm=valkey-benchmar pid=145367 scan=16
valkey-benchmar-144397 [006] ..... 67888.162977: sched_cache_scan: comm=valkey-benchmar pid=144397 scan=16
valkey-benchmar-145534 [284] ..... 67888.162977: sched_cache_scan: comm=valkey-benchmar pid=145534 scan=16
valkey-benchmar-145253 [032] ..... 67888.162978: sched_cache_scan: comm=valkey-benchmar pid=145253 scan=16

[root@localhost tracing]# taskset -pc 55831
pid 55831's current affinity list: 0-95,192-287

I conducted benchmarks under various RPS (Requests Per Second) loads, and the P99
latency results are listed below.

valkey-benchmark rps | baseline        | schedcache                | schedcache_visit
                     | p99 latency(ms) | p99 latency(ms) | DIFF(%) | p99 latency(ms) | DIFF(%)
---------------------+-----------------+-----------------+---------+-----------------+--------
200000               | 0.26            | 0.383           | -47.3%  | 0.264           | -1.5%
300000               | 0.343           | 0.475           | -38.4%  | 0.35            | -2.0%
400000               | 0.445           | 0.567           | -27.4%  | 0.453           | -1.7%
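
For anyone reading the DIFF(%) columns: the figures are consistent with a relative change against the baseline latency, where a negative value means the latency got worse. The helper below is only a sketch of that assumed convention; latency_diff_pct() is a hypothetical name and the formula is inferred from the table, not taken from the test scripts.

```c
/*
 * Relative latency change versus the baseline, in percent.
 * Assumed convention: negative means the measured latency is higher
 * (worse) than the baseline, positive means it is lower (better).
 */
static double latency_diff_pct(double baseline_ms, double measured_ms)
{
    return (baseline_ms - measured_ms) / baseline_ms * 100.0;
}
```

For the 200000 rps row, latency_diff_pct(0.26, 0.383) gives roughly -47.3 and latency_diff_pct(0.26, 0.264) gives roughly -1.5, matching the table.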

Additionally, the output of perf top -e cycles:k highlights the overhead incurred
by task_cache_work:

valkey-benchmark rps | schedcache | schedcache_visit
----------------------+-------------------------------------------+---------------------------------------
200000 | 1.12% [kernel] [k] task_cache_work | 0.02% [kernel] [k] task_cache_work
300000 | 0.92% [kernel] [k] task_cache_work | 0.02% [kernel] [k] task_cache_work
400000 | 0.82% [kernel] [k] task_cache_work | 0.02% [kernel] [k] task_cache_work

Although the performance data is not as good as that on Kunpeng, the schedcache_visit
version is still better than the original one.

I agree that metrics for Redis and other workloads would be quite helpful.
Gengkun seems to share the test data I sent over earlier, yet I haven’t observed
a noticeable difference in scores between current node-based scans and visit_cpu scans.

Hi Gengkun,
I wonder if you conducted the Redis tests on an AMD machine. If the Redis performance
figures were captured from that environment, it would be helpful if you could
share the corresponding dataset. Should there be no noticeable score variance,
could you verify whether the total scan count has been reduced?

[..snip..]

@@ -1635,11 +1637,21 @@ static inline void __update_mm_sched(struct rq *rq,
      }
}
-static unsigned long fraction_mm_sched(struct rq *rq,
-                       struct sched_cache_time *pcpu_sched)
+static unsigned long fraction_mm_sched(int cpu,
+                       struct mm_struct *mm)
{
+    struct sched_cache_time *pcpu_sched =
+        per_cpu_ptr(mm->sc_stat.pcpu_sched, cpu);
+    struct rq *rq = cpu_rq(cpu);
+
      guard(raw_spinlock_irqsave)(&rq->cpu_epoch_lock);
+    /* Skip the rq that has not been hit for a long time */
+    if ((rq->cpu_epoch - pcpu_sched->epoch_timeout) > llc_epoch_affinity_timeout) {
+        cpumask_clear_cpu(cpu, &mm->sc_stat.visited_cpus);

This makes me think your issue is more with pcpu_sched->runtime
not decaying fast enough?

If you haven't run in 50us, the pcpu_sched->runtime would already be
decayed by >> 5 but that is not enough?

If I understand correctly, Gengkun intends to reduce the number of
CPUs being scanned. Even if pcpu_sched->runtime has been decayed by a
factor of 32, the existing code may still check that CPU. With the
visited_cpus mask proposed here, I suppose those CPUs will be skipped?
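
The difference between the two mechanisms can be sketched in user space as follows. This assumes the per-CPU runtime is halved once per idle epoch (a right shift by the number of elapsed epochs) and that eviction follows the epoch comparison in the hunk above; decayed_runtime() and still_visited() are hypothetical helpers, not kernel functions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed decay model: the runtime is halved once per idle epoch. */
static uint64_t decayed_runtime(uint64_t runtime, unsigned int idle_epochs)
{
    /* Shifting a 64-bit value by 64 or more is undefined in C. */
    return idle_epochs >= 64 ? 0 : runtime >> idle_epochs;
}

/*
 * Eviction test modelled on the check added to fraction_mm_sched():
 * the CPU stays in the visited mask only while the idle time does not
 * exceed the timeout, regardless of any residual runtime.
 */
static bool still_visited(uint64_t cpu_epoch, uint64_t last_epoch,
                          uint64_t timeout)
{
    assert(cpu_epoch >= last_epoch);
    return cpu_epoch - last_epoch <= timeout;
}
```

Starting from a runtime of 1000000 (arbitrary units), five idle epochs leave 31250, which is small but nonzero, so a scan that walks every CPU still has to read that entry. The timeout check instead removes the CPU from the mask once the idle time exceeds the timeout, so later scans never reach it.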

Yes, this patch is designed to reduce the overhead of task_cache_work by
decreasing the number of CPUs that need to be scanned.

          for_each_cpu(cpu, cpus) {
              /* XXX sched_cluster_active */
@@ -1878,18 +1848,21 @@ static void task_cache_work(struct callback_head *work)
                  continue;
              for_each_cpu(i, sched_domain_span(sd)) {

What does your system topology look like? We visit all CPUs of LLC here
so even if a single CPU of LLC is set in visited, you'll still visit all
CPUs of LLC here ...

Yeah, maybe for_each_cpu_and(i, sched_domain_span(sd), cpus)

Agreed. The aforementioned tests were conducted based on the for_each_cpu_and version.
The evaluation code is shown below:
-    for_each_cpu(i, sched_domain_span(sd)) {
+    if (sched_feat(SC_VISIT)) {
+        cpumask_and(llc_cpus, sched_domain_span(sd), &mm->sc_stat.visited_cpus);
+        target_cpus = llc_cpus;
+    }
+    else
+        target_cpus = sched_domain_span(sd);
+
+    for_each_cpu(i, target_cpus) {
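
The effect of intersecting the LLC span with the visited mask can be illustrated with a small user-space sketch. A 64-bit word stands in for a cpumask here, and llc_scan_count() is a hypothetical helper, so this shows the masking idea only and is not the evaluation code itself.

```c
#include <stdint.h>

/*
 * Count how many CPUs a scan of one LLC would touch.
 * @llc_span:  bitmask of the CPUs in the LLC domain
 * @visited:   bitmask of the CPUs recently visited by the mm
 * @use_visit: nonzero to scan only (llc_span & visited), mirroring the
 *             SC_VISIT branch; zero to scan the whole LLC span
 */
static int llc_scan_count(uint64_t llc_span, uint64_t visited, int use_visit)
{
    uint64_t target = use_visit ? (llc_span & visited) : llc_span;
    int n = 0;

    while (target) {
        target &= target - 1;   /* clear the lowest set bit */
        n++;
    }
    return n;
}
```

With a 16-CPU LLC (0xFFFF) and three visited CPUs inside it, the masked scan touches 3 CPUs instead of 16; visited CPUs outside the LLC do not add to the count.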

thanks,
Gengkun

thanks,
Chenyu