Re: [PATCH v4 1/2] sched/cache: Reduce the overhead of task_cache_work by only scan the visisted cpus

From: Chen, Yu C

Date: Sat Jun 20 2026 - 02:28:06 EST

On 6/19/2026 2:54 PM, K Prateek Nayak wrote:

Hello Luo,

On 6/18/2026 12:12 PM, Luo Gengkun wrote:

The overhead of task_cache_work() is high, especially in multi-NUMA
systems. Currently, task_cache_work() tries to find the pref_llc by
scanning all CPUs in the system. However, most of these scans are
meaningless, such as those for CPUs that have never been visited or were
accessed a long time ago.

To address this problem, introduce visited_cpus to track the visited CPUs
and evict them once they have not been accessed for a duration exceeding
llc_epoch_affinity_timeout. With this patch, get_scan_cpumasks() is no
longer needed and is therefore removed.

Please include performance numbers here.

I'm not convinced by the amount of benefit this brings. For hackbench,
the only stable improvement I can see from the cover letter is:

threads 1 4 | 27.758 (1.64%) | 25.711 (1.32%) | 7.37% | IMPROVED

Since you mention Redis may benefit from this, do you actually have
numbers for Redis, or any other real world workload on your system?

I agree that numbers for Redis and other workloads would be quite helpful.
Gengkun seems to have shared the test data I sent over earlier, but I have not
observed a noticeable difference in scores between the current node-based scan
and the visited_cpus scan.

Hi Gengkun,
I wonder whether you ran the Redis tests on an AMD machine. If the Redis
performance figures were captured in that environment, it would be helpful if
you could share the corresponding data. If there is no noticeable difference in
scores, could you verify whether the total scan count has been reduced?

[..snip..]

@@ -1635,11 +1637,21 @@ static inline void __update_mm_sched(struct rq *rq,
}
}
-static unsigned long fraction_mm_sched(struct rq *rq,
- struct sched_cache_time *pcpu_sched)
+static unsigned long fraction_mm_sched(int cpu,
+ struct mm_struct *mm)
{
+ struct sched_cache_time *pcpu_sched =
+ per_cpu_ptr(mm->sc_stat.pcpu_sched, cpu);
+ struct rq *rq = cpu_rq(cpu);
+
guard(raw_spinlock_irqsave)(&rq->cpu_epoch_lock);
+ /* Skip the rq that has not been hit for a long time */
+ if ((rq->cpu_epoch - pcpu_sched->epoch_timeout) > llc_epoch_affinity_timeout) {
+ cpumask_clear_cpu(cpu, &mm->sc_stat.visited_cpus);

This makes me think your issue is more with pcpu_sched->runtime
not decaying fast enough?

If you haven't run in 50us, the pcpu_sched->runtime would already be
decayed by >> 5 but that is not enough?

If I understand correctly, Gengkun intends to reduce the number of CPUs
being scanned. Even if pcpu_sched->runtime has decayed by a factor of 32
(>> 5), the existing code may still check that CPU. With the visited_cpus
approach proposed here, I suppose those CPUs will be skipped?
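
To make the distinction concrete, here is a rough user-space model, not the
kernel code. Every model_* name and both constants are made up for
illustration; MODEL_TIMEOUT only stands in for llc_epoch_affinity_timeout, and
the single 64-bit word stands in for the visited_cpus cpumask. It shows that
decay alone shrinks the runtime but still leaves the CPU to be looked at,
while the timeout check drops the CPU from the set to scan:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MODEL_NR_CPUS     64  /* hypothetical: the model fits in one 64-bit word */
#define MODEL_DECAY_SHIFT 5   /* the ">> 5" decay mentioned in the thread */
#define MODEL_TIMEOUT     50  /* hypothetical stand-in for llc_epoch_affinity_timeout */

struct model_pcpu {
	uint64_t runtime;     /* accumulated runtime on this CPU */
	uint64_t last_epoch;  /* epoch of the most recent visit */
};

struct model_mm {
	struct model_pcpu pcpu[MODEL_NR_CPUS];
	uint64_t visited;     /* one bit per CPU, models visited_cpus */
};

/* Record that the mm ran on @cpu at epoch @now for @delta time units. */
static void model_account(struct model_mm *mm, int cpu, uint64_t now, uint64_t delta)
{
	assert(cpu >= 0 && cpu < MODEL_NR_CPUS);
	mm->pcpu[cpu].runtime += delta;
	mm->pcpu[cpu].last_epoch = now;
	mm->visited |= UINT64_C(1) << cpu;
}

/* Decay only: the value gets smaller, but the CPU must still be read. */
static uint64_t model_decayed_runtime(const struct model_mm *mm, int cpu)
{
	return mm->pcpu[cpu].runtime >> MODEL_DECAY_SHIFT;
}

/* Timeout check: stale CPUs are evicted and not scanned at all. */
static bool model_should_scan(struct model_mm *mm, int cpu, uint64_t now)
{
	uint64_t bit = UINT64_C(1) << cpu;

	if (!(mm->visited & bit))
		return false;
	if (now - mm->pcpu[cpu].last_epoch > MODEL_TIMEOUT) {
		mm->visited &= ~bit;  /* same idea as cpumask_clear_cpu() in the patch */
		return false;
	}
	return true;
}
```

In this model a CPU that was last visited more than MODEL_TIMEOUT epochs ago
costs nothing on later scans, whereas with decay alone its (small) runtime is
still read on every scan.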

for_each_cpu(cpu, cpus) {
/* XXX sched_cluster_active */
@@ -1878,18 +1848,21 @@ static void task_cache_work(struct callback_head *work)
continue;
for_each_cpu(i, sched_domain_span(sd)) {

What does your system topology look like? We visit all CPUs of LLC here
so even if a single CPU of LLC is set in visited, you'll still visit all
CPUs of LLC here ...

Yeah, maybe for_each_cpu_and(i, sched_domain_span(sd), cpus)
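
As a rough illustration of why the intersection matters, the sketch below is a
user-space model only. The model_* helpers are hypothetical, and 64-bit words
stand in for sched_domain_span(sd) and the visited mask. It counts how many
CPUs one LLC costs to scan when the whole span is walked versus when only the
visited CPUs in that span are walked:

```c
#include <assert.h>
#include <stdint.h>

/* Count set bits; each set bit models one CPU that the loop would touch. */
static unsigned int model_popcount64(uint64_t mask)
{
	unsigned int n = 0;

	while (mask) {
		mask &= mask - 1;  /* clear the lowest set bit */
		n++;
	}
	return n;
}

/*
 * Models for_each_cpu(i, sched_domain_span(sd)): once any CPU of the LLC
 * is in the visited mask, every CPU of the LLC is scanned.
 */
static unsigned int model_scan_llc_full(uint64_t llc_span, uint64_t visited)
{
	if (!(llc_span & visited))
		return 0;
	return model_popcount64(llc_span);
}

/*
 * Models for_each_cpu_and(i, sched_domain_span(sd), cpus): only CPUs that
 * are both in the LLC and in the visited mask are scanned.
 */
static unsigned int model_scan_llc_and(uint64_t llc_span, uint64_t visited)
{
	unsigned int scanned = model_popcount64(llc_span & visited);

	/* The intersection can never cost more than the full span. */
	assert(scanned <= model_popcount64(llc_span));
	return scanned;
}
```

With an 8-CPU LLC and two visited CPUs, the full walk touches 8 CPUs and the
intersected walk touches 2. Whether that shows up in benchmark scores is a
separate question, but it should be visible in the total scan count.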

thanks,
Chenyu