Re: [PATCH v4 1/2] sched/cache: Reduce the overhead of task_cache_work by only scan the visisted cpus

From: K Prateek Nayak

Date: Fri Jun 19 2026 - 02:54:57 EST

Hello Luo,

On 6/18/2026 12:12 PM, Luo Gengkun wrote:
> The overhead of task_cache_work() is high, especially in multi-NUMA
> systems. Currently, task_cache_work() tries to find the pref_llc by
> scanning all CPUs in the system. However, most of these scans are
> meaningless, such as those for CPUs that have never been visited or were
> accessed a long time ago.
>
> To address this problem, introduce visited_cpus to track the visited CPUs
> and evict them once they have not been accessed for a duration exceeding
> llc_epoch_affinity_timeout. With this patch, get_scan_cpumasks() is no
> longer need and is therefore removed.

Please include performance numbers here.

I'm not convinced by the size of the benefit this brings. For hackbench,
the only stable improvement I can see in the cover letter is:

threads 1 4 | 27.758 (1.64%) | 25.711 (1.32%) | 7.37% | IMPROVED

Since you mention Redis may benefit from this, do you actually have
numbers for Redis, or any other real world workload on your system?

[..snip..]

> @@ -1635,11 +1637,21 @@ static inline void __update_mm_sched(struct rq *rq,
> }
> }
>
> -static unsigned long fraction_mm_sched(struct rq *rq,
> - struct sched_cache_time *pcpu_sched)
> +static unsigned long fraction_mm_sched(int cpu,
> + struct mm_struct *mm)
> {
> + struct sched_cache_time *pcpu_sched =
> + per_cpu_ptr(mm->sc_stat.pcpu_sched, cpu);
> + struct rq *rq = cpu_rq(cpu);
> +
> guard(raw_spinlock_irqsave)(&rq->cpu_epoch_lock);
>
> + /* Skip the rq that has not been hit for a long time */
> + if ((rq->cpu_epoch - pcpu_sched->epoch_timeout) > llc_epoch_affinity_timeout) {
> + cpumask_clear_cpu(cpu, &mm->sc_stat.visited_cpus);

This makes me think your issue is more with pcpu_sched->runtime
not decaying fast enough?

If the mm hasn't run on that CPU in 50us, pcpu_sched->runtime would
already have been decayed by >> 5. Is that not enough?
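
To put a rough number on that, here is a small userspace model (not
kernel code). It assumes the runtime is halved once per missed epoch and
that the timeout is 5 epochs, which is what the ">> 5" above implies.
TIMEOUT_EPOCHS, decayed_runtime() and residual_percent_at_timeout() are
hypothetical names made up for this illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed value: a timeout of 5 epochs, matching the ">> 5" above. */
#define TIMEOUT_EPOCHS 5

static_assert(TIMEOUT_EPOCHS < 64, "shift count must be below the width of uint64_t");

/* Model of per-epoch decay: runtime is halved once per missed epoch. */
static uint64_t decayed_runtime(uint64_t runtime, unsigned int missed_epochs)
{
	/* Shifting by >= 64 is undefined in C, so treat it as fully decayed. */
	if (missed_epochs >= 64)
		return 0;
	return runtime >> missed_epochs;
}

/*
 * Percentage of the original runtime (rounded down) that is still
 * accounted when the timeout is reached. Meant for small inputs only;
 * the multiplication is not protected against overflow.
 */
static unsigned int residual_percent_at_timeout(uint64_t runtime)
{
	if (!runtime)
		return 0;
	return (unsigned int)(decayed_runtime(runtime, TIMEOUT_EPOCHS) * 100 / runtime);
}
```

Under these assumptions roughly 3% of the original runtime is left when
the timeout hits, so the early return mostly zeroes a value that has
already decayed to a small fraction.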

> + return 0;
> + }
> +
> __update_mm_sched(rq, pcpu_sched);
>
> /*

> static inline void update_avg_scale(u64 *avg, u64 sample)
> {
> int factor = per_cpu(sd_llc_size, raw_smp_processor_id());
> @@ -1866,7 +1836,7 @@ static void task_cache_work(struct callback_head *work)
> scoped_guard (cpus_read_lock) {
> guard(rcu)();
>
> - get_scan_cpumasks(cpus, p);
> + cpumask_and(cpus, cpu_online_mask, &mm->sc_stat.visited_cpus);

Doesn't this violate the NUMA_BALANCING constraints? Honoring those was
the whole point of get_scan_cpumasks().

>
> for_each_cpu(cpu, cpus) {
> /* XXX sched_cluster_active */
> @@ -1878,18 +1848,21 @@ static void task_cache_work(struct callback_head *work)
> continue;
>
> for_each_cpu(i, sched_domain_span(sd)) {

What does your system topology look like? We visit all CPUs of the LLC
here, so even if only a single CPU of an LLC is set in visited_cpus,
you'll still visit every CPU of that LLC ...

> - occ = fraction_mm_sched(cpu_rq(i),
> - per_cpu_ptr(mm->sc_stat.pcpu_sched, i));
> + cur = rcu_dereference_all(cpu_rq(i)->curr);
> + if (cur && !(cur->flags & (PF_EXITING | PF_KTHREAD)) &&
> + cur->mm == mm)
> + nr_running++;
> +
> + occ = fraction_mm_sched(i, mm);

... and all this does is return 0 if the mm hasn't run on that CPU for
over 50us. All of this suggests that a more aggressive decay is what is
actually helping you.
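
To illustrate the point about the LLC span, a rough userspace model (not
kernel code) of how many CPUs the inner loop touches for a given visited
mask. It assumes LLCs are contiguous blocks of llc_size CPUs and that an
LLC is dropped from the outer mask once it has been scanned.
NR_CPUS_MODEL and count_scanned_cpus() are hypothetical names made up
for this illustration.

```c
#include <assert.h>
#include <stdint.h>

#define NR_CPUS_MODEL 16

static_assert(NR_CPUS_MODEL <= 64, "masks in this model are a single uint64_t");

/*
 * Count how many CPUs the inner loop would inspect for a visited mask.
 * Assumptions: LLCs are contiguous blocks of llc_size CPUs, and each
 * LLC is scanned at most once per pass.
 */
static unsigned int count_scanned_cpus(uint64_t visited, unsigned int llc_size)
{
	uint64_t pending = visited;
	unsigned int cpu, scanned = 0;

	if (llc_size == 0 || llc_size > NR_CPUS_MODEL)
		return 0;

	for (cpu = 0; cpu < NR_CPUS_MODEL; cpu++) {
		unsigned int first, i;

		if (!(pending & (1ULL << cpu)))
			continue;

		/* First CPU of the LLC that contains this visited CPU. */
		first = cpu - (cpu % llc_size);
		for (i = first; i < first + llc_size && i < NR_CPUS_MODEL; i++) {
			scanned++;			/* every CPU in the span is inspected */
			pending &= ~(1ULL << i);	/* do not rescan this LLC */
		}
	}
	return scanned;
}
```

With 4 LLCs of 4 CPUs each, a visited mask of 0x1111 (one CPU per LLC)
still scans all 16 CPUs in this model, the same as a full mask. The
mask only saves work when entire LLCs have never been visited.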

> + if (occ == 0)
> + continue;
> +
--
Thanks and Regards,
Prateek