Re: [RFC patch v3 01/20] sched: Cache aware load-balancing

From: Chen, Yu C
Date: Thu Jun 26 2025 - 09:33:35 EST

Next message: syzbot: "[syzbot] [hams?] possible deadlock in nr_rt_ioctl (2)"
Previous message: Sean Christopherson: "Re: [tip: x86/urgent] x86/traps: Initialize DR7 by writing its architectural reset value"
In reply to: Jianyong Wu: "Re: [RFC patch v3 01/20] sched: Cache aware load-balancing"
Next in thread: Tim Chen: "Re: [RFC patch v3 01/20] sched: Cache aware load-balancing"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On 6/26/2025 8:23 PM, Jianyong Wu wrote:

Hi Tim,

On 6/19/2025 2:27 AM, Tim Chen wrote:

From: Peter Zijlstra <peterz@xxxxxxxxxxxxx>

Hi all,

One of the many things on the eternal todo list has been finishing the
below hackery.

It is an attempt at modelling cache affinity -- and while the patch
really only targets LLC, it could very well be extended to also apply to
clusters (L2). Specifically any case of multiple cache domains inside a
node.

Anyway, I wrote this about a year ago, and I mentioned this at the
recent OSPM conf where Gautham and Prateek expressed interest in playing
with this code.

So here goes, very rough and largely unproven code ahead :-)

It applies to current tip/master, but I know it will fail the __percpu
validation that sits in -next, although that shouldn't be terribly hard
to fix up.

As is, it only computes a CPU inside the LLC that has the highest recent
runtime, this CPU is then used in the wake-up path to steer towards this
LLC and in task_hot() to limit migrations away from it.

More elaborate things could be done, notably there is an XXX in there
somewhere about finding the best LLC inside a NODE (interaction with
NUMA_BALANCING).

Signed-off-by: Peter Zijlstra (Intel) <peterz@xxxxxxxxxxxxx>
---
include/linux/mm_types.h | 44 ++++++
include/linux/sched.h    |   4 +
init/Kconfig             |   4 +
kernel/fork.c            |   5 +
kernel/sched/core.c      | 13 +-
kernel/sched/fair.c      | 330 +++++++++++++++++++++++++++++++++++++--
kernel/sched/sched.h     |   8 +
7 files changed, 388 insertions(+), 20 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 56d07edd01f9..013291c6aaa2 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -893,6 +893,12 @@ struct mm_cid {
};
#endif

+static void task_cache_work(struct callback_head *work)
+{
+    struct task_struct *p = current;
+    struct mm_struct *mm = p->mm;
+    unsigned long m_a_occ = 0;
+    int cpu, m_a_cpu = -1;
+    cpumask_var_t cpus;
+
+    WARN_ON_ONCE(work != &p->cache_work);
+
+    work->next = work;
+
+    if (p->flags & PF_EXITING)
+        return;
+
+    if (!alloc_cpumask_var(&cpus, GFP_KERNEL))
+        return;
+
+    scoped_guard (cpus_read_lock) {
+        cpumask_copy(cpus, cpu_online_mask);
+
+        for_each_cpu(cpu, cpus) {
+            /* XXX sched_cluster_active */
+            struct sched_domain *sd = per_cpu(sd_llc, cpu);
+            unsigned long occ, m_occ = 0, a_occ = 0;
+            int m_cpu = -1, nr = 0, i;
+
+            for_each_cpu(i, sched_domain_span(sd)) {
+                occ = fraction_mm_sched(cpu_rq(i),
+                            per_cpu_ptr(mm->pcpu_sched, i));
+                a_occ += occ;
+                if (occ > m_occ) {
+                    m_occ = occ;
+                    m_cpu = i;
+                }
+                nr++;
+                trace_printk("(%d) occ: %ld m_occ: %ld m_cpu: %d nr: %d\n",
+                         per_cpu(sd_llc_id, i), occ, m_occ, m_cpu, nr);
+            }
+
+            a_occ /= nr;
+            if (a_occ > m_a_occ) {
+                m_a_occ = a_occ;
+                m_a_cpu = m_cpu;
+            }
+
+            trace_printk("(%d) a_occ: %ld m_a_occ: %ld\n",
+                     per_cpu(sd_llc_id, cpu), a_occ, m_a_occ);
+
+            for_each_cpu(i, sched_domain_span(sd)) {
+                /* XXX threshold ? */
+                per_cpu_ptr(mm->pcpu_sched, i)->occ = a_occ;
+            }
+
+            cpumask_andnot(cpus, cpus, sched_domain_span(sd));
+        }
+    }
+
+    /*
+     * If the max average cache occupancy is 'small' we don't care.
+     */
+    if (m_a_occ < (NICE_0_LOAD >> EPOCH_OLD))
+        m_a_cpu = -1;
+
+    mm->mm_sched_cpu = m_a_cpu;
+
+    free_cpumask_var(cpus);
+}
+

This task work may take a long time for the system with large number cpus which increacing the delay for process back to userspace. It may be the reason that schbench benchmark regressed so much.

Thanks for the insight Jianyong, yes, the scan on all online CPUs would
be costly.

To avoid searching the whole system, what about just searching the preferred numa node provided by numa balancing if there is one. If not, then fallback to search the whole system or just search the numa node where the main process locates as there is a high probability it contains the preferred LLC. In other words, we can opt for a suboptimal LLC location to prioritize speed.

WDYT?

This is a good idea. Previously, Tim had a version that dealt with a
similar scenario, which only scanned the CPUs within p's preferred node.
However, it seems to cause bouncing of the mm->mm_sched_cpu because we
set a 2X threshold for switching the mm->mm_sched_cpu in patch 5. If the
old mm_sched_cpu is not in p's current preferred node, last_m_a_occ is
always 0, which makes the switching of mm->mm_sched_cpu always succeed
due to the condition if (m_a_occ > (2 * last_m_a_occ)). Anyway, since it
is a software issue, we can find a way to address it.

Maybe we also following Abel's suggestion that only one thread of
the process is allowed to perform the statistic calculation, this
could minimal the negative impact to the whole process.

thanks,
Chenyu

Next message: syzbot: "[syzbot] [hams?] possible deadlock in nr_rt_ioctl (2)"
Previous message: Sean Christopherson: "Re: [tip: x86/urgent] x86/traps: Initialize DR7 by writing its architectural reset value"
In reply to: Jianyong Wu: "Re: [RFC patch v3 01/20] sched: Cache aware load-balancing"
Next in thread: Tim Chen: "Re: [RFC patch v3 01/20] sched: Cache aware load-balancing"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]