[PATCH v4 02/22] sched/cache: Limit the number of CPUs scanned when calculating task occupancy

From: Tim Chen

Date: Wed Apr 01 2026 - 17:47:32 EST


From: Chen Yu <yu.c.chen@xxxxxxxxx>

When NUMA balancing is enabled, the kernel currently iterates over all
online CPUs to aggregate process-wide occupancy data. On large systems,
this global scan introduces significant overhead.

To reduce scan latency, limit the search to a subset of relevant CPUs:
1. The task's preferred NUMA node.
2. The node where the task is currently running.
3. The node that contains the task's current preferred LLC.

While scanning only the preferred NUMA node would be ideal, a
process-wide scan must remain flexible because the "preferred node"
is a per-task attribute. Different threads within the same process may
have different preferred nodes, causing the process-wide preference to
migrate between nodes. Maintaining a mask that covers the preferred
node, the preferred-LLC node, and the node currently running the task
ensures accuracy while significantly reducing the number of CPUs
inspected.

Future work may integrate numa_group to further refine task aggregation.

Suggested-by: Madadi Vineeth Reddy <vineethr@xxxxxxxxxxxxx>
Signed-off-by: Chen Yu <yu.c.chen@xxxxxxxxx>
Co-developed-by: Tim Chen <tim.c.chen@xxxxxxxxxxxxxxx>
Signed-off-by: Tim Chen <tim.c.chen@xxxxxxxxxxxxxxx>
---

Notes:
v3->v4:
New patch.

kernel/sched/fair.c | 46 ++++++++++++++++++++++++++++++++++++++++++++-
1 file changed, 45 insertions(+), 1 deletion(-)
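
For reviewers who want to see the mask-union logic in isolation: the
construction in get_scan_cpumasks() can be sketched in plain userspace C,
modeling each node's cpumask as a 64-bit word. All names and the
8-CPUs-per-node topology below are illustrative, not kernel API:

```c
/*
 * Userspace sketch of the scan-mask construction (not kernel code).
 * The scan mask is the union of the preferred node, the node holding
 * the process's preferred-LLC CPU, and the task's current node, with
 * a fallback to all online CPUs when no NUMA preference exists yet.
 */
#include <assert.h>
#include <stdint.h>

#define NUMA_NO_NODE	(-1)
#define CPUS_PER_NODE	8	/* illustrative topology */

/* stand-in for cpumask_of_node() */
static uint64_t node_mask(int nid)
{
	return ((1ULL << CPUS_PER_NODE) - 1) << (nid * CPUS_PER_NODE);
}

/* stand-in for cpu_to_node() */
static int cpu_to_node(int cpu)
{
	return cpu / CPUS_PER_NODE;
}

static uint64_t get_scan_mask(int pref_nid, int pref_llc_cpu,
			      int curr_cpu, uint64_t online)
{
	uint64_t cpus = 0;

	/* no preference yet: fall back to a full scan */
	if (pref_nid == NUMA_NO_NODE)
		return online;

	/* honor the task's preferred node */
	cpus |= node_mask(pref_nid);

	/* include the node holding the preferred-LLC CPU, if not covered */
	if (pref_llc_cpu != -1 && !(cpus & (1ULL << pref_llc_cpu)))
		cpus |= node_mask(cpu_to_node(pref_llc_cpu));

	/* make sure the current running node is covered */
	if (!(cpus & (1ULL << curr_cpu)))
		cpus |= node_mask(cpu_to_node(curr_cpu));

	return cpus;
}
```

With the topology above, a task preferring node 0 whose preferred-LLC
CPU sits on node 1 and which currently runs on node 2 gets the union of
all three node masks; if both CPUs already fall inside node 0, only node
0 is scanned.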

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index eb3cfb852a93..20a33900f4ea 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1431,6 +1431,50 @@ static void task_tick_cache(struct rq *rq, struct task_struct *p)
 	}
 }
 
+static void get_scan_cpumasks(cpumask_var_t cpus, struct task_struct *p)
+{
+#ifdef CONFIG_NUMA_BALANCING
+	int cpu, curr_cpu, pref_nid;
+
+	if (!static_branch_likely(&sched_numa_balancing))
+		goto out;
+
+	cpu = p->mm->sc_stat.cpu;
+	curr_cpu = task_cpu(p);
+
+	/*
+	 * Scanning in the preferred NUMA node is ideal. However, the NUMA
+	 * preferred node is per-task rather than per-process. It is possible
+	 * for different threads of the process to have distinct preferred
+	 * nodes; consequently, the process-wide preferred LLC may bounce
+	 * between different nodes. As a workaround, maintain the scan
+	 * CPU mask to also cover the process's current preferred LLC and the
+	 * current running node to mitigate the bouncing risk.
+	 * TBD: numa_group should be considered during task aggregation.
+	 */
+	pref_nid = p->numa_preferred_nid;
+	/* honor the task's preferred node */
+	if (pref_nid == NUMA_NO_NODE)
+		goto out;
+
+	cpumask_or(cpus, cpus, cpumask_of_node(pref_nid));
+
+	/* honor the task's preferred LLC CPU */
+	if (cpu != -1 && !cpumask_test_cpu(cpu, cpus))
+		cpumask_or(cpus, cpus,
+			   cpumask_of_node(cpu_to_node(cpu)));
+
+	/* make sure the task's current running node is included */
+	if (!cpumask_test_cpu(curr_cpu, cpus))
+		cpumask_or(cpus, cpus, cpumask_of_node(cpu_to_node(curr_cpu)));
+
+	return;
+
+out:
+#endif
+	cpumask_copy(cpus, cpu_online_mask);
+}
+
 static void task_cache_work(struct callback_head *work)
 {
 	struct task_struct *p = current;
@@ -1451,7 +1495,7 @@ static void task_cache_work(struct callback_head *work)
 		return;
 
 	scoped_guard (cpus_read_lock) {
-		cpumask_copy(cpus, cpu_online_mask);
+		get_scan_cpumasks(cpus, p);
 
 		for_each_cpu(cpu, cpus) {
 			/* XXX sched_cluster_active */
--
2.32.0