[RFC PATCH 2/8] sched/fair: Introduce helpers for cross-domain migration decisions

From: Jianyong Wu

Date: Wed Jun 24 2026 - 23:09:12 EST

Cache-aware scheduling makes migration decisions purely based on LLC
affinity, only permitting moves to a task's preferred LLC. This rigid
policy discards critical topology information including NUMA distances.

To leverage NUMA distance metrics, expand the original LLC-only scope
to the unified scheduling domain abstraction. A scheduling domain can
represent an LLC, a single NUMA node, or a cluster of multiple NUMA
nodes, covering all hierarchy tiers above the LLC level.

Add helper routines that check whether a target scheduling domain can
hold the migrating task. Place tasks in the lowest-level available
domain first; if that domain is at capacity, fall back to the next
higher scheduling domain tier.

Signed-off-by: Jianyong Wu <wujianyong@xxxxxxxx>
---
kernel/sched/fair.c | 101 ++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 101 insertions(+)
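
Note (not part of the commit message): the tiered fallback described
above can be modelled in user space roughly as follows. This is only a
sketch under stated assumptions, not the kernel code: struct tier,
tier_fits(), walk_tiers() and the 50% threshold are hypothetical
stand-ins, the tiers are assumed to start just above the LLC level, and
the to_pref branch and the small-workload special case of
can_migrate_node() are left out.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for one sched domain tier around the source CPU. */
struct tier {
	unsigned long util;	/* summed utilization of the tier */
	unsigned long cap;	/* summed capacity of the tier */
	bool contains_dst;	/* does the tier span the destination CPU? */
};

enum verdict { MIG_FORBID, MIG_ALLOW };

/* Assumed threshold: a tier "fits" while util stays below 50% of capacity. */
#define OVERLOAD_PCT 50UL

static bool tier_fits(unsigned long util, unsigned long cap)
{
	return util * 100 < cap * OVERLOAD_PCT;
}

/*
 * Walk the tiers from innermost to outermost. Reaching the tier that
 * spans the destination means every nearer tier was overloaded, so the
 * move is allowed. A nearer tier that still has room forbids the move.
 */
static enum verdict walk_tiers(const struct tier *tiers, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		if (tiers[i].contains_dst)
			return MIG_ALLOW;
		if (tier_fits(tiers[i].util, tiers[i].cap))
			return MIG_FORBID;
	}
	return MIG_ALLOW;
}
```

With an overloaded inner tier the walk reaches the tier containing the
destination and allows the move; with a lightly loaded inner tier it
stops early and forbids it.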

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d78467ec6ee1..dfca39c63333 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10563,6 +10563,107 @@ static enum llc_mig can_migrate_llc(int src_cpu, int dst_cpu,
return mig_llc;
}

+/*
+ * Like get_llc_stats(), but for a sched domain above the LLC level.
+ * Built on get_llc_stats(), it accumulates utilization and capacity
+ * for the sched domain at LLC granularity.
+ */
+static bool get_sd_stats(struct sched_domain *sd, unsigned long *util_out, unsigned long *cap_out)
+{
+ struct cpumask mask;
+ int cpu;
+ unsigned long util_tmp, cap_tmp, util = 0, cap = 0;
+ struct sched_domain *sd_tmp;
+
+ if (!sd || !util_out || !cap_out)
+ return false;
+
+ cpumask_copy(&mask, sched_domain_span(sd));
+ for_each_cpu(cpu, &mask) {
+ if (!get_llc_stats(cpu, &util_tmp, &cap_tmp))
+ return false;
+
+ sd_tmp = rcu_dereference(per_cpu(sd_llc, cpu));
+ cpumask_andnot(&mask, &mask, sched_domain_span(sd_tmp));
+ util += util_tmp;
+ cap += cap_tmp;
+ }
+
+ *util_out = util;
+ *cap_out = cap;
+
+ return true;
+}
+
+/* Decide whether a sched domain is overloaded. */
+static bool is_domain_overload(struct sched_domain *sd)
+{
+ bool ret;
+ unsigned long util = 0, cap = 0;
+
+ get_sd_stats(sd, &util, &cap);
+
+ ret = !fits_llc_capacity(util, cap);
+
+ return ret;
+}
+
+/*
+ * Decide whether migration to a specific node should happen.
+ * "Node" here is a generic term for a set of CPUs.
+ * It usually denotes a sched domain at the LLC level or above.
+ */
+static enum llc_mig __maybe_unused can_migrate_node(int src_cpu, int dst_cpu,
+ struct task_struct *p, bool to_pref)
+{
+ struct sched_domain *domain;
+ unsigned long dst_util, dst_cap, tsk_util = 0;
+ int k = 0;
+
+ if (!get_llc_stats(dst_cpu, &dst_util, &dst_cap))
+ return mig_unrestricted;
+
+ if (p)
+ tsk_util = task_util(p);
+
+ dst_util = dst_util + tsk_util;
+
+ if (to_pref) {
+ if (fits_llc_capacity(dst_util, dst_cap))
+ return mig_llc;
+ else
+ return mig_unrestricted;
+ }
+ /*
+ * If the destination node decreases locality, allow the migration only
+ * if it is the closest place that is not overloaded.
+ */
+ for_each_domain(src_cpu, domain) {
+ /* Skip sched domain at MC and below */
+ if (domain->flags & SD_SHARE_LLC)
+ continue;
+
+ /* Allow migration if dst_cpu is within this sched domain */
+ if (cpumask_test_cpu(dst_cpu, sched_domain_span(domain)))
+ return mig_llc;
+
+ /*
+ * Special case: the workload is small and dst_cpu may be far from
+ * src_cpu. If the current node can hold the workload but is overloaded,
+ * while the remote node can hold it and is not overloaded, give the
+ * remote node a chance.
+ */
+ if (p && (domain->span_weight > get_nr_threads(p) && k++))
+ return mig_unrestricted;
+
+ /* Don't migrate if there is a better place to live */
+ if (!is_domain_overload(domain))
+ return mig_forbid;
+ }
+
+ return mig_unrestricted;
+}
+
/*
* Check if task p can migrate from source LLC to
* destination LLC in terms of cache aware load balance.
--
2.34.1