Re: [RFC][PATCH] sched: Cache aware load-balancing

From: Chen, Yu C
Date: Mon Mar 31 2025 - 02:25:59 EST


On 3/27/2025 7:20 PM, Hillf Danton wrote:
On Wed, Mar 26, 2025 at 11:25:53AM +0100, Peter Zijlstra wrote:
On Wed, Mar 26, 2025 at 10:38:41AM +0100, Peter Zijlstra wrote:

Nah, the saner thing to do is to preserve the topology averages and look
at those instead of the per-cpu values.

Eg. have task_cache_work() compute and store averages in the
sched_domain structure and then use those.

A little something like so perhaps ?
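
[Peter's snippet itself does not appear in the quote here. Purely to
illustrate the direction he describes, a per-LLC average kept in the shared
domain state and refreshed from task_cache_work(), a rough sketch could look
like the code below; the occ_avg field, the helper name and the EWMA weight
are made-up placeholders rather than anything from a posted patch.]

static inline void sd_update_occ_avg(struct sched_domain_shared *sds,
				     unsigned long sample)
{
	/* hypothetical field: EWMA of cache occupancy for this LLC */
	unsigned long old = READ_ONCE(sds->occ_avg);

	/* new = 3/4 * old + 1/4 * sample */
	WRITE_ONCE(sds->occ_avg, old - (old >> 2) + (sample >> 2));
}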

My $.02 followup, based on the assumption that L2 cache temperature
only makes sense when compared against something else. Just to show the idea.

Hillf

--- m/include/linux/sched.h
+++ n/include/linux/sched.h
@@ -1355,6 +1355,11 @@ struct task_struct {
unsigned long numa_pages_migrated;
#endif /* CONFIG_NUMA_BALANCING */
+#ifdef CONFIG_SCHED_CACHE
+#define LXC_SIZE 64 /* should be set up by parsing topology */
+ unsigned long lxc_temp[LXC_SIZE]; /* lx, x > 1: L2 cache temperature, for instance */
+#endif
+
#ifdef CONFIG_RSEQ
struct rseq __user *rseq;
u32 rseq_len;
--- m/kernel/sched/fair.c
+++ n/kernel/sched/fair.c
@@ -7953,6 +7953,22 @@ static int select_idle_sibling(struct ta
if ((unsigned)i < nr_cpumask_bits)
return i;
+#ifdef CONFIG_SCHED_CACHE
+ /*
+ * 2, lxc temperature only makes sense when compared against another:
+ *
+ * target can be any CPU if the lxc is cold
+ */
+ if ((unsigned int)prev_aff < nr_cpumask_bits)
+ if (p->lxc_temp[per_cpu(sd_share_id, (unsigned int)prev_aff)] >
+ p->lxc_temp[per_cpu(sd_share_id, target)])
+ target = prev_aff;
+ if ((unsigned int)recent_used_cpu < nr_cpumask_bits)
+ if (p->lxc_temp[per_cpu(sd_share_id, (unsigned int)recent_used_cpu)] >
+ p->lxc_temp[per_cpu(sd_share_id, target)])
+ target = recent_used_cpu;
+ p->lxc_temp[per_cpu(sd_share_id, target)] += 1;
+#else
/*
* For cluster machines which have lower sharing cache like L2 or
* LLC Tag, we tend to find an idle CPU in the target's cluster
@@ -7963,6 +7979,7 @@ static int select_idle_sibling(struct ta
return prev_aff;
if ((unsigned int)recent_used_cpu < nr_cpumask_bits)
return recent_used_cpu;
+#endif
return target;
}
@@ -13059,6 +13076,13 @@ static void task_tick_fair(struct rq *rq
if (static_branch_unlikely(&sched_numa_balancing))
task_tick_numa(rq, curr);
+#ifdef CONFIG_SCHED_CACHE
+ /*
+ * 0, the lxc is considered cold after a 2-second nap
+ * 1, a task migrating across NUMA nodes makes the lxc cold
+ */
+ curr->lxc_temp[per_cpu(sd_share_id, rq->cpu)] += 5;

If lxc_temp is per task, this might be another direction: tracking each
task's activity rather than the whole process's activity.
I think the idea is applicable for overriding target with another CPU
when the latter sits in a cache-hot LLC, so that select_idle_cpu() can
search for an idle CPU within that cache-hot LLC.
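
For illustration only, a rough sketch of that override placed before the
LLC scan in select_idle_sibling(); llc_temp() is a made-up placeholder for
whatever per-task hotness metric ends up being used, nothing here is from
the posted patches:

	/*
	 * Steer 'target' towards a CPU whose LLC is hotter for the task,
	 * so the following select_idle_cpu() walk scans that cache-hot
	 * LLC for an idle CPU instead of the original target's LLC.
	 */
	if ((unsigned int)prev < nr_cpumask_bits &&
	    !cpus_share_cache(prev, target) &&
	    llc_temp(p, prev) > llc_temp(p, target))
		target = prev;

	sd = rcu_dereference(per_cpu(sd_llc, target));
	if (sd) {
		i = select_idle_cpu(p, sd, has_idle_core, target);
		if ((unsigned int)i < nr_cpumask_bits)
			return i;
	}

If the hot LLC has no idle CPU, the search would fall back to target as it
does today.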

thanks,
Chenyu