Re: [PATCH v2 01/23] sched/cache: Introduce infrastructure for cache-aware load balancing
From: Vern Hao
Date: Tue Dec 16 2025 - 20:17:38 EST
On 2025/12/16 14:12, Chen, Yu C wrote:
On 12/11/2025 5:03 PM, Vern Hao wrote:
Hi, Peter, Chen Yu and Tim:
On 2025/12/4 07:07, Tim Chen wrote:
From: "Peter Zijlstra (Intel)" <peterz@xxxxxxxxxxxxx>
Adds infrastructure to enable cache-aware load balancing,
which improves cache locality by grouping tasks that share resources
within the same cache domain. This reduces cache misses and improves
overall data access efficiency.
In this initial implementation, threads belonging to the same process
are treated as entities that likely share working sets. The mechanism
tracks per-process CPU occupancy across cache domains and attempts to
migrate threads toward cache-hot domains where their process already
has active threads, thereby enhancing locality.
This provides a basic model for cache affinity. While the current code
targets the last-level cache (LLC), the approach could be extended to
other domain types such as clusters (L2) or node-internal groupings.
At present, the mechanism selects the CPU within an LLC that has the
highest recent runtime. Subsequent patches in this series will use this
information in the load-balancing path to guide task placement toward
preferred LLCs.
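As a rough user-space illustration of the selection described above (not the code from this patch), the sketch below sums one process's recent runtime per LLC, picks the LLC with the largest total, and returns the CPU with the highest runtime inside it. The fixed topology, `NR_CPUS`, `NR_LLCS`, `cpu_to_llc()` and `pick_preferred_cpu()` are all assumptions made for the example.

```c
#define NR_CPUS 8	/* hypothetical machine size */
#define NR_LLCS 2	/* hypothetical number of LLC domains */

/* Hypothetical topology: CPUs 0-3 share LLC 0, CPUs 4-7 share LLC 1. */
static int cpu_to_llc(int cpu)
{
	return cpu / (NR_CPUS / NR_LLCS);
}

/*
 * runtime[cpu] is one process's recent runtime on that CPU.
 * Returns the busiest CPU inside the LLC with the highest summed
 * runtime, or -1 if the process has no recorded runtime at all.
 */
static int pick_preferred_cpu(const unsigned long long runtime[NR_CPUS])
{
	unsigned long long llc_sum[NR_LLCS] = { 0 };
	int llc_best_cpu[NR_LLCS];
	int best_llc = -1;
	int cpu, llc;

	for (llc = 0; llc < NR_LLCS; llc++)
		llc_best_cpu[llc] = -1;

	/* Accumulate occupancy per LLC and remember its hottest CPU. */
	for (cpu = 0; cpu < NR_CPUS; cpu++) {
		llc = cpu_to_llc(cpu);
		llc_sum[llc] += runtime[cpu];
		if (llc_best_cpu[llc] < 0 ||
		    runtime[cpu] > runtime[llc_best_cpu[llc]])
			llc_best_cpu[llc] = cpu;
	}

	/* Choose the LLC with the largest non-zero total. */
	for (llc = 0; llc < NR_LLCS; llc++) {
		if (llc_sum[llc] == 0)
			continue;
		if (best_llc < 0 || llc_sum[llc] > llc_sum[best_llc])
			best_llc = llc;
	}

	return best_llc < 0 ? -1 : llc_best_cpu[best_llc];
}
```

Ties are resolved in favour of the lower-numbered CPU or LLC here; that is an arbitrary choice for the sketch.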
In the future, more advanced policies could be integrated through NUMA
balancing, for example by migrating a task to its preferred LLC when
spare capacity exists, or by swapping tasks across LLCs to improve
cache affinity.
Grouping of tasks could also be generalized from that of a process
to be that of a NUMA group, or be user configurable.
Originally-by: Peter Zijlstra (Intel) <peterz@xxxxxxxxxxxxx>
Signed-off-by: Chen Yu <yu.c.chen@xxxxxxxxx>
Signed-off-by: Tim Chen <tim.c.chen@xxxxxxxxxxxxxxx>
---
Notes:
v1->v2:
Restore the original CPU scan to cover all online CPUs,
rather than scanning within the preferred NUMA node.
(Peter Zijlstra)
Use rq->curr instead of rq->donor. (K Prateek Nayak)
Minor fix in task_tick_cache() to use
if (mm->mm_sched_epoch >= rq->cpu_epoch)
to avoid mm_sched_epoch going backwards.
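The effect of that check can be modelled in user space as follows. This is only a sketch of the guard, assuming the per-mm epoch is copied from the runqueue's epoch; `struct mm_model`, `struct rq_model` and `mm_epoch_advance()` are hypothetical stand-ins, not the patch's code.

```c
/* Hypothetical stand-ins for the mm_struct and rq fields named above. */
struct mm_model { unsigned long mm_sched_epoch; };
struct rq_model { unsigned long cpu_epoch; };

/*
 * Move the per-mm epoch up to the runqueue's epoch, but only when that
 * is a step forward. Returns 1 if the epoch was updated, 0 if the
 * update was skipped because it would not advance the epoch.
 */
static int mm_epoch_advance(struct mm_model *mm, const struct rq_model *rq)
{
	if (mm->mm_sched_epoch >= rq->cpu_epoch)
		return 0;	/* already at or ahead of this CPU's epoch */

	mm->mm_sched_epoch = rq->cpu_epoch;
	return 1;
}
```

With a per-CPU epoch, a thread ticking on a CPU whose epoch lags behind would otherwise pull the shared value back.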
include/linux/mm_types.h | 44 +++++++
include/linux/sched.h | 11 ++
init/Kconfig | 11 ++
kernel/fork.c | 6 +
kernel/sched/core.c | 6 +
kernel/sched/fair.c | 258 +++++++++++++++++++++++++++++++++++++++
kernel/sched/sched.h | 8 ++
7 files changed, 344 insertions(+)
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 90e5790c318f..1ea16ef90566 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -939,6 +939,11 @@ typedef struct {
DECLARE_BITMAP(__mm_flags, NUM_MM_FLAG_BITS);
} __private mm_flags_t;
+struct mm_sched {
+ u64 runtime;
+ unsigned long epoch;
+};
+
struct kioctx_table;
struct iommu_mm_data;
struct mm_struct {
@@ -1029,6 +1034,17 @@ struct mm_struct {
*/
raw_spinlock_t cpus_allowed_lock;
#endif
+#ifdef CONFIG_SCHED_CACHE
+ /*
+ * Track per-cpu-per-process occupancy as a proxy for cache residency.
+ * See account_mm_sched() and ...
+ */
+ struct mm_sched __percpu *pcpu_sched;
+ raw_spinlock_t mm_sched_lock;
+ unsigned long mm_sched_epoch;
+ int mm_sched_cpu;
As we discussed earlier, I still believe that dedicating
'mm_sched_cpu' to the aggregated hotspot of all threads is
inappropriate, because the threads in our real applications are not
necessarily correlated with one another.
So I was wondering whether we could move this variable into struct
task_struct. That would let us track the hotspot CPU of each thread
more accurately, although some details still need consideration.
I suppose you are suggesting fine-grained control for a set of tasks.
Process-scope aggregation could be a start as the default strategy
(conservative, benefits multi-threaded workloads that share data per
process, and does not introduce regressions).
Yes, in our real-world business scenarios at Tencent, I have indeed
encountered this issue: multiple threads are divided into several
categories to handle different transactions, so they do not share
hot data, and 'mm_sched_cpu' does not represent all of their tasks.
So adding a control interface, such as cgroup or something similar,
would be a good idea.
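To make the per-thread idea concrete, here is a minimal sketch assuming a hypothetical per-task hint that falls back to the process-wide 'mm_sched_cpu' when unset. None of `struct task_hint`, `task_sched_cpu`, `NO_PREF_CPU` or `preferred_cpu()` exists in the patch; they are invented for illustration only.

```c
#define NO_PREF_CPU (-1)	/* hypothetical "no preference" marker */

/* Process-wide hint, modelled on the mm_sched_cpu field in the patch. */
struct mm_hint {
	int mm_sched_cpu;
};

/* Hypothetical per-thread hint living next to a pointer to the mm. */
struct task_hint {
	int task_sched_cpu;		/* NO_PREF_CPU when unset */
	const struct mm_hint *mm;	/* may be NULL, e.g. kernel threads */
};

/*
 * Prefer the thread's own hotspot CPU when it has one; otherwise fall
 * back to the process-wide hotspot, and to NO_PREF_CPU without an mm.
 */
static int preferred_cpu(const struct task_hint *t)
{
	if (t->task_sched_cpu != NO_PREF_CPU)
		return t->task_sched_cpu;

	return t->mm ? t->mm->mm_sched_cpu : NO_PREF_CPU;
}
```

Threads that never set their own hint would keep today's process-scope behaviour, so the default stays conservative.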
On top of that, I wonder if we could provide task-scope control like
sched_setattr(), similar to the core-scheduling cookie mechanism, for
users that want aggressive aggregation. But before doing that, we need a
mechanism that leverages a monitoring system (like the PMU) to figure out
That may be a problem if the environment is running in a VM. We
could use tags to differentiate these tasks and run some tests to verify
the performance difference between unifying 'mm_sched_cpu' and not
unifying it.
if putting these tasks together would bring a benefit (if I understand
Steven's suggestion at LPC correctly), or to detect tasks that share
resources, and then maybe leverage QoS interfaces to enable the
cache-aware aggregation (something Qias mentioned at LPC).
thanks,
Chenyu