Re: [RFC][PATCH] sched: Cache aware load-balancing

From: Chen, Yu C
Date: Sat Mar 29 2025 - 11:06:36 EST


On 3/28/2025 9:57 PM, Abel Wu wrote:
Hi Peter,

On 3/25/25 8:09 PM, Peter Zijlstra wrote:
  struct mmu_gather;
  extern void tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm);
  extern void tlb_gather_mmu_fullmm(struct mmu_gather *tlb, struct mm_struct *mm);
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 6e5c38718ff5..f8eafe440369 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1379,6 +1379,10 @@ struct task_struct {
      unsigned long            numa_pages_migrated;
  #endif /* CONFIG_NUMA_BALANCING */
+#ifdef CONFIG_SCHED_CACHE
+    struct callback_head        cache_work;
+#endif

IIUC this work updates stats for the whole mm, so it seems unnecessary
for each task of the process to repeat the same thing. Hence it would
be better to move this work to mm_struct.


It seems that the per-task cache_work is used to avoid queuing
duplicate task_cache_work() entries in the task->task_works queue, see
task_tick_cache()'s check:

    if (work->next == work)
        task_work_add()

To make task_cache_work() exclusive and allow only one task in the
process to do the calculation, maybe introduce a mechanism similar to
task_numa_work(), something like this:

    if (!try_cmpxchg(&mm->cache_next_scan, &calc, next_scan))
        return;
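
Filled out a bit, modeled on task_numa_work() (just a sketch:
mm->cache_next_scan would be a new per-mm field, and the 100 ms scan
period is only for illustration):

    static void task_cache_work(struct callback_head *work)
    {
        struct mm_struct *mm = current->mm;
        unsigned long now = jiffies;
        unsigned long calc, next_scan;

        /* Too early for the next scan of this mm? */
        calc = READ_ONCE(mm->cache_next_scan);
        if (time_before(now, calc))
            return;

        /* Only the thread that wins the cmpxchg runs the scan. */
        next_scan = now + msecs_to_jiffies(100);
        if (!try_cmpxchg(&mm->cache_next_scan, &calc, next_scan))
            return;

        /* ... per-LLC occupancy aggregation as in the patch ... */
    }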

+
+static inline
+void account_mm_sched(struct rq *rq, struct task_struct *p, s64 delta_exec)
+{
+    struct mm_struct *mm = p->mm;
+    struct mm_sched *pcpu_sched;
+    unsigned long epoch;
+
+    /*
+     * init_task and kthreads don't have an mm
+     */
+    if (!mm || !mm->pcpu_sched)
+        return;
+
+    pcpu_sched = this_cpu_ptr(p->mm->pcpu_sched);
+
+    scoped_guard (raw_spinlock, &rq->cpu_epoch_lock) {
+        __update_mm_sched(rq, pcpu_sched);
+        pcpu_sched->runtime += delta_exec;
+        rq->cpu_runtime += delta_exec;
+        epoch = rq->cpu_epoch;
+    }
+
+    /*
+     * If this task hasn't hit task_cache_work() for a while, invalidate
+     * its preferred state.
+     */
+    if (epoch - READ_ONCE(mm->mm_sched_epoch) > EPOCH_OLD) {
+        mm->mm_sched_cpu = -1;
+        pcpu_sched->occ = -1;
+    }

This seems too late. account_mm_sched() is called when p is runnable,
so if the whole process sleeps for a while before being woken up, ttwu
will pick up the outdated value.


Yup, there seems to be a problem. It would be better if we could reset
mm_sched_cpu to -1 after the last thread of the process falls asleep.
But considering that all threads are sleeping, even if ttwu tries to
enqueue the first newly-woken thread on an outdated idle mm_sched_cpu,
does it matter? I guess it would not be a serious problem, because most
of the process's cache might have been evicted during the long sleep
anyway.
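
If eager invalidation did turn out to matter, one rough direction
(sketch only: the per-mm runnable counter below is hypothetical, not in
the patch) could be to hook the sleeping-dequeue path:

    static void mm_sched_dequeue_sleep(struct task_struct *p)
    {
        struct mm_struct *mm = p->mm;

        if (!mm || !mm->pcpu_sched)
            return;

        /*
         * mm_sched_running would be a new atomic_t, incremented on
         * enqueue and decremented on sleeping dequeue; when it hits
         * zero the whole process is asleep, so drop the hint.
         */
        if (atomic_dec_and_test(&mm->mm_sched_running))
            WRITE_ONCE(mm->mm_sched_cpu, -1);
    }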

+
+static void task_cache_work(struct callback_head *work)
+{
+    struct task_struct *p = current;
+    struct mm_struct *mm = p->mm;
+    unsigned long m_a_occ = 0;
+    int cpu, m_a_cpu = -1;
+    cpumask_var_t cpus;
+
+    WARN_ON_ONCE(work != &p->cache_work);
+
+    work->next = work;
+
+    if (p->flags & PF_EXITING)
+        return;
+
+    if (!alloc_cpumask_var(&cpus, GFP_KERNEL))
+        return;
+
+    scoped_guard (cpus_read_lock) {
+        cpumask_copy(cpus, cpu_online_mask);
+
+        for_each_cpu(cpu, cpus) {
+            /* XXX sched_cluster_active */
+            struct sched_domain *sd = per_cpu(sd_llc, cpu);
+            unsigned long occ, m_occ = 0, a_occ = 0;
+            int m_cpu = -1, nr = 0, i;
+
+            for_each_cpu(i, sched_domain_span(sd)) {
+                occ = fraction_mm_sched(cpu_rq(i),
+                            per_cpu_ptr(mm->pcpu_sched, i));
+                a_occ += occ;
+                if (occ > m_occ) {
+                    m_occ = occ;
+                    m_cpu = i;
+                }

It would be possible to cause task stacking on this hint cpu because
it is updated less frequently than wakeups happen.


The SIS would overwrite the prev CPU with this hint (cached) CPU, and
use that cached CPU as a hint to search for an idle CPU, so ideally it
should not cause task stacking. But there is a race condition: multiple
wakeup paths might find the same cached "idle" CPU and queue wakees on
it. This usually happens with frequent context switches (wakeups) plus
short-duration tasks.
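
Roughly, the handoff I have in mind looks like this (sketch only, the
helper name is made up; select_idle_sibling() would then search around
the returned CPU for one that is actually idle):

    static int select_cache_cpu(struct task_struct *p, int prev_cpu)
    {
        struct mm_struct *mm = p->mm;
        int cpu;

        if (!mm || !mm->pcpu_sched)
            return prev_cpu;

        cpu = READ_ONCE(mm->mm_sched_cpu);
        if (cpu < 0 || !cpumask_test_cpu(cpu, p->cpus_ptr))
            return prev_cpu;

        /*
         * Nothing here closes the race where several concurrent
         * wakeups all observe the same CPU as idle.
         */
        return cpu;
    }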


And although the occupancy heuristic looks reasonable, IMHO it
doesn't make much sense to compare between cpus as they share
the LLC, and a non-hint cpu with a warmer L1/L2$ in the same LLC
as the hint cpu seems preferable.

Do you think it is appropriate to only hint at the hottest LLC?
Then the tasks can hopefully be woken up on the 'right' LLC, on the
premise that it wouldn't cause much imbalance between LLCs.

> I will do some tests and return with more feedback.
>

Finding an idle CPU in the wakee's hottest LLC seems plausible.
The benchmark data might indicate the proper way.
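
Something like this, perhaps (sketch only: the helper name is made up,
and the RCU read lock is assumed to be held as in the rest of the
wakeup path):

    static int select_hottest_llc_cpu(struct task_struct *p, int prev_cpu)
    {
        int hint = READ_ONCE(p->mm->mm_sched_cpu);
        struct sched_domain *sd;
        int cpu;

        if (hint < 0)
            return prev_cpu;

        /* Search only the LLC the occupancy scan found hottest. */
        sd = rcu_dereference(per_cpu(sd_llc, hint));
        if (!sd)
            return prev_cpu;

        for_each_cpu_and(cpu, sched_domain_span(sd), p->cpus_ptr) {
            if (available_idle_cpu(cpu))
                return cpu;
        }

        return prev_cpu;
    }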

thanks,
Chenyu