Re: [RFC][PATCH] sched: Cache aware load-balancing

From: Chen, Yu C
Date: Mon Mar 31 2025 - 01:26:01 EST


On 3/30/2025 4:46 PM, Abel Wu wrote:
On 3/29/25 11:06 PM, Chen, Yu C wrote:
On 3/28/2025 9:57 PM, Abel Wu wrote:

IIUC this work updates stats for the whole mm and seems not
necessary for each task of the process to repeat same thing.
Hence would be better move this work to mm_struct.


It seems that the per task cache_work is used to avoid
duplicated task_cache_work() in task->task_works queue, see
task_tick_cache()'s check
  if (work->next == work)
     task_work_add()

Yes, this check avoids task_works being corrupt caused by double
adding the work, no matter this work is task- or mm-specific. So
if moving to mm_struct, this check become false once any task takes
this work, and other tasks can not do the same job until that task
finishes by setting work->next to work.

I see. What I previous thought was that, checking work->next == work
can not avoid two tasks doing the same statistic calculation
in task_cache_work(), because work->next = work is invoked
at the beginning of task_cache_work() - maybe need to put
this at the end of task_cache_work()?


BTW, moving to mm_struct will save some memory footprint?
Yes, this would help.


To do exclusive task_cache_work() and only allow 1 task
in the process to do the calculation, maybe introducing similar mechanism like task_numa_work(), something like this:

if (!try_cmpxchg(&mm->cache_next_scan, &calc, next_scan))
     return;

Yes, this looks good to me. While Peter used epoch to regulate
the frequency of this work, which is a ligher way but allows some
inaccuracy which is further fixed by a double check after holding
mm->mm_sched_lock.

I plan to use trylock on mm->mm_sched_lock first. If trylock fails
then someone is adding the work and we can skip early.

static void task_tick_cache(struct rq *rq, struct task_struct *p)
{
    ...

    if (mm->mm_sched_epoch == rq->cpu_epoch)
        return;

    guard(raw_spinlock)(&mm->mm_sched_lock);  <-- trylock

    ...
}


This lock is intended to protect that no two tasks enqueuing the same per-mm_struct work at the same time, right? And for the task work execution phase, maybe we also need to put work->next = work at the end of task_cache_work()?


+
+static inline
+void account_mm_sched(struct rq *rq, struct task_struct *p, s64 delta_exec)
+{
+    struct mm_struct *mm = p->mm;
+    struct mm_sched *pcpu_sched;
+    unsigned long epoch;
+
+    /*
+     * init_task and kthreads don't be having no mm
+     */
+    if (!mm || !mm->pcpu_sched)
+        return;
+
+    pcpu_sched = this_cpu_ptr(p->mm->pcpu_sched);
+
+    scoped_guard (raw_spinlock, &rq->cpu_epoch_lock) {
+        __update_mm_sched(rq, pcpu_sched);
+        pcpu_sched->runtime += delta_exec;
+        rq->cpu_runtime += delta_exec;
+        epoch = rq->cpu_epoch;
+    }
+
+    /*
+     * If this task hasn't hit task_cache_work() for a while, invalidate
+     * it's preferred state.
+     */
+    if (epoch - READ_ONCE(mm->mm_sched_epoch) > EPOCH_OLD) {
+        mm->mm_sched_cpu = -1;
+        pcpu_sched->occ = -1;
+    }

This seems too late. account_mm_sched() is called when p is runnable,
so if the whole process sleeps for a while before woken up, ttwu will
take the out-dated value.


Yup, there seems to be a problem. It would be better if we could reset the mm_sched_cpu to -1 after the last thread of the process falls asleep. But considering that all threads are sleeping, even if the ttwu tries to enqueue the first newly-woken thread to an out-dated idle mm_sched_cpu, does it matter? I guess it would not be a serious problem, because all the cache of the process might have been evicted due to long sleep.

Yes, it seems not a big deal if mm_sched_cpu not overwrites any better
choice.


+
+static void task_cache_work(struct callback_head *work)
+{
+    struct task_struct *p = current;
+    struct mm_struct *mm = p->mm;
+    unsigned long m_a_occ = 0;
+    int cpu, m_a_cpu = -1;
+    cpumask_var_t cpus;
+
+    WARN_ON_ONCE(work != &p->cache_work);
+
+    work->next = work;
+
+    if (p->flags & PF_EXITING)
+        return;
+
+    if (!alloc_cpumask_var(&cpus, GFP_KERNEL))
+        return;
+
+    scoped_guard (cpus_read_lock) {
+        cpumask_copy(cpus, cpu_online_mask);
+
+        for_each_cpu(cpu, cpus) {
+            /* XXX sched_cluster_active */
+            struct sched_domain *sd = per_cpu(sd_llc, cpu);
+            unsigned long occ, m_occ = 0, a_occ = 0;
+            int m_cpu = -1, nr = 0, i;
+
+            for_each_cpu(i, sched_domain_span(sd)) {
+                occ = fraction_mm_sched(cpu_rq(i),
+                            per_cpu_ptr(mm->pcpu_sched, i));
+                a_occ += occ;
+                if (occ > m_occ) {
+                    m_occ = occ;
+                    m_cpu = i;
+                }

It would be possible to cause task stacking on this hint cpu
due to its less frequently updated compared to wakeup.


The SIS would overwrite the prev CPU with this hint(cached) CPU, and use that cached CPU as a hint to search for an idle CPU, so ideally it should not cause task stacking. But there is a race condition that multiple wakeup path might find the same cached "idle" CPU and queue wakees on it, this usually happens when there is frequent context switch(wakeup)+short duration tasks.

Ideally mm_sched_cpu is updated every EPOCH_PERIOD which is default
to 10ms, so I'm afraid the race window is not small.

OK, understood. My thought was that, if many wakers start searching for idle CPU from the same mm_sched_cpu(because mm_sched_cpu has not been changed for 10ms), if waker1 succeeds to enqueue a long-running wakee1 on that mm_sched_cpu, other wakers might stop choosing that mm_sched_cpu in SIS. But if all the wakees are short-duration ones and doing frequent context switches, many wakers would think that they find an "idle" mm_sched_cpu, but actually it is heavily contended by many wakers.

thanks,
Chenyu



And although the occupancy heuristic looks reasonable, IMHO it
doesn't make much sense to compare between cpus as they share
the LLC, and a non-hint cpu with warmer L1/L2$ in same LLC with
the hint cpu seems more preferred.

Do you think it's appropriate or not to only hint on the hottest
LLC? So the tasks can hopefully wokenup on 'right' LLC on the
premise that wouldn't cause much imbalance between LLCs.

 > I will do some tests and return with more feedback.
 >

Find an idle CPU in the wakee's hostest LLC seems to be plausible.
The benchmark data might indicate a proper way.

thanks,
Chenyu