Re: [RFC][PATCH] sched: Cache aware load-balancing

From: Chen, Yu C
Date: Wed Mar 26 2025 - 22:49:23 EST


On 3/26/2025 5:38 PM, Peter Zijlstra wrote:
On Tue, Mar 25, 2025 at 11:19:52PM +0800, Chen, Yu C wrote:

+static void task_tick_cache(struct rq *rq, struct task_struct *p)
+{
+        struct callback_head *work = &p->cache_work;
+        struct mm_struct *mm = p->mm;
+
+        if (!mm || !mm->pcpu_sched)
+                return;
+
+        if (mm->mm_sched_epoch == rq->cpu_epoch)
+                return;
+

[1]

+        guard(raw_spinlock)(&mm->mm_sched_lock);
+
+        if (mm->mm_sched_epoch == rq->cpu_epoch)
+                return;

Remove the duplicated check at [1] above and keep only the check here?

Why? That just adds locking overhead for no reason.


I thought mm->mm_sched_epoch is updated under the protection of
mm->mm_sched_lock, so shouldn't the read side also be protected by
this lock?
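
If the lockless check at [1] is intentional (fast path, re-checked under
the lock), should the read there at least pair with the WRITE_ONCE() done
below, i.e. something like this sketch?

        /* lockless fast path; may race, re-checked under the lock below */
        if (READ_ONCE(mm->mm_sched_epoch) == rq->cpu_epoch)
                return;

        guard(raw_spinlock)(&mm->mm_sched_lock);

        /* re-check: another CPU may have queued the work already */
        if (mm->mm_sched_epoch == rq->cpu_epoch)
                return;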

+
+        if (work->next == work) {
+                task_work_add(p, work, TWA_RESUME);
+                WRITE_ONCE(mm->mm_sched_epoch, rq->cpu_epoch);
+        }
+}
+
+static void task_cache_work(struct callback_head *work)
+{
+        struct task_struct *p = current;
+        struct mm_struct *mm = p->mm;
+        unsigned long m_a_occ = 0;
+        int cpu, m_a_cpu = -1;
+        cpumask_var_t cpus;
+
+        WARN_ON_ONCE(work != &p->cache_work);
+
+        work->next = work;
+
+        if (p->flags & PF_EXITING)
+                return;
+
+        if (!alloc_cpumask_var(&cpus, GFP_KERNEL))
+                return;
+
+        scoped_guard (cpus_read_lock) {
+                cpumask_copy(cpus, cpu_online_mask);
+
+                for_each_cpu(cpu, cpus) {
+                        /* XXX sched_cluster_active */
+                        struct sched_domain *sd = per_cpu(sd_llc, cpu);
+                        unsigned long occ, m_occ = 0, a_occ = 0;
+                        int m_cpu = -1, nr = 0, i;
+
+                        for_each_cpu(i, sched_domain_span(sd)) {
+                                occ = fraction_mm_sched(cpu_rq(i),
+                                                        per_cpu_ptr(mm->pcpu_sched, i));
+                                a_occ += occ;
+                                if (occ > m_occ) {
+                                        m_occ = occ;
+                                        m_cpu = i;
+                                }
+                                nr++;
+                                trace_printk("(%d) occ: %ld m_occ: %ld m_cpu: %d nr: %d\n",
+                                             per_cpu(sd_llc_id, i), occ, m_occ, m_cpu, nr);
+                        }
+
+                        a_occ /= nr;


In systems with a large number of CPUs within a single LLC, the division
may lose accuracy.

It may, sure, but does it actually matter? We're tracking occupancy in ns
per EPOCH_PERIOD, which is 1e7. Let's assume 'large' here is something
daft like 100 CPUs per LLC (let's just really hope nobody actually goes
and does that, please), then you're still only losing something like
1e2 ns from your average.


I see, the ~100 ns should not matter much compared to the 10 ms sample period.

Can we either use multiplication for the comparison, or just use the
accumulated total CPU occupancy of that LLC for the comparison
(by removing a_occ /= nr)?

You can't remove the division; you get into trouble when the LLCs do not
have an equal number of CPUs.

On a hybrid system, suppose there are two LLCs: LLC1 has 8 CPUs with a sum_occ
of 1024, so its avg_occ is 128; LLC2 has 4 CPUs with a sum_occ of 768, so its
avg_occ is 192. If we compare avg_occ, we would choose LLC2, even though LLC1
might have more idle CPUs and more active cache-hot threads.

You can carry that multiplication around ofc,
but again, why bother?
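
Understood. Just for reference, carrying the multiplication around would look
roughly like the sketch below (a_sum is the un-divided per-LLC sum, and
m_a_sum/m_a_nr are made-up locals tracking the current best LLC's total
occupancy and CPU count); probably not worth the extra state given the
numbers above:

        /* sketch: compare a_sum/nr > m_a_sum/m_a_nr via cross-multiplication */
        if ((u64)a_sum * m_a_nr > (u64)m_a_sum * nr) {
                m_a_sum = a_sum;
                m_a_nr  = nr;
                m_a_cpu = m_cpu;
        }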

+                        if (a_occ > m_a_occ) {
+                                m_a_occ = a_occ;
+                                m_a_cpu = m_cpu;
+                        }
+
+                        trace_printk("(%d) a_occ: %ld m_a_occ: %ld\n",
+                                     per_cpu(sd_llc_id, cpu), a_occ, m_a_occ);
+
+                        for_each_cpu(i, sched_domain_span(sd)) {
+                                /* XXX threshold ? */
+                                per_cpu_ptr(mm->pcpu_sched, i)->occ = a_occ;
+                        }
+
+                        cpumask_andnot(cpus, cpus, sched_domain_span(sd));
+                }
+        }
+
+        /*
+         * If the max average cache occupancy is 'small' we don't care.
+         */
+        if (m_a_occ < (NICE_0_LOAD >> EPOCH_OLD))
+                m_a_cpu = -1;
+
+        mm->mm_sched_cpu = m_a_cpu;

There might be an issue with mm_sched_cpu bouncing. Perhaps adding a
threshold to compare the old a_occ of mm->mm_sched_cpu with the new a_occ of
m_a_cpu could help. For example, if the latter (the new a_occ) is twice as
large as the former (the old a_occ), we update mm->mm_sched_cpu to the new
m_a_cpu. Otherwise, we keep the old value.

Some hysteresis might make sense, but I don't think waiting for it to
double makes sense; that might be too aggressive.


OK, I might need to do some evaluation to figure out a reasonable threshold.
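
Something along these lines is what I have in mind, with the margin as the
knob to tune (prev_occ would need to be remembered for the current
mm->mm_sched_cpu; both the name and the 25% margin are placeholders):

        /* sketch: only switch preference if the new winner is clearly better */
        if (mm->mm_sched_cpu >= 0 && m_a_cpu != mm->mm_sched_cpu &&
            m_a_occ < prev_occ + (prev_occ >> 2))
                m_a_cpu = mm->mm_sched_cpu;     /* keep the old preference */

        mm->mm_sched_cpu = m_a_cpu;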

+#ifdef CONFIG_SCHED_CACHE
+static long __migrate_degrades_locality(struct task_struct *p, int src_cpu, int dst_cpu, bool idle);
+
+static int select_cache_cpu(struct task_struct *p, int prev_cpu)
+{
+        struct mm_struct *mm = p->mm;
+        int cpu;
+
+        if (!mm || p->nr_cpus_allowed == 1)
+                return prev_cpu;
+
+        cpu = mm->mm_sched_cpu;
+        if (cpu < 0)
+                return prev_cpu;
+


We observed frequent task migrations during some context-switch-heavy
benchmarks, which led to a performance regression when the LLC was saturated.
Could we avoid task migration in cases where the previous CPU and the
preferred CPU are within the same LLC?

        if (cpus_share_cache(prev_cpu, cpu))
                return prev_cpu;

Sure.

+
+        if (static_branch_likely(&sched_numa_balancing) &&
+            __migrate_degrades_locality(p, prev_cpu, cpu, false) > 0) {
+                /*
+                 * XXX look for max occupancy inside prev_cpu's node
+                 */
+                return prev_cpu;
+        }
+

Tim found that spreading tasks within the preferred LLC might help avoid
task stacking:

        sd = rcu_dereference(per_cpu(sd_llc, cpu));
        if (likely(sd))
                return cpumask_any(sched_domain_span(sd));

You know that cpumask_any() is implemented using cpumask_first(), right?
If anything, you're causing stacking with that.


I see, this still causes a stacking issue. Let me try to find another method to spread tasks within their preferred LLC.
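
Maybe something like cpumask_any_and_distribute(), which rotates the starting
point instead of always picking the first CPU of the span; just a thought,
untested:

        sd = rcu_dereference(per_cpu(sd_llc, cpu));
        if (sd) {
                /* spread within the preferred LLC rather than stacking */
                int i = cpumask_any_and_distribute(sched_domain_span(sd),
                                                   p->cpus_ptr);
                if (i < nr_cpu_ids)
                        return i;
        }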

@@ -9288,6 +9556,17 @@ static int task_hot(struct task_struct *p, struct lb_env *env)
         if (sysctl_sched_migration_cost == 0)
                 return 0;
+#ifdef CONFIG_SCHED_CACHE
+        if (p->mm && p->mm->pcpu_sched) {
+                /*
+                 * XXX things like Skylake have non-inclusive L3 and might not
+                 * like this L3 centric view. What to do about L2 stickyness ?
+                 */
+                return per_cpu_ptr(p->mm->pcpu_sched, env->src_cpu)->occ >
+                       per_cpu_ptr(p->mm->pcpu_sched, env->dst_cpu)->occ;

This might encourage more task migration within the preferred LLC, leading
to some performance regression. Is it possible to raise the threshold for
task migration within the preferred LLC and use the original delta time to
determine whether a task is cache-hot?

        if (p->mm && p->mm->pcpu_sched &&
            cpus_share_cache(env->dst_cpu, env->src_cpu))
                return 2 * per_cpu_ptr(p->mm->pcpu_sched, env->src_cpu)->occ >
                       per_cpu_ptr(p->mm->pcpu_sched, env->dst_cpu)->occ;
        }

On second thought, my change might be incorrect, because every CPU in the same LLC should have the same per-CPU occ.

Nah, the saner thing to do is to preserve the topology averages and look
at those instead of the per-cpu values.

Eg. have task_cache_work() compute and store averages in the
sched_domain structure and then use those.

Sorry, I did not quite follow this. Doesn't the
per_cpu_ptr(p->mm->pcpu_sched, cpu)->occ
already have the average occupancy of the whole LLC domain? Every
CPU in the same LLC should have the same average occ because in
task_cache_work():
        for_each_cpu(i, sched_domain_span(sd)) {
                /* XXX threshold ? */
                per_cpu_ptr(mm->pcpu_sched, i)->occ = a_occ;
        }


If the goal is to avoid migrating a task to a non-preferred LLC during
load balance, maybe we can treat p as cache hot when all of the following
hold (a rough sketch follows the definition below):
1. env->src_cpu is already in p's preferred LLC, and
2. env->dst_cpu is not in p's preferred LLC, and
3. p's preferred LLC is not overloaded.

The definition of a task's preferred LLC would be something like:

        p->mm->mm_preferred_llc = llc_id(p->mm->mm_sched_cpu)
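
In code, that check would be roughly the sketch below; mm_preferred_llc and
llc_overloaded() are placeholders for the new field and for whatever overload
criterion we settle on:

#ifdef CONFIG_SCHED_CACHE
        if (p->mm && p->mm->pcpu_sched && p->mm->mm_preferred_llc >= 0) {
                int pref_llc = p->mm->mm_preferred_llc;

                /* src is in the preferred LLC, dst is not, and the
                 * preferred LLC still has capacity: keep p where it is */
                if (per_cpu(sd_llc_id, env->src_cpu) == pref_llc &&
                    per_cpu(sd_llc_id, env->dst_cpu) != pref_llc &&
                    !llc_overloaded(pref_llc))
                        return 1;
        }
#endif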


I'll add some trace/schedstat on top of your patch to investigate.

thanks,
Chenyu