Re: [PATCH 3/3] sched/fair: Ensure select housekeeping cpus in task_numa_find_cpu

From: K Prateek Nayak
Date: Thu Dec 26 2024 - 23:41:17 EST


Hello Chuyi,

On 12/23/2024 6:28 PM, Chuyi Zhou wrote:


On 12/18/2024 2:21 PM, K Prateek Nayak wrote:
Hello Chuyi,

On 12/16/2024 5:53 PM, Chuyi Zhou wrote:
[..snip..]
@@ -2081,6 +2081,12 @@ numa_type numa_classify(unsigned int imbalance_pct,
      return node_fully_busy;
  }
+static inline bool numa_migrate_test_cpu(struct task_struct *p, int cpu)
+{
+    return cpumask_test_cpu(cpu, p->cpus_ptr) &&
+            housekeeping_cpu(cpu, HK_TYPE_DOMAIN);
+}
+
  #ifdef CONFIG_SCHED_SMT
  /* Forward declarations of select_idle_sibling helpers */
  static inline bool test_idle_cores(int cpu);
@@ -2168,7 +2174,7 @@ static void task_numa_assign(struct task_numa_env *env,
          /* Find alternative idle CPU. */
          for_each_cpu_wrap(cpu, cpumask_of_node(env->dst_nid), start + 1) {

Can we just do:

     for_each_cpu_and(cpu, cpumask_of_node(env->dst_nid), housekeeping_cpumask(HK_TYPE_DOMAIN)) {
         ...
     }

and avoid adding numa_migrate_test_cpu(). Thoughts?

Makes sense, but there doesn't currently seem to be an API like for_each_cpu_wrap_and().

Do you think the following is better?

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 855df103f4dd..4792ef672738 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2167,9 +2167,9 @@ static void task_numa_assign(struct task_numa_env *env,
                int start = env->dst_cpu;

                /* Find alternative idle CPU. */
-               for_each_cpu_wrap(cpu, cpumask_of_node(env->dst_nid), start + 1) {
+               for_each_cpu_and(cpu, cpumask_of_node(env->dst_nid), housekeeping_cpumask(HK_TYPE_DOMAIN)) {
                        if (cpu == env->best_cpu || !idle_cpu(cpu) ||

"start" is set to "env->dst_cpu" is already taken care here with the
first comparison.

-                           !cpumask_test_cpu(cpu, env->p->cpus_ptr)) {
+                               cpu == start || !cpumask_test_cpu(cpu, env->p->cpus_ptr)) {
                                continue;
                        }


I think for_each_cpu_wrap() was used here to reduce contention on the
xchg() operation below. If we want to keep that behavior, perhaps we
can use a per-CPU temporary mask (like load_balance_mask) and break
this into cpumask_and() + for_each_cpu_wrap() steps. I'm not sure
whether any of the existing masks (load_balance_mask, select_rq_mask,
should_we_balance_tmpmask) can be safely reused here. Otherwise,
perhaps this use case can justify adding a for_each_cpu_and_wrap().
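
For reference, here is a minimal (untested) sketch of the per-CPU mask
idea; the "numa_migrate_mask" name and its allocation are hypothetical,
and whichever mask ends up being used must be safe to modify in this
context:

    /*
     * Hypothetical per-CPU scratch mask, analogous to load_balance_mask;
     * it would need to be allocated for each CPU during sched_init().
     */
    static DEFINE_PER_CPU(cpumask_var_t, numa_migrate_mask);

    ...
        struct cpumask *cpus = this_cpu_cpumask_var_ptr(numa_migrate_mask);

        /* Restrict the search to housekeeping CPUs on the destination node. */
        cpumask_and(cpus, cpumask_of_node(env->dst_nid),
                    housekeeping_cpumask(HK_TYPE_DOMAIN));

        /* Keep the wrapped iteration to spread out the xchg() attempts. */
        for_each_cpu_wrap(cpu, cpus, start + 1) {
            if (cpu == env->best_cpu || !idle_cpu(cpu) ||
                !cpumask_test_cpu(cpu, env->p->cpus_ptr))
                continue;
            ...
        }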


Thanks.

              if (cpu == env->best_cpu || !idle_cpu(cpu) ||
-                !cpumask_test_cpu(cpu, env->p->cpus_ptr)) {
+                !numa_migrate_test_cpu(env->p, cpu)) {
                  continue;
              }
@@ -2480,7 +2486,7 @@ static void task_numa_find_cpu(struct task_numa_env *env,
      for_each_cpu(cpu, cpumask_of_node(env->dst_nid)) {

Same modifications can be made for this outer loop.
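
Since this outer loop does not use the wrapped iterator, the
for_each_cpu_and() form from above should work directly here; an
untested sketch:

    for_each_cpu_and(cpu, cpumask_of_node(env->dst_nid),
                     housekeeping_cpumask(HK_TYPE_DOMAIN)) {
        ...
    }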



--
Thanks and Regards,
Prateek