Re: [PATCH] sched/fair: Skip cpus with no sched domain attached during NOHZ idle balance

From: Pierre Gondois
Date: Thu Sep 14 2023 - 10:53:21 EST




On 9/14/23 11:23, Zhang, Rui wrote:
Hi, Pierre,


Yes, right indeed.
This happens when putting a CPU offline (as you mentioned earlier,
putting a CPU offline clears the CPU in idle_cpus_mask, IIRC via the
nohz_balance_exit_idle() call in sched_cpu_dying()).

The load balancing related variables

including?

I meant the nohz idle variables used in load balancing, so I was referring to:
(struct sched_domain_shared).nr_busy_cpus
(struct sched_domain).nohz_idle
nohz.idle_cpus_mask
nohz.nr_cpus
(struct rq).nohz_tick_stopped
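
For reference, these live roughly here (abridged from ~v6.5 sources;
the exact field sets vary between kernel versions):

/* include/linux/sched/topology.h (abridged) */
struct sched_domain_shared {
        atomic_t        ref;
        atomic_t        nr_busy_cpus;   /* busy CPUs sharing this LLC */
        int             has_idle_cores;
};

struct sched_domain {
        int             nohz_idle;      /* NOHZ IDLE status */
};

/* kernel/sched/sched.h (abridged) */
struct rq {
        unsigned char   nohz_tick_stopped;
};

/* kernel/sched/fair.c (abridged) */
static struct {
        cpumask_var_t   idle_cpus_mask; /* CPUs which went NOHZ idle */
        atomic_t        nr_cpus;        /* weight of idle_cpus_mask */
        int             has_blocked;    /* idle CPUs may carry blocked load */
        unsigned long   next_balance;   /* next idle balance, in jiffies */
        unsigned long   next_blocked;   /* next blocked load update */
} nohz ____cacheline_aligned;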


are unused if a CPU has a NULL sched domain attached,
as it cannot pull any task. Ideally we should clear them once,
when attaching a NULL sd to the CPU.

This sounds good to me. But TBH, I don't have enough confidence to do
so because I'm not crystal clear about how these variables are used.

Some questions about the code below.

The following snippet should do that and solve the issue you
mentioned:
--- snip ---
--- a/include/linux/sched/nohz.h
+++ b/include/linux/sched/nohz.h
@@ -9,8 +9,10 @@
  #if defined(CONFIG_SMP) && defined(CONFIG_NO_HZ_COMMON)
  extern void nohz_balance_enter_idle(int cpu);
  extern int get_nohz_timer_target(void);
+extern void nohz_clean_sd_state(int cpu);
  #else
  static inline void nohz_balance_enter_idle(int cpu) { }
+static inline void nohz_clean_sd_state(int cpu) { }
  #endif
  #ifdef CONFIG_NO_HZ_COMMON
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b3e25be58e2b..6fcabe5d08f5 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -11525,6 +11525,9 @@ void nohz_balance_exit_idle(struct rq *rq)
  {
         SCHED_WARN_ON(rq != this_rq());
+       if (on_null_domain(rq))
+               return;
+
         if (likely(!rq->nohz_tick_stopped))
                 return;

If we force clearing rq->nohz_tick_stopped when detaching the domain,
why bother adding the first check?

Yes, you're right. I added this check for safety, but it is not
mandatory.


@@ -11551,6 +11554,17 @@ static void set_cpu_sd_state_idle(int cpu)
         rcu_read_unlock();
  }
+void nohz_clean_sd_state(int cpu)
+{
+       struct rq *rq = cpu_rq(cpu);
+
+       rq->nohz_tick_stopped = 0;
+       if (cpumask_test_cpu(cpu, nohz.idle_cpus_mask)) {
+               cpumask_clear_cpu(cpu, nohz.idle_cpus_mask);
+               atomic_dec(&nohz.nr_cpus);
+       }
+       set_cpu_sd_state_idle(cpu);
+}
+

detach_destroy_domains()
  -> cpu_attach_domain()
    -> update_top_cache_domain()

As we clear per_cpu(sd_llc, cpu) for the isolated cpu in
cpu_attach_domain(), set_cpu_sd_state_idle() seems to be a no-op here,
no?
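
For reference, set_cpu_sd_state_idle() only reaches nr_busy_cpus
through per_cpu(sd_llc, cpu) (abridged from kernel/sched/fair.c):

static void set_cpu_sd_state_idle(int cpu)
{
        struct sched_domain *sd;

        rcu_read_lock();
        sd = rcu_dereference(per_cpu(sd_llc, cpu));

        /* With sd_llc already cleared, this returns without touching anything. */
        if (!sd || sd->nohz_idle)
                goto unlock;
        sd->nohz_idle = 1;

        atomic_dec(&sd->shared->nr_busy_cpus);
unlock:
        rcu_read_unlock();
}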

Yes, you're right: the cpu_attach_domain() and nohz_clean_sd_state()
calls have to be inverted to avoid what you just described.

It also seems that the current kernel doesn't decrease nr_busy_cpus
when putting CPUs in an isolated partition. Indeed, if a CPU is counted
in nr_busy_cpus, putting the CPU in an isolated partition doesn't trigger
any call to set_cpu_sd_state_idle().
So it might be an additional argument for clearing this state when
detaching the domain.
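
I.e., something like this for the topology.c part (an untested sketch):

        rcu_read_lock();
        for_each_cpu(i, cpu_map) {
                /*
                 * Clear the nohz idle state while per_cpu(sd_llc, i) is
                 * still set, so that set_cpu_sd_state_idle() can still
                 * reach sd->shared->nr_busy_cpus; only then detach the
                 * domain.
                 */
                nohz_clean_sd_state(i);
                cpu_attach_domain(NULL, &def_root_domain, i);
        }
        rcu_read_unlock();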

Thanks for reading the patch,
Regards,
Pierre


thanks,
rui
  /*
   * This routine will record that the CPU is going idle with tick stopped.
   * This info will be used in performing idle load balancing in the future.
   */
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index d3a3b2646ec4..d31137b5f0ce 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -2584,8 +2584,10 @@ static void detach_destroy_domains(const struct cpumask *cpu_map)
                static_branch_dec_cpuslocked(&sched_asym_cpucapacity);
         rcu_read_lock();
-       for_each_cpu(i, cpu_map)
+       for_each_cpu(i, cpu_map) {
                 cpu_attach_domain(NULL, &def_root_domain, i);
+               nohz_clean_sd_state(i);
+       }
         rcu_read_unlock();
  }

--- snip ---

Regards,
Pierre



+       }
+
          /*
           * The tick is still stopped but load could have been added in the
           * meantime. We set the nohz.has_blocked flag to trig a check of the
           * *_avg. The CPU is already part of nohz.idle_cpus_mask so the clear
           * of nohz.has_blocked can only happen after checking the new load
           */
@@ -11585,10 +11609,6 @@ void nohz_balance_enter_idle(int cpu)
          if (rq->nohz_tick_stopped)
                  goto out;
-       /* If we're a completely isolated CPU, we don't play: */
-       if (on_null_domain(rq))
-               return;
-
          rq->nohz_tick_stopped = 1;
          cpumask_set_cpu(cpu, nohz.idle_cpus_mask);

Otherwise I could reproduce the issue and the patch was solving it, so:
Tested-by: Pierre Gondois <pierre.gondois@xxxxxxx>

Thanks for testing, really appreciated!





Also, your patch doesn't aim to solve that, but I think there is an
issue when updating cpuset.cpus when an isolated partition was already
created:

// Create an isolated partition containing CPU0
# mkdir cgroup
# mount -t cgroup2 none cgroup/
# mkdir cgroup/Testing
# echo "+cpuset" > cgroup/cgroup.subtree_control
# echo "+cpuset" > cgroup/Testing/cgroup.subtree_control
# echo 0 > cgroup/Testing/cpuset.cpus
# echo isolated > cgroup/Testing/cpuset.cpus.partition

// CPU0's sched domain is detached:
# ls /sys/kernel/debug/sched/domains/cpu0/
# ls /sys/kernel/debug/sched/domains/cpu1/
domain0  domain1

// Change the isolated partition to be CPU1
# echo 1 > cgroup/Testing/cpuset.cpus

// CPU[0-1] sched domains are not updated:
# ls /sys/kernel/debug/sched/domains/cpu0/
# ls /sys/kernel/debug/sched/domains/cpu1/
domain0  domain1

Interesting. Let me check and get back to you later on this. :)

thanks,
rui