Re: [PATCH] sched/fair: Skip cpus with no sched domain attached during NOHZ idle balance
From: Zhang, Rui
Date: Thu Nov 16 2023 - 02:32:06 EST
Hi, Vincent,
Really appreciate your comments on this.
On Wed, 2023-11-15 at 21:01 +0100, Vincent Guittot wrote:
> Hi Rui,
>
> On Wed, 20 Sept 2023 at 09:24, Zhang, Rui <rui.zhang@xxxxxxxxx>
> wrote:
> >
> > Hi, Pierre,
> >
> > Sorry for the late response. I'm still ramping up on the related
> > code.
> >
> > On Thu, 2023-09-14 at 16:53 +0200, Pierre Gondois wrote:
> > >
> > >
> > > On 9/14/23 11:23, Zhang, Rui wrote:
> > > > Hi, Pierre,
> > > >
> > > > >
> > > > > Yes right indeed,
> > > > > This happens when putting a CPU offline (as you mentioned
> > > > > earlier,
> > > > > putting a CPU offline clears the CPU in the idle_cpus_mask).
> > > > >
> > > > > The load balancing related variables
> > > >
> > > > including?
> > >
> > > I meant the nohz idle variables in the load balancing, so I was
> > > referring to:
> > > (struct sched_domain_shared).nr_busy_cpus
> > > (struct sched_domain).nohz_idle
> > > nohz.idle_cpus_mask
> > > nohz.nr_cpus
> > > (struct rq).nohz_tick_stopped
> >
> > IMO, the problem is that, for an isolated CPU,
> > 1. it is not an idle cpu (nohz.idle_cpus_mask should be cleared)
> > 2. it is not a busy cpu (sds->nr_busy_cpus should be decreased)
> >
> > But current code does not have a third state to describe this, so
> > we
> > need to either
> > 1. add extra logic, like on_null_domain() checks
> > or
> > 2. rely on current logic, but update all related variables
> > correctly,
> > like you proposed.
>
> Isn't the housekeeping cpu mask there to manage such a case ?
This is true for isolated CPUs using boot option "nohz_full=".
But for CPUs in the cgroup isolated partition, the housekeeping cpumask
is not updated.
I don't know if this is intended or not.
> I was
> expecting that your isolated cpu should be cleared from the
> housekeeping cpumask used by scheduler and ILB
This patch is a direct fix when I found the isolated CPUs are woke up
by this piece of code.
>
> I think that your solution is the comment of the ffind_new_ilb()
> unction:
> "
> * - HK_TYPE_MISC CPUs are used for this task, because HK_TYPE_SCHED
> is not set
> * anywhere yet.
> "
>
> IMO, you should look at enabling and using the HK_TYPE_SCHED for
> isolated CPU
yeah, this seems reasonable.
I'm new to cgroup and I'm not sure what should be the proper behavior
for CPUs in isolated partition.
>
> CCed Frederic to get his opinion
Thanks.
-rui
>
> > But in any case, we should stick with one direction.
> >
> > If we follow the first one, the original patch should be used,
> > which
> > IMO is simple and straight forward.
> > If we follow the later one, we'd better audit and remove the
> > current
> > on_null_domain() usage at the same time. TBH, I'm not confident
> > enough
> > to make such a change. But if you want to propose something, I'd
> > glad
> > to test it.
> >
> > thanks,
> > rui
> >
> > >
> > > >
> > > > > are unused if a CPU has a NULL
> > > > > rq as it cannot pull any task. Ideally we should clear them
> > > > > once,
> > > > > when attaching a NULL sd to the CPU.
> > > >
> > > > This sounds good to me. But TBH, I don't have enough confidence
> > > > to
> > > > do
> > > > so because I'm not crystal clear about how these variables are
> > > > used.
> > > >
> > > > Some questions about the code below.
> > > > >
> > > > > The following snipped should do that and solve the issue you
> > > > > mentioned:
> > > > > --- snip ---
> > > > > --- a/include/linux/sched/nohz.h
> > > > > +++ b/include/linux/sched/nohz.h
> > > > > @@ -9,8 +9,10 @@
> > > > > #if defined(CONFIG_SMP) && defined(CONFIG_NO_HZ_COMMON)
> > > > > extern void nohz_balance_enter_idle(int cpu);
> > > > > extern int get_nohz_timer_target(void);
> > > > > +extern void nohz_clean_sd_state(int cpu);
> > > > > #else
> > > > > static inline void nohz_balance_enter_idle(int cpu) { }
> > > > > +static inline void nohz_clean_sd_state(int cpu) { }
> > > > > #endif
> > > > >
> > > > > #ifdef CONFIG_NO_HZ_COMMON
> > > > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > > > index b3e25be58e2b..6fcabe5d08f5 100644
> > > > > --- a/kernel/sched/fair.c
> > > > > +++ b/kernel/sched/fair.c
> > > > > @@ -11525,6 +11525,9 @@ void nohz_balance_exit_idle(struct rq
> > > > > *rq)
> > > > > {
> > > > > SCHED_WARN_ON(rq != this_rq());
> > > > >
> > > > > + if (on_null_domain(rq))
> > > > > + return;
> > > > > +
> > > > > if (likely(!rq->nohz_tick_stopped))
> > > > > return;
> > > > >
> > > > if we force clearing rq->nohz_tick_stopped when detaching
> > > > domain,
> > > > why
> > > > bother adding the first check?
> > >
> > > Yes you're right. I added this check for safety, but this is not
> > > mandatory.
> > >
> > > >
> > > > >
> > > > > @@ -11551,6 +11554,17 @@ static void
> > > > > set_cpu_sd_state_idle(int
> > > > > cpu)
> > > > > rcu_read_unlock();
> > > > > }
> > > > >
> > > > > +void nohz_clean_sd_state(int cpu) {
> > > > > + struct rq *rq = cpu_rq(cpu);
> > > > > +
> > > > > + rq->nohz_tick_stopped = 0;
> > > > > + if (cpumask_test_cpu(cpu, nohz.idle_cpus_mask)) {
> > > > > + cpumask_clear_cpu(cpu, nohz.idle_cpus_mask);
> > > > > + atomic_dec(&nohz.nr_cpus);
> > > > > + }
> > > > > + set_cpu_sd_state_idle(cpu);
> > > > > +}
> > > > > +
> > > >
> > > > detach_destroy_domains
> > > > cpu_attach_domain
> > > > update_top_cache_domain
> > > >
> > > > as we clears per_cpu(sd_llc, cpu) for the isolated cpu in
> > > > cpu_attach_domain(), set_cpu_sd_state_idle() seems to be a no-
> > > > op
> > > > here,
> > > > no?
> > >
> > > Yes you're right, cpu_attach_domain() and nohz_clean_sd_state()
> > > calls
> > > have to be inverted to avoid what you just described.
> > >
> > > It also seems that the current kernel doesn't decrease
> > > nr_busy_cpus
> > > when putting CPUs in an isolated partition. Indeed if a CPU is
> > > counted
> > > in nr_busy_cpus, putting the CPU in an isolated partition doesn't
> > > trigger
> > > any call to set_cpu_sd_state_idle().
> > > So it might an additional argument.
> > >
> > > Thanks for reading the patch,
> > > Regards,
> > > Pierre
> > >
> > > >
> > > > thanks,
> > > > rui
> > > > > /*
> > > > > * This routine will record that the CPU is going idle
> > > > > with
> > > > > tick
> > > > > stopped.
> > > > > * This info will be used in performing idle load
> > > > > balancing in
> > > > > the
> > > > > future.
> > > > > diff --git a/kernel/sched/topology.c
> > > > > b/kernel/sched/topology.c
> > > > > index d3a3b2646ec4..d31137b5f0ce 100644
> > > > > --- a/kernel/sched/topology.c
> > > > > +++ b/kernel/sched/topology.c
> > > > > @@ -2584,8 +2584,10 @@ static void
> > > > > detach_destroy_domains(const
> > > > > struct cpumask *cpu_map)
> > > > >
> > > > > static_branch_dec_cpuslocked(&sched_asym_cpucapacity);
> > > > >
> > > > > rcu_read_lock();
> > > > > - for_each_cpu(i, cpu_map)
> > > > > + for_each_cpu(i, cpu_map) {
> > > > > cpu_attach_domain(NULL, &def_root_domain,
> > > > > i);
> > > > > + nohz_clean_sd_state(i);
> > > > > + }
> > > > > rcu_read_unlock();
> > > > > }
> > > > >
> > > > > --- snip ---
> > > > >
> > > > > Regards,
> > > > > Pierre
> > > > >
> > > > > >
> > > > > > >
> > > > > > > > + }
> > > > > > > > +
> > > > > > > > /*
> > > > > > > > * The tick is still stopped but load could
> > > > > > > > have
> > > > > > > > been
> > > > > > > > added in the
> > > > > > > > * meantime. We set the nohz.has_blocked
> > > > > > > > flag to
> > > > > > > > trig
> > > > > > > > a
> > > > > > > > check of the
> > > > > > > > @@ -11585,10 +11609,6 @@ void
> > > > > > > > nohz_balance_enter_idle(int
> > > > > > > > cpu)
> > > > > > > > if (rq->nohz_tick_stopped)
> > > > > > > > goto out;
> > > > > > > > - /* If we're a completely isolated CPU, we don't
> > > > > > > > play:
> > > > > > > > */
> > > > > > > > - if (on_null_domain(rq))
> > > > > > > > - return;
> > > > > > > > -
> > > > > > > > rq->nohz_tick_stopped = 1;
> > > > > > > > cpumask_set_cpu(cpu, nohz.idle_cpus_mask);
> > > > > > > >
> > > > > > > > Otherwise I could reproduce the issue and the patch was
> > > > > > > > solving
> > > > > > > > it,
> > > > > > > > so:
> > > > > > > > Tested-by: Pierre Gondois <pierre.gondois@xxxxxxx>
> > > > > >
> > > > > > Thanks for testing, really appreciated!
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > Also, your patch doesn't aim to solve that, but I think
> > > > > > > > there
> > > > > > > > is an
> > > > > > > > issue
> > > > > > > > when updating cpuset.cpus when an isolated partition
> > > > > > > > was
> > > > > > > > already
> > > > > > > > created:
> > > > > > > >
> > > > > > > > // Create an isolated partition containing CPU0
> > > > > > > > # mkdir cgroup
> > > > > > > > # mount -t cgroup2 none cgroup/
> > > > > > > > # mkdir cgroup/Testing
> > > > > > > > # echo "+cpuset" > cgroup/cgroup.subtree_control
> > > > > > > > # echo "+cpuset" >
> > > > > > > > cgroup/Testing/cgroup.subtree_control
> > > > > > > > # echo 0 > cgroup/Testing/cpuset.cpus
> > > > > > > > # echo isolated > cgroup/Testing/cpuset.cpus.partition
> > > > > > > >
> > > > > > > > // CPU0's sched domain is detached:
> > > > > > > > # ls /sys/kernel/debug/sched/domains/cpu0/
> > > > > > > > # ls /sys/kernel/debug/sched/domains/cpu1/
> > > > > > > > domain0 domain1
> > > > > > > >
> > > > > > > > // Change the isolated partition to be CPU1
> > > > > > > > # echo 1 > cgroup/Testing/cpuset.cpus
> > > > > > > >
> > > > > > > > // CPU[0-1] sched domains are not updated:
> > > > > > > > # ls /sys/kernel/debug/sched/domains/cpu0/
> > > > > > > > # ls /sys/kernel/debug/sched/domains/cpu1/
> > > > > > > > domain0 domain1
> > > > > > > >
> > > > > > Interesting. Let me check and get back to you later on
> > > > > > this. :)
> > > > > >
> > > > > > thanks,
> > > > > > rui
> > > >
> >