Re: [PATCH v3 6/8] sched/idle: Move busy_cpu accounting to idle callback
From: Srikar Dronamraju
Date: Fri May 21 2021 - 09:22:02 EST
* Vincent Guittot <vincent.guittot@xxxxxxxxxx> [2021-05-21 14:37:51]:
> On Thu, 13 May 2021 at 09:41, Srikar Dronamraju
> <srikar@xxxxxxxxxxxxxxxxxx> wrote:
> >
> > Currently we account nr_busy_cpus in no_hz idle functions.
> > There is no reason why nr_busy_cpus should be updated in NO_HZ_COMMON
> > configs only. Also, the scheduler can mark a CPU as non-busy as soon as an
> > idle class task starts to run. The scheduler can then mark a CPU as busy
> > as soon as it is woken up from idle or a new task is placed on its
> > runqueue.
> >
> > Cc: LKML <linux-kernel@xxxxxxxxxxxxxxx>
> > Cc: Gautham R Shenoy <ego@xxxxxxxxxxxxxxxxxx>
> > Cc: Parth Shah <parth@xxxxxxxxxxxxx>
> > Cc: Ingo Molnar <mingo@xxxxxxxxxx>
> > Cc: Peter Zijlstra <peterz@xxxxxxxxxxxxx>
> > Cc: Valentin Schneider <valentin.schneider@xxxxxxx>
> > Cc: Dietmar Eggemann <dietmar.eggemann@xxxxxxx>
> > Cc: Mel Gorman <mgorman@xxxxxxxxxxxxxxxxxxx>
> > Cc: Vincent Guittot <vincent.guittot@xxxxxxxxxx>
> > Cc: Rik van Riel <riel@xxxxxxxxxxx>
> > Cc: Aubrey Li <aubrey.li@xxxxxxxxxxxxxxx>
> > Signed-off-by: Srikar Dronamraju <srikar@xxxxxxxxxxxxxxxxxx>
> > ---
> > kernel/sched/fair.c | 6 ++++--
> > kernel/sched/idle.c | 29 +++++++++++++++++++++++++++--
> > kernel/sched/sched.h | 1 +
> > kernel/sched/topology.c | 2 ++
> > 4 files changed, 34 insertions(+), 4 deletions(-)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 0dfe01de22d6..8f86359efdbd 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -10410,7 +10410,10 @@ static void set_cpu_sd_state_busy(int cpu)
> > goto unlock;
> > sd->nohz_idle = 0;
> >
> > - atomic_inc(&sd->shared->nr_busy_cpus);
> > + if (sd && per_cpu(is_idle, cpu)) {
> > + atomic_add_unless(&sd->shared->nr_busy_cpus, 1, per_cpu(sd_llc_size, cpu));
> > + per_cpu(is_idle, cpu) = 0;
> > + }
> > unlock:
> > rcu_read_unlock();
> > }
> > @@ -10440,7 +10443,6 @@ static void set_cpu_sd_state_idle(int cpu)
> > goto unlock;
> > sd->nohz_idle = 1;
> >
> > - atomic_dec(&sd->shared->nr_busy_cpus);
> > unlock:
> > rcu_read_unlock();
> > }
> > diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
> > index a9f5a8ace59e..c13105fe06b3 100644
> > --- a/kernel/sched/idle.c
> > +++ b/kernel/sched/idle.c
> > @@ -431,12 +431,25 @@ static void check_preempt_curr_idle(struct rq *rq, struct task_struct *p, int fl
> >
> > static void put_prev_task_idle(struct rq *rq, struct task_struct *prev)
> > {
> > -#ifdef CONFIG_SCHED_SMT
> > +#ifdef CONFIG_SMP
> > + struct sched_domain_shared *sds;
> > int cpu = rq->cpu;
> >
> > +#ifdef CONFIG_SCHED_SMT
> > if (static_branch_likely(&sched_smt_present))
> > set_core_busy(cpu);
> > #endif
> > +
> > + rcu_read_lock();
> > + sds = rcu_dereference(per_cpu(sd_llc_shared, cpu));
> > + if (sds) {
> > + if (per_cpu(is_idle, cpu)) {
> > + atomic_inc(&sds->nr_busy_cpus);
> > + per_cpu(is_idle, cpu) = 0;
> > + }
> > + }
> > + rcu_read_unlock();
> > +#endif
> > }
> >
> > static void set_next_task_idle(struct rq *rq, struct task_struct *next, bool first)
> > @@ -448,9 +461,21 @@ static void set_next_task_idle(struct rq *rq, struct task_struct *next, bool fir
> > struct task_struct *pick_next_task_idle(struct rq *rq)
> > {
> > struct task_struct *next = rq->idle;
> > +#ifdef CONFIG_SMP
> > + struct sched_domain_shared *sds;
> > + int cpu = rq->cpu;
> >
> > - set_next_task_idle(rq, next, true);
> > + rcu_read_lock();
> > + sds = rcu_dereference(per_cpu(sd_llc_shared, cpu));
> > + if (sds) {
> > + atomic_add_unless(&sds->nr_busy_cpus, -1, 0);
> > + per_cpu(is_idle, cpu) = 1;
> > + }
>
> One reason to update nr_busy_cpus only during the tick, and not at each
> and every single sleep/wakeup, is to limit the number of atomic_inc/dec in
> case of a storm of short-running tasks. Because in the end, you waste
> more time trying to accurately follow the current state of the CPU
> than doing work.
>
Yes, I do understand that for short-running tasks, or when CPUs enter idle
for only a very short interval, we would be tracking the number of busy CPUs
unnecessarily.
However, let's assume we have to compare 2 LLCs and choose the better
one for a wakeup.
1. We can look at nr_busy_cpus, which may not have been updated at every
CPU idle.
2. We can look at nr_busy_cpus, which has been updated at every CPU idle.
3. We can aggregate the load of all the CPUs in the LLC.
4. We can use the current method, which only compares the load on the
previous CPU and the current CPU. However, that doesn't give much indication
of whether the other CPUs in those LLCs were free.
Or probably some other method.
I thought option 2 would be better, but I am okay with option 1 too.
Please let me know which option you would prefer.
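To make the comparison in options 1 and 2 concrete, here is a rough userspace sketch, not kernel code. It keeps a per-LLC busy counter that saturates at 0 and at the LLC size, and picks the LLC with the lower busy ratio. The names llc_state, add_unless, cpu_goes_idle, cpu_goes_busy and pick_idler_llc are hypothetical and exist only for this illustration. add_unless merely imitates the behaviour of the kernel's atomic_add_unless() with C11 atomics, and the is_idle guard on the idle path is an assumption of the sketch rather than something the patch does.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical userspace stand-in for the per-LLC shared state. */
struct llc_state {
	atomic_int nr_busy_cpus;	/* CPUs currently considered busy */
	int llc_size;			/* number of CPUs in this LLC */
};

/*
 * Add @delta to @v unless @v currently equals @unless.
 * Imitates the kernel's atomic_add_unless(); returns true if the add happened.
 */
static bool add_unless(atomic_int *v, int delta, int unless)
{
	int old = atomic_load(v);

	do {
		if (old == unless)
			return false;
		/* On failure, 'old' is refreshed with the current value. */
	} while (!atomic_compare_exchange_weak(v, &old, old + delta));

	return true;
}

/* CPU starts running the idle task: drop the busy count, never below 0. */
static void cpu_goes_idle(struct llc_state *llc, bool *is_idle)
{
	if (*is_idle)	/* assumption: guard against double accounting */
		return;
	add_unless(&llc->nr_busy_cpus, -1, 0);
	*is_idle = true;
}

/* CPU leaves idle: raise the busy count, never above llc_size. */
static void cpu_goes_busy(struct llc_state *llc, bool *is_idle)
{
	if (!*is_idle)
		return;
	add_unless(&llc->nr_busy_cpus, 1, llc->llc_size);
	*is_idle = false;
}

/*
 * Return the LLC with the lower busy ratio (busy / size).
 * Ratios are compared by cross-multiplication to avoid division;
 * ties favour @a.
 */
static struct llc_state *pick_idler_llc(struct llc_state *a,
					struct llc_state *b)
{
	int a_busy = atomic_load(&a->nr_busy_cpus);
	int b_busy = atomic_load(&b->nr_busy_cpus);

	assert(a->llc_size > 0 && b->llc_size > 0);

	return (a_busy * b->llc_size <= b_busy * a->llc_size) ? a : b;
}
```

With option 1 the counters read by pick_idler_llc() may be stale by up to a tick. With option 2 they are fresher, at the cost of one atomic operation on every idle entry and exit.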
> >
> > + rcu_read_unlock();
> > +#endif
> > +
> > + set_next_task_idle(rq, next, true);
> > return next;
> > }
> >
> > diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> > index 98c3cfbc5d26..b66c4dad5fd2 100644
> > --- a/kernel/sched/sched.h
> > +++ b/kernel/sched/sched.h
> > @@ -1496,6 +1496,7 @@ DECLARE_PER_CPU(int, sd_llc_id);
> > #ifdef CONFIG_SCHED_SMT
> > DECLARE_PER_CPU(int, smt_id);
> > #endif
> > +DECLARE_PER_CPU(int, is_idle);
> > DECLARE_PER_CPU(struct sched_domain_shared __rcu *, sd_llc_shared);
> > DECLARE_PER_CPU(struct sched_domain __rcu *, sd_numa);
> > DECLARE_PER_CPU(struct sched_domain __rcu *, sd_asym_packing);
> > diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> > index 232fb261dfc2..730252937712 100644
> > --- a/kernel/sched/topology.c
> > +++ b/kernel/sched/topology.c
> > @@ -647,6 +647,7 @@ DEFINE_PER_CPU(int, sd_llc_id);
> > #ifdef CONFIG_SCHED_SMT
> > DEFINE_PER_CPU(int, smt_id);
> > #endif
> > +DEFINE_PER_CPU(int, is_idle);
> > DEFINE_PER_CPU(struct sched_domain_shared __rcu *, sd_llc_shared);
> > DEFINE_PER_CPU(struct sched_domain __rcu *, sd_numa);
> > DEFINE_PER_CPU(struct sched_domain __rcu *, sd_asym_packing);
> > @@ -673,6 +674,7 @@ static void update_top_cache_domain(int cpu)
> > #ifdef CONFIG_SCHED_SMT
> > per_cpu(smt_id, cpu) = cpumask_first(cpu_smt_mask(cpu));
> > #endif
> > + per_cpu(is_idle, cpu) = 1;
> > rcu_assign_pointer(per_cpu(sd_llc_shared, cpu), sds);
> >
> > sd = lowest_flag_domain(cpu, SD_NUMA);
> > --
> > 2.18.2
> >
--
Thanks and Regards
Srikar Dronamraju