Re: [PATCH v3] sched/deadline: Fix sched_dl_global_validate()

From: Peng Liu
Date: Thu Sep 24 2020 - 11:59:16 EST


Hi Peter,

Thanks for looking at this.

On Thu, Sep 24, 2020 at 12:57:14PM +0200, Peter Zijlstra wrote:
> On Sat, Sep 19, 2020 at 09:42:49AM +0800, Peng Liu wrote:
> > When user changes sched_rt_{runtime, period}_us, then
> >
> > sched_rt_handler()
> > --> sched_dl_bandwidth_validate()
> > {
> > new_bw = global_rt_runtime()/global_rt_period();
> >
> > for_each_possible_cpu(cpu) {
> > dl_b = dl_bw_of(cpu);
> > if (new_bw < dl_b->total_bw)
> > ret = -EBUSY;
> > }
> > }
> >
> > Under CONFIG_SMP, dl_bw is per root domain , but not per CPU,
> > dl_b->total_bw is the allocated bandwidth of the whole root domain.
> > we should compare dl_b->total_bw against cpus*new_bw, where 'cpus'
> > is the number of CPUs of the root domain.
>
> Is there an actual problem there? Spell it out.

I created another root domain(contains 2 CPUs) besides the default
one, and the global default rt bandwidth is 95%. Then I launched a
DL process which need 25% bandwidth and moved it to the new root
domain, so far so good.

Then I tried to change global rt bandwidth to 20% with cmd:
echo 200000 > /proc/sys/kernel/sched_rt_runtime_us
but ending with the "device busy" error. Only values greater than
250000 could work.

The new root domain contains two CPUs, thus should could provide
totally 2*20%(>25%) bandwidth. So the error is strange.

Finally I found it is the sched_dl_global_validate() mistakenly
do the validation. It doesn't multiply the root domain weight.

way to reproduce:
cd /sys/fs/cgroup/cpuset/
mkdir cluster
echo 0 > cpuset.sched_load_balance
cd cluster
echo 10-11 > cpuset.cpus
echo 0 > cpuset.mems
echo 1 > cpuset.cpu_exclusive
echo pid-of-dl25 > tasks

> >
> > [1] https://lkml.org/lkml/2010/2/28/119
>
> Don't use lkml.org links, use lkml.kernel.org/r/$MsgID instead.

OK, I will.

>
> > [!CONFIG_SMP build error]
> > Reported-by: kernel test robot <lkp@xxxxxxxxx>
> > Signed-off-by: Peng Liu <iwtbavbm@xxxxxxxxx>
>
> Quite frankly this patch is horrible #ifdef soup.

Frankly speaking, I also hate the ugly #ifdef guys, but I have no
idea how to eliminate them until seeing your method. Indeed, quite
clear. I will refine the patch according your suggestion. Thanks.

>
> Can't you make something like the below work?
>
> ---
> diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
> index 3862a28cd05d..3f309e0f69f5 100644
> --- a/kernel/sched/deadline.c
> +++ b/kernel/sched/deadline.c
> @@ -97,6 +97,17 @@ static inline unsigned long dl_bw_capacity(int i)
> return __dl_bw_capacity(i);
> }
> }
> +
> +static inline bool dl_bw_visited(int cpu, u64 gen)
> +{
> + struct root_domain *rd = cpu_rq(i)->rd;
> +
> + if (rd->visit_gen == gen)
> + return true;
> +
> + rd->visit_gen = gen;
> + return false;
> +}
> #else
> static inline struct dl_bw *dl_bw_of(int i)
> {
> @@ -112,6 +123,11 @@ static inline unsigned long dl_bw_capacity(int i)
> {
> return SCHED_CAPACITY_SCALE;
> }
> +
> +static inline bool dl_bw_visited(int cpu, u64 gen)
> +{
> + return false;
> +}
> #endif
>
> static inline
> @@ -2513,31 +2529,35 @@ const struct sched_class dl_sched_class
>
> int sched_dl_global_validate(void)
> {
> + static u64 generation = 0;
> u64 runtime = global_rt_runtime();
> u64 period = global_rt_period();
> u64 new_bw = to_ratio(period, runtime);
> - struct dl_bw *dl_b;
> - int cpu, ret = 0;
> + int cpu, cpus, ret = 0;
> unsigned long flags;
> + struct dl_bw *dl_b;
> + u64 gen = ++generation;
>
> /*
> * Here we want to check the bandwidth not being set to some
> * value smaller than the currently allocated bandwidth in
> * any of the root_domains.
> - *
> - * FIXME: Cycling on all the CPUs is overdoing, but simpler than
> - * cycling on root_domains... Discussion on different/better
> - * solutions is welcome!
> */
> for_each_possible_cpu(cpu) {
> +
> rcu_read_lock_sched();
> + if (dl_bw_visited(cpu, gen))
> + goto next;
> +
> dl_b = dl_bw_of(cpu);
> + cpus = dl_bw_cpus(cpu);
>
> raw_spin_lock_irqsave(&dl_b->lock, flags);
> - if (new_bw < dl_b->total_bw)
> + if (new_bw * cpus < dl_b->total_bw)
> ret = -EBUSY;
> raw_spin_unlock_irqrestore(&dl_b->lock, flags);
>
> + next:
> rcu_read_unlock_sched();
>
> if (ret)
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 28709f6b0975..7f0947db6e2c 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -801,6 +801,8 @@ struct root_domain {
> struct dl_bw dl_bw;
> struct cpudl cpudl;
>
> + u64 visit_gen;
> +
> #ifdef HAVE_RT_PUSH_IPI
> /*
> * For IPI pull requests, loop across the rto_mask.