Re: [PATCH] sched/fair: Use this_sd->weight to calculate span_avg

From: Vincent Guittot
Date: Tue Oct 12 2021 - 03:11:37 EST


On Mon, 11 Oct 2021 at 18:45, Tao Zhou <tao.zhou@xxxxxxxxx> wrote:
>
> Hi Steven,
>
> On Mon, Oct 11, 2021 at 10:58:02AM -0400, Steven Rostedt wrote:
> > On Sun, 10 Oct 2021 02:10:55 +0800
> > Tao Zhou <tao.zhou@xxxxxxxxx> wrote:
> >
> > > avg_idle, avg_cost got from this_rq and this_sd. I think
> > > use this_sd->weight to calculate and estimate the number
> > > of loop cpus in the target domain.
> >
> > If that's the case, then shouldn't the CPUs to be checked come from this_sd
> > as well? I mean, at the beginning of the function we have:
> >
> > this_sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
> > if (!this_sd)
> > return -1;
> >
> > cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
> >
> > Where "cpus" comes from sd, and not this_sd.
>
> Thank you for reply.
>
>
> Cycles are spent on this CPU(and in this Domain) if I am not wrong.
> If T1(on this CPU) want to try to wake up T2 on another CPU, Kernel
> (on this CPU) should evaluate that not spend much time in finding
> idle siblings. And the avg_idle and span_avg are two fators to compare
> with. avg_idle is the CPU average idle time of this rq and avg_cost
> is the last time this Domain was being looped and the recorded avg
> cost time. Two values are history averaged. So, use the history to
> evaluate future that we do *now* the another Domain cpu search to not
> let this CPU(and in this Domain) too busy. This is what I want to say.
> Not sure yet.

this_rq->wake_avg_idle is used to guest estimate how much busy is this_cpu
this_sd->avg_scan_cost is used to estimate how much time was spent in
average looking at an idle cpu at this domain level on this cpu
and sd->span_weight is is used to take into account the size of the
target domain on which it is going to loop. But we don't really care
about the size of the domain of the this_cpu

>
> > > ---
> > > kernel/sched/fair.c | 2 +-
> > > 1 file changed, 1 insertion(+), 1 deletion(-)
> > >
> > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > index f6a05d9b5443..7fab7b70814c 100644
> > > --- a/kernel/sched/fair.c
> > > +++ b/kernel/sched/fair.c
> > > @@ -6300,7 +6300,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
> > > avg_idle = this_rq->wake_avg_idle;
> > > avg_cost = this_sd->avg_scan_cost + 1;
> > >
> > > - span_avg = sd->span_weight * avg_idle;
> > > + span_avg = this_sd->span_weight * avg_idle;
> > > if (span_avg > 4*avg_cost)
> > > nr = div_u64(span_avg, avg_cost);
> > > else
> >
> >
> > And after this code, the nr that is determined from the above, is for
> > limiting the looping over those CPUs from sd, not this_sd:
> >
> > for_each_cpu_wrap(cpu, cpus, target + 1) {
> > if (has_idle_core) {
> > i = select_idle_core(p, cpu, cpus, &idle_cpu);
> > if ((unsigned int)i < nr_cpumask_bits)
> > return i;
> >
> > } else {
> > if (!--nr)
> > return -1;
> > idle_cpu = __select_idle_cpu(cpu, p);
> > if ((unsigned int)idle_cpu < nr_cpumask_bits)
> > break;
> > }
> > }
> >
> > I'm guessing there's nothing wrong here. But, I don't fully understand it
> > myself.
>
> I replied to Barry and not say that I missed that AND operation.
>
> Here is another go.
>
> @cpus is per-cpu and irq disable. I am not sure irq disable can
> be preempt in in RT. If that is possiable, cpu_ptr is not safe.
> Based on the code and comments in __do_set_cpus_allowed(), he will
> not use ->pi_lock to change affinity(eg. SCA_MIGRATE_DISABLE).
>
> If the cpu_ptr not get changed(I must be right and wrong more 5 time)
> @core is not in @cpu_ptr mask, BUT, the cpu in smt mask of @core is
> the one that we use to call select_idle_core() and @core is in this
> smt mask but is not allowed and we get @*idle_cpu and @idle is true
> we return the @core. Another not sure.
>
> If the @cpu_ptr can changed in eg. just before the @core loop.
> It is possible that @*idle_cpu == -1 and all cpu in smt mask
> of @core is seemed to be not allowed. And, we return @core.
> This case is what I just look at select_idle_core() semantics
> and missed that AND operation.
>
> > -- Steve
>
>
>
>
> Thanks,
> Tao