Re: [PATCH 1/4] sched/topology: Store root domain CPU capacity sum

From: Vincent Guittot
Date: Thu Apr 09 2020 - 10:14:24 EST


On Thu, 9 Apr 2020 at 15:50, Dietmar Eggemann <dietmar.eggemann@xxxxxxx> wrote:
>
> On 08.04.20 19:03, Vincent Guittot wrote:
> > On Wed, 8 Apr 2020 at 18:31, Dietmar Eggemann <dietmar.eggemann@xxxxxxx> wrote:
> >>
> >> On 08.04.20 14:29, Vincent Guittot wrote:
> >>> On Wed, 8 Apr 2020 at 11:50, Dietmar Eggemann <dietmar.eggemann@xxxxxxx> wrote:
> >>
> >> [...]
> >>
> >>>> /**
> >>>> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> >>>> index 8344757bba6e..74b0c0fa4b1b 100644
> >>>> --- a/kernel/sched/topology.c
> >>>> +++ b/kernel/sched/topology.c
> >>>> @@ -2052,12 +2052,17 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
> >>>> /* Attach the domains */
> >>>> rcu_read_lock();
> >>>> for_each_cpu(i, cpu_map) {
> >>>> + unsigned long cap = arch_scale_cpu_capacity(i);
> >>>
> >>> Why do you replace the use of rq->cpu_capacity_orig by
> >>> arch_scale_cpu_capacity(i) ?
> >>> There is nothing about this change in the commit message
> >>
> >> True. And I can change this back.
> >>
> >> It seems though that the solution is not sufficient because of the
> >> 'rd->span &nsub cpu_active_mask' issue discussed under patch 2/4.
> >>ap
> >> But this remind me of another question I have.
> >>
> >> Currently we use arch_scale_cpu_capacity() more often (16 times) than
> >> capacity_orig_of()/rq->cpu_capacity_orig .
> >>
> >> What's hindering us to remove rq->cpu_capacity_orig and the code around
> >> it and solely rely on arch_scale_cpu_capacity()? I mean the arch
> >> implementation should be fast.
> >
> > Or we can do the opposite and only use capacity_orig_of()/rq->cpu_capacity_orig.
> >
> > Is there a case where the max cpu capacity changes over time ? So I
> > would prefer to use cpu_capacity_orig which is a field of scheduler
> > instead of always calling an external arch specific function
>
> I see. So far it only changes during startup.
>
> And it looks like that asym_cpu_capacity_level() [topology.c] would fail
> if we would use capacity_orig_of() instead of arch_scale_cpu_capacity().

Yes I agree. See below

>
> post_init_entity_util_avg() [fair.c] and sugov_get_util()
> [cpufreq_schedutil.c] would be temporarily off until
> update_cpu_capacity() has updated cpu_rq(cpu)->cpu_capacity_orig.

I think that we could even get rid of this update in
update_cpu_capacity(). cpu_capacity_orig should be set while building
the sched_domain topology because the topology itself is built based
on this max cpu capacity with asym_cpu_capacity_level(). So changing
the capacity without rebuilding the domain could break the
sched_domain topology correctness.

And we can't really set cpu_capacity_orig earlier during the boot
because the capacity of b.L is set late during the boot and a rebuild
of the sched_domain topology is then triggered.

>
> compute_energy() [fair.c] is guarded by sched_energy_enabled() from
> being used at startup.
>
> scale_rt_capacity() could be changed in case we call it after the
> cpu_rq(cpu)->cpu_capacity_orig = arch_scale_cpu_capacity(cpu) in
> update_cpu_capacity().

With the removal of the update in update_cpu_capacity(), we don't have
a problem anymore, isn't it ?

>
> The Energy Model (and CPUfreq cooling) code would need
> capacity_orig_of() exported. arch_scale_cpu_capacity() currently is
> exported via include/linux/sched/topology.h.

Not sure that we need to export it outside scheduler, they can still
use arch_scale_cpu_capacity()

>
> I guess Pelt and 'scale invariant Deadline bandwidth enforcement' should
> continue using arch_scale_cpu_capacity() in sync with
> arch_scale_freq_capacity().

Why can't they use capacity_orig_of ?
we keep using arch_scale_freq_capacity() because it's dynamic but we
don't really need to keep using arch_scale_cpu_capacity()

>
> IMHO it's hard to give clear advice when to use the one or the other.
>
> We probably don't want to set cpu_rq(cpu)->cpu_capacity_orig in the arch
> cpu scale setter. We have arch_scale_cpu_capacity() to decouple that.

Yes I agree