Re: sched: ARM: arch_scale_freq_power

From: Vincent Guittot
Date: Tue Oct 11 2011 - 04:51:39 EST


On 11 October 2011 09:57, Peter Zijlstra <a.p.zijlstra@xxxxxxxxx> wrote:
> On Tue, 2011-10-11 at 12:46 +0530, Amit Kucheria wrote:
>> Adding Peter to the discussion..
>
> Right, CCing the folks who actually wrote the code you're asking
> questions about always helps ;-)
>
>> On Thu, Oct 6, 2011 at 5:06 PM, Vincent Guittot
>> <vincent.guittot@xxxxxxxxxx> wrote:
>> > I work to link the cpu_power of ARM cores to their frequency by using
>> > arch_scale_freq_power.
>
> Why and how? In particular note that if you're using something like the
> on-demand cpufreq governor this isn't going to work.
>

I have several goals. The first one is that I need to put more load on
some cpus when I have packages running at different cpu frequencies.
I am also studying whether cpu_power can follow the real cpu frequency,
but it does not seem to be so easy: I have noticed that cpu_power is
updated periodically, except when we have a lot of newly_idle events.
Then, I have some use cases with several running tasks but a low
cpu load. In this case, the small tasks are spread over several cpus by
load_balance, whereas they could easily be handled by one cpu
without a significant performance impact. If the cpu_power is
higher than 1024, the cpu is no longer seen as out of capacity by
load_balance as soon as a short process is running, and the main result
is that the small tasks stay on the same cpu. This configuration
is mainly useful for ARM dual-core systems when we want to power gate
one cpu. I use cyclictest to simulate such a use case.
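
To make the capacity argument concrete, here is a rough user-space
sketch, not kernel code. It assumes that load_balance derives the
capacity of a single-cpu group as DIV_ROUND_CLOSEST(cpu_power,
SCHED_POWER_SCALE) and treats the cpu as out of capacity once
nr_running exceeds that value, which is my reading of the code. The
helper names (scale_freq_power, group_capacity, out_of_capacity) are
made up for illustration, and the linear frequency scaling is only the
simplest possible policy.

```c
#include <assert.h>

#define SCHED_POWER_SCALE 1024UL

/* Same rounding as the kernel's DIV_ROUND_CLOSEST() for unsigned values. */
#define DIV_ROUND_CLOSEST(x, d) (((x) + ((d) / 2)) / (d))

/*
 * Hypothetical helper: cpu_power scaled linearly with the ratio of the
 * current frequency to a reference ("normal") frequency, both in kHz.
 */
static inline unsigned long scale_freq_power(unsigned long cur_khz,
					     unsigned long ref_khz)
{
	assert(ref_khz > 0);
	return SCHED_POWER_SCALE * cur_khz / ref_khz;
}

/* Assumed capacity of a single-cpu group, counted in tasks. */
static inline unsigned long group_capacity(unsigned long cpu_power)
{
	return DIV_ROUND_CLOSEST(cpu_power, SCHED_POWER_SCALE);
}

/* Assumed overload test: more runnable tasks than capacity. */
static inline int out_of_capacity(unsigned long cpu_power,
				  unsigned long nr_running)
{
	return nr_running > group_capacity(cpu_power);
}
```

Under these assumptions, a cpu with cpu_power = 1024 has a capacity of
one task, so a second short task already makes it look out of capacity.
The capacity only becomes two once cpu_power reaches 1536, which is why
a value above 1024 can keep the small tasks together on one cpu.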

>> It's explained in the kernel that cpu_power is
>> > used to distribute load on cpus and a cpu with more cpu_power will
>> > pick up more load. The default value is SCHED_POWER_SCALE and I
>> > increase the value if I want a cpu to have more load than another one.
>> > Is there an advised range for cpu_power value as well as some time
>> > scale constraints for updating the cpu_power value ?
>
> Basically 1024 is the unit and denotes the capacity of a full core at
> 'normal' speed.
>
> Typically cpufreq would down-clock a core and thus you'd end up with a
> smaller number (linearly proportional to the freq ratio etc. although if
> you want to go really fancy you could determine the actual
> throughput/freq curves).
>
> Things like x86 turbo mode would result in a >1024 value.
>
> Things like SMT would typically result in <1024 and the SMT sum over the
> core >1024 (if you're lucky).
>
>> > I'm also wondering why this scheduler feature is currently disable by default ?
>
> Because the only implementation in existence (x86) is broken and I
> haven't gotten around to fixing it. Arguable we should disable that for
> the time being, see below.
>
>> In discussions with Vincent regarding this, I've wondered whether
>> cpu_power wouldn't be better renamed to cpu_capacity since that is
>> what it really seems to describe.
>
> Possibly, but its been cpu_power for ages and we use capacity to
> describe something else.
>
> ---
>  arch/x86/kernel/cpu/sched.c |    9 ++++++++-
>  1 files changed, 8 insertions(+), 1 deletions(-)
>
> diff --git a/arch/x86/kernel/cpu/sched.c b/arch/x86/kernel/cpu/sched.c
> index a640ae5..90ae68c 100644
> --- a/arch/x86/kernel/cpu/sched.c
> +++ b/arch/x86/kernel/cpu/sched.c
> @@ -6,7 +6,14 @@
>  #include <asm/cpufeature.h>
>  #include <asm/processor.h>
>
> -#ifdef CONFIG_SMP
> +#if 0 /* def CONFIG_SMP */
> +
> +/*
> + * Currently broken, we need to filter out idle time because the aperf/mperf
> + * ratio measures actual throughput, not capacity. This means that if a logical
> + * cpu idles it will report less capacity and receive less work, which isn't
> + * what we want.
> + */
>
>  static DEFINE_PER_CPU(struct aperfmperf, old_perf_sched);
>
>
>
>