Re: sched: ARM: arch_scale_freq_power

From: Vincent Guittot
Date: Tue Oct 11 2011 - 12:03:32 EST


On 11 October 2011 12:27, Peter Zijlstra <a.p.zijlstra@xxxxxxxxx> wrote:
> On Tue, 2011-10-11 at 11:40 +0200, Vincent Guittot wrote:
>> On 11 October 2011 11:13, Peter Zijlstra <a.p.zijlstra@xxxxxxxxx> wrote:
>> > On Tue, 2011-10-11 at 10:51 +0200, Vincent Guittot wrote:
>> >> I have several goals. The 1st one is that I need to put more load on
>> >> some cpus when I have packages with different cpu frequencies.
>> >
>> > That should be rather easy.
>> >
>>
>> I agree. I was mainly wondering if I should use a [1-1024] or a
>> [1024-xxxx] range, and it seems that both can be used: SMT uses
>> <1024 and x86 turbo mode uses >1024.
>
> Well, turbo mode would typically only boost a cpu 25% or so, and only
> while idling other cores to keep under its thermal limit. So it's not
> sufficient to actually affect the capacity calculation much if at all.
>

OK

>> >> Then, I have some use cases which have several running tasks but a low
>> >> cpu load. In this case, the small tasks are spread over several cpus
>> >> by load_balance, whereas they could easily be handled by one cpu
>> >> without a significant performance impact.
>> >
>> > That shouldn't be done using cpu_power, we have sched_smt_power_savings
>> > and sched_mc_power_savings for stuff like that.
>> >
>>
>> sched_mc_power_savings works fine when we have more than 2 cpus but
>> can't be applied on a dual core, because it needs at least 2
>> sched_groups, and the nr_running of these sched_groups must be higher
>> than 0 but smaller than group_capacity, which is 1 on a dual core
>> system.
>
> SD_POWERSAVINGS_BALANCE does /=2 to nr_running, effectively doubling the
> capacity iirc. And I know some IBM dudes were toying with the idea of
> playing tricks with the capacity numbers, but that never went anywhere.
>

Yes, but that's only a special case for 2 tasks on a dual core, and the
SD_WAKE_AFFINE flag and cpu_idle_sibling can override this decision.
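
To make the group_capacity constraint concrete, here is a rough user-space
sketch (not kernel code). It assumes capacity is cpu_power divided by 1024
and rounded to nearest, and that a group is only considered for power
savings when 0 < nr_running < capacity. The helper names are made up for
illustration.

```c
#include <stdbool.h>

#define SCHED_POWER_SCALE 1024UL

/* Round-to-nearest division, same idea as the kernel's DIV_ROUND_CLOSEST. */
unsigned long div_round_closest(unsigned long x, unsigned long d)
{
	return (x + d / 2) / d;
}

/* Capacity of a one-cpu sched_group, in "number of tasks". */
unsigned long group_capacity(unsigned long cpu_power)
{
	return div_round_closest(cpu_power, SCHED_POWER_SCALE);
}

/*
 * Hypothetical helper: a group is a power-savings candidate only if it
 * is neither idle nor at (or above) its capacity.
 */
bool powersave_candidate(unsigned long nr_running, unsigned long cpu_power)
{
	return nr_running > 0 && nr_running < group_capacity(cpu_power);
}
```

With cpu_power = 1024 the capacity is 1, so no value of nr_running can
satisfy the condition. With cpu_power = 2048 the capacity becomes 2 and a
group with one running task qualifies.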

>> > Although I would really like to kill all those different
>> > sched_*_power_savings knobs and reduce it to one.
>> >
>> >> If the cpu_power is
>> >> higher than 1024, the cpu is no longer seen as out of capacity by
>> >> load_balance as soon as a short process is running, and the main
>> >> result is that the small tasks will stay on the same cpu. This
>> >> configuration is mainly useful for ARM dual core systems when we
>> >> want to power gate one cpu. I use cyclictest to simulate such a use
>> >> case.
>> >
>> > Yeah, but that's wrong.
>>
>> That's the only way I have found to gather small, unrelated tasks
>> on one cpu. Do you know of a better solution?
>
> How do you know the task is 'small' ?
>

I want to use cpufreq to be notified that we have a large/small cpu
load. If we have several tasks but the cpu uses the lowest frequency,
it "should" mean that we have small tasks that are running (less than
20ms*95% of added duration) and we could gather them on one cpu (by
increasing the cpu_power on a dual core).
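
As a rough illustration of that heuristic, a user-space sketch could look
like the following. Everything in it is hypothetical: the structure, the
helper names and the choice of 1024 * nr_cpus as the raised cpu_power are
assumptions for the example, not existing kernel interfaces.

```c
#include <stdbool.h>

#define SCHED_POWER_SCALE 1024UL

/* Hypothetical snapshot of one cpu; not a kernel structure. */
struct cpu_snapshot {
	unsigned int cur_khz;    /* current cpufreq frequency */
	unsigned int min_khz;    /* lowest available frequency */
	unsigned int nr_running; /* runnable tasks on this cpu */
};

/*
 * Several tasks are runnable but the governor still keeps the cpu at
 * its lowest frequency: assume the tasks are small.
 */
bool tasks_look_small(const struct cpu_snapshot *s)
{
	return s->nr_running > 1 && s->cur_khz <= s->min_khz;
}

/*
 * Pick the cpu_power to report. Raising it above 1024 lets one cpu
 * absorb the small tasks; otherwise keep the default scale.
 */
unsigned long pick_cpu_power(const struct cpu_snapshot *s,
			     unsigned int nr_cpus)
{
	if (tasks_look_small(s))
		return SCHED_POWER_SCALE * nr_cpus;
	return SCHED_POWER_SCALE;
}
```

On a dual core this reports 2048 while the cpu sits at its lowest
frequency with several tasks, and falls back to 1024 as soon as the
frequency rises.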

> For that you would need to track a time-weighted effective load average
> of the task and we don't have that.
>

Yes, that's why I use cpufreq until a better option, like a
time-weighted load average, is available.
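
For the sake of discussion, one possible shape for such a time-weighted
average is an exponentially decaying average of the fraction of each
period a task actually ran. The sketch below is user-space C with made-up
names and an arbitrary 1/8 weight; it is only an assumption about what
such tracking could look like, not something the scheduler provides today.

```c
#include <stdint.h>

#define LOAD_SCALE 1024u

/*
 * Fold one period into the running average:
 *   avg = avg * 7/8 + sample * 1/8
 * where sample is the fraction of the period the task ran, scaled to
 * [0, LOAD_SCALE]. The 64-bit intermediate avoids overflow in the
 * multiplication.
 */
uint32_t ewma_update(uint32_t avg, uint32_t ran_us, uint32_t period_us)
{
	uint32_t sample;

	if (period_us == 0)
		return avg; /* nothing to account for */

	sample = (uint32_t)(((uint64_t)ran_us * LOAD_SCALE) / period_us);
	return avg - (avg >> 3) + (sample >> 3);
}
```

A task that runs for the whole of every 20ms period drifts towards 1024,
while one that runs 1ms out of every 20ms settles at a small value, which
would give a per-task notion of 'small' without going through cpufreq.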

> [ how bad is all this u64 math on ARM btw? and when will ARM finally
>  agree all this 32bit nonsense is a waste of time and silicon? ]
>
> But yeah, the whole nr_running vs capacity thing was traditionally to
> deal with spreading single tasks around. And traditional power aware
> scheduling was mostly about packing those on sockets (keeps other
> sockets idle) instead of spreading them around sockets (optimizes
> cache).
>
> Now I wouldn't at all mind you ripping out all that
> sched_*_power_savings crap and replacing it, I doubt it actually works
> anyway. I haven't got many patches on the subject, and I know I don't
> have the equipment to measure power usage.
>
> Also, the few patches I got mostly made the sched_*_power_savings mess
> bigger, which I refuse to do (what sysad wants to have a 27-state space
> to configure his power aware scheduling). This has mostly made people go
> away instead of fixing things up :-(
>
> As to what the replacement would have to look like, dunno, it's not
> something I've really thought much about, but maybe the time-weighted
> stuff is the only sane approach, that combined with options on how to
> spread tasks (core, socket, node, etc..).
>
> I really think changing the load-balancer is the right way to go about
> solving your power issue (hot-plugging a cpu really is an insane way to
> idle a core) and I'm open to discussing what would work for you.
>

Great. My 1st goal was not to modify the load-balancer and sched_mc
(or as little as possible) and to study how I could tune the scheduler
parameters to get the best power consumption on ARM platforms. Now,
changing the load-balancer is probably a better solution.

> All I really ask is to not cobble something together, the load-balancer
> is a horridly complex thing already and the last thing it needs is more
> special cases that don't interact properly.
>
>
>