Re: sched: ARM: arch_scale_freq_power

From: Peter Zijlstra
Date: Tue Oct 11 2011 - 06:03:29 EST


On Tue, 2011-10-11 at 15:08 +0530, Amit Kucheria wrote:

> > That shouldn't be done using cpu_power, we have sched_smt_power_savings
> > and sched_mc_power_savings for stuff like that.
>
> AFAICT, sched_mc assumes all cores have the same capacity - which is
> certainly true of the x86 architecture. But on ARM you can see hybrid
> cores[1] designed using different fab technology, so that some cores
> can run at 'n' GHz and some at 'm' GHz. The idea being that when there
> isn't much to do (e.g. periodic keep-alives for messaging, email, etc.)
> you don't wake up the higher power-consuming cores.
>
> From TFA[1], "Sheeva was already capable of 1.2GHz, but the new
> design can go up to 1.5GHz. But only two of the 628's Sheeva cores run
> at the full 1.5GHz. The third one is down-clocked to 624MHz, an
> interesting design choice that saves on power but adds some extra
> utility. In a sense, the 628 could be called a 2.5-core design."

Cute :-)

> Are we mistaken in thinking that sched_mc cannot currently handle
> this use case? How would we 'tune' sched_mc to do this w/o playing
> with cpu_power?

Yeah, sched_mc wants some TLC there.

> > Although I would really like to kill all those different
> > sched_*_power_savings knobs and reduce it to one.
> >
> >> If the cpu_power is
> >> higher than 1024, the cpu is no longer seen as out of capacity by
> >> load_balance as soon as a short process is running, and the main
> >> result is that the small tasks will stay on the same cpu. This
> >> configuration is mainly useful for ARM dual-core systems when we
> >> want to power-gate one cpu. I use cyclictest to simulate such a
> >> use case.
> >
> > Yeah, but that's wrong.
>
> What is wrong - the use case simulation using cyclictest? Can you
> suggest better tools?

Using cpu_power to do power-saving load-balancing like that.

So ideally cpu_power is simply a factor in the weight balance decision
such that:

  cpu_weight_i      cpu_weight_j
  ------------  ~=  ------------
  cpu_power_i       cpu_power_j

This yields that under sufficient[*] load, e.g. 5 equal-weight tasks on
your 2.5-core thingy, you'd get a 2:2:1 distribution.
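
To make that concrete, here is a toy userspace check of those ratios
(not kernel code; the cpu_power values are my guess, scaling the 624MHz
core against the 1.5GHz ones):

#include <stdio.h>

#define NICE_0_LOAD	1024UL	/* weight of one nice-0 task */

int main(void)
{
	/* assumed: 1024 for a 1.5GHz core, 624/1500 * 1024 ~= 426 */
	unsigned long cpu_power[] = { 1024, 1024, 426 };
	unsigned long nr_tasks[]  = { 2, 2, 1 };	/* the 2:2:1 split */
	int i;

	for (i = 0; i < 3; i++) {
		unsigned long weight = nr_tasks[i] * NICE_0_LOAD;

		/* fixed-point weight/power ratio; equal == balanced */
		printf("cpu%d: weight=%lu power=%lu ratio=%lu\n", i,
		       weight, cpu_power[i], weight * 1024 / cpu_power[i]);
	}
	return 0;
}

That prints ratios of 2048:2048:2461 -- not exactly equal, which is what
[*] below is about, but closer than any other static split of 5 tasks
over those three cores.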

The decision on what to do on under-utilized systems should be separate
from this.

Currently the load-balancer doesn't know about 'short' running processes
at all; we just have nr_running and the aggregate weight, and it doesn't
know/care how long those tasks will be around for.
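
For reference, the entirety of what the balancer gets to look at per
runqueue today is roughly this (a simplified paraphrase, not the actual
struct rq/cfs_rq layout):

struct rq_load_view {
	unsigned long nr_running;	/* how many tasks are queued */
	unsigned long load;		/* sum of the queued tasks' weights */
	/*
	 * Note what's missing: nothing here says how long any of
	 * those tasks will remain runnable.
	 */
};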

Now for some of the cgroup crap we track a time-weighted weight average,
and pjt was talking about pulling that up into the normal code to get
rid of our multitude of different ways to calculate actual load. [**]

(/me pokes pjt with a sharp stick, where those patches at!?)

But that only gets you half-way there; you also need to compute an
effective time-weighted load per task to go with it. Now, while all
that is quite feasible, the problem is overhead. We very much already
are way too expensive and should be cutting back, not adding more and
more accounting.
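
Even a cheap version of that per-task tracking means something like the
following on every accounting period, for every task (a made-up sketch
of a geometrically decayed runnable average; the names and the decay
constant are mine, not pjt's):

#define LOAD_SHIFT	10	/* fixed point: 1024 == 1.0 */
#define DECAY		1002	/* ~0.978, i.e. half-life of ~32 periods */

struct task_load {
	unsigned long runnable_avg;	/* decayed runnable history */
};

/* called once per accounting period; 'runnable' says whether the
 * task was runnable during that period */
static void update_load_avg(struct task_load *tl, int runnable)
{
	tl->runnable_avg = (tl->runnable_avg * DECAY) >> LOAD_SHIFT;
	if (runnable)
		tl->runnable_avg += 1 << LOAD_SHIFT;
}

A multiply and a shift doesn't look like much, but doing it for every
task on every tick, on top of all the accounting we already do, is
exactly the overhead worry.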

[*] Sufficient such that the weight problem is feasible. E.g. 3 equal
tasks on 2 equal cores can never be statically balanced, and 2 unequal
tasks on 2 equal cores (or vice versa) can't ever be balanced either.

[**] I suspect this might solve the over-balancing problem triggered by
tasks woken from the same tick that also runs the load-balance pass.
That load-balance pass runs in softirq context and thus preempts all
those just-woken tasks, giving the impression the CPU is very busy,
while in fact most of those tasks will instantly go back to sleep after
finding nothing to do.