Re: [RFC PATCH 06/16] arm: topology: Define TC2 sched energy and provide it to scheduler

From: Peter Zijlstra
Date: Tue Jun 10 2014 - 06:16:39 EST


On Sun, Jun 08, 2014 at 07:26:29AM +0800, Yuyang Du wrote:
> Ok. I think we understand each other. But one more thing, I said P ~ V^3,
> because P ~ V^2*f and f ~ V, so P ~ V^3. Maybe some frequencies share the same
> voltage, but you can still safely assume V changes with f in general, and it
> will be more and more so, since we do need finer control over power consumption.

I didn't know the frequency part was proportionate to another voltage
term, ok, then the cubic term makes sense.
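
A minimal sketch of that cubic relation, assuming dynamic power follows C*V^2*f and that voltage scales linearly with frequency. The names dynamic_power, power_at and volts_per_hz are hypothetical, and the constants are placeholders rather than data for any real part.

```python
def dynamic_power(voltage, freq, capacitance=1.0):
    """Dynamic power model P = C * V^2 * f (arbitrary units)."""
    return capacitance * voltage ** 2 * freq


def power_at(freq, volts_per_hz=1.0):
    """Power at a given frequency, assuming V is proportional to f."""
    voltage = volts_per_hz * freq  # assumption: V scales linearly with f
    return dynamic_power(voltage, freq)


# Under these assumptions, doubling the frequency costs 2^3 times the power.
ratio = power_at(2.0) / power_at(1.0)
```

Where several frequencies share one voltage, the exponent drops towards 1 across that range, so the cubic is an upper-end approximation.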

> > Sure, but realize that we must fully understand this governor and
> > integrate it in the scheduler if we're to attain the goal of IPC/watt
> > optimized scheduling behaviour.
> >
>
> Attain the goal of IPC/watt optimized?
>
> I don't see how it can be done like this. As I said, what is unknown for
> prediction is perf scaling *and* changing workload. So the challenge for pstate
> control is in both. But I see more challenge in the changing workload than
> in the performance scaling or the resulting IPC impact (if workload is
> fixed).

But for the scheduler the workload change isn't that big a problem; we
know the history of each task, we know when tasks wake up and when we
move them around. Therefore we can fairly accurately predict this.

And given a simple P state model (like ARM) where the CPU simply does
what you tell it to, that all works out. We can change P-state at task
wakeup/sleep/migration and compute the most efficient P-state, and task
distribution, for the new task-set.
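
A rough sketch of what "compute the most efficient P-state for the new task-set" could look like, assuming a simple model where the CPU runs at exactly the requested P-state and idle power is zero. The P_STATES table, its numbers and pick_pstate are hypothetical illustrations, not measured TC2 data or an existing kernel interface.

```python
# Hypothetical P-state table: (frequency in MHz, power in mW while running).
P_STATES = [(500, 200), (1000, 700), (1500, 1800)]
MAX_FREQ = 1500


def pick_pstate(task_utils):
    """Return the frequency of the cheapest P-state that fits the task set.

    task_utils: per-task utilization as a fraction of capacity at MAX_FREQ,
    which the scheduler is assumed to know from per-task history.
    """
    demand = sum(task_utils)
    best = None
    for freq, power in P_STATES:
        capacity = freq / MAX_FREQ
        if capacity < demand:
            continue  # this P-state cannot fit the task set
        busy = demand / capacity  # fraction of time spent running
        energy = power * busy     # average power, idle assumed free
        if best is None or energy < best[1]:
            best = (freq, energy)
    # Fall back to the highest P-state if the CPU is overcommitted.
    return best[0] if best else MAX_FREQ
```

Re-running pick_pstate() at every wakeup, sleep and migration is the scheduler-driven part; nothing here waits for a sampling window.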

> Currently, all freq governors take CPU utilization (load%) as the indicator
> (target), which can serve both: workload and perf scaling.

So the current cpufreq stuff is terminally broken in too many ways; it's
sampling, so it misses a lot of changes, and it's strictly CPU-local, so it
completely misses SMP information (like the migrations etc.).

If we move a 50% task from CPU1 to CPU0, a sampling thing takes time to
adjust on both CPUs, whereas if it's scheduler driven, we can instantly
adjust and be done, because we _know_ what we moved.
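
A toy model of that difference, assuming per-CPU utilization is a single number and that a sampling governor only refreshes its view when its timer fires. Cpu, migrate and governor_sample are hypothetical names for illustration, not cpufreq or scheduler interfaces.

```python
class Cpu:
    def __init__(self, util):
        self.util = util           # utilization as the scheduler knows it
        self.governor_util = util  # what a sampling governor last observed


def migrate(src, dst, task_util):
    """Scheduler-driven view: both CPUs are updated at migration time."""
    src.util -= task_util
    dst.util += task_util


def governor_sample(cpu):
    """Sampling view: only catches up when the sample timer fires."""
    cpu.governor_util = cpu.util


cpu0, cpu1 = Cpu(0.2), Cpu(0.7)
migrate(cpu1, cpu0, 0.5)  # move the 50% task from CPU1 to CPU0

# Until the next sample, the governor still believes the old numbers.
stale = (cpu0.governor_util, cpu1.governor_util)
governor_sample(cpu0)
governor_sample(cpu1)
```

Between migrate() and the next governor_sample(), CPU0 is underclocked and CPU1 overclocked for the load each actually carries.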

Now some of that is due to hysterical raisins, and some of that due to
broken hardware (hardware that needs to schedule in order to change its
state because it's behind some broken bus or other). But we should
basically kill off cpufreq for anything recent and sane.

> As for IPC/watt optimized, I don't see how it can be practical. Too micro to
> be used for the general well-being?

What other target would you optimize for? The purpose here is to build
an energy aware scheduler, one that schedules tasks so that the total
amount of energy, for the given amount of work, is minimal.

So we can't measure in Watt, since if we forced the CPU into the lowest
P-state (or even C-state for that matter) work would simply not
complete. So we need a complete energy term.

Now. IPC is instructions/cycle, Watt is Joule/second, so IPC/Watt is

  instructions   second
  ------------ * ------ ~ instructions / joule
     cycle       joule

Seeing how both cycles and seconds are time units.

So for any given number of instructions, i.e. the work that needs to be
done, we want the minimal amount of energy consumed, and IPC/Watt is the
natural metric to measure this over an entire workload.
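
A small worked version of that arithmetic, as a sketch. Treating cycles and seconds as interchangeable time units hides a factor of the clock frequency, so the code below keeps it explicit. The function names and all numbers are hypothetical.

```python
def instructions_per_joule(ipc, freq_hz, power_watt):
    """(instructions/cycle) * (cycles/second) / (joules/second)."""
    instr_per_sec = ipc * freq_hz
    return instr_per_sec / power_watt


def energy_for_work(instructions, ipc, freq_hz, power_watt):
    """Joules needed to retire a fixed number of instructions."""
    seconds = instructions / (ipc * freq_hz)
    return power_watt * seconds


# Made-up example: 1e9 instructions at IPC 1.0, 1 GHz, 2 W.
ipj = instructions_per_joule(1.0, 1e9, 2.0)
joules = energy_for_work(1e9, 1.0, 1e9, 2.0)
```

For a fixed amount of work, maximizing instructions per joule and minimizing energy_for_work() are the same thing, which is why raw Watt alone is not a usable target.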
