Looking at the discussion, it seems that people have slightly different
views, but most agree that the goal is an integrated scheduling,
frequency, and idle policy, as you pointed out from the beginning.
What is less clear is what such a design would look like. Catalin has
suggested two different approaches: integrating cpufreq into the load
balancing, or letting the scheduler focus on load balancing and
extending cpufreq to also restrict the number of cpus available to the
scheduler using cpu_power. The former approach would increase the
scheduler complexity significantly, as I already highlighted in my first
reply. The latter approach introduces a way to, at least initially,
separate load balancing from capacity management, which I think is an
interesting approach. Based on this idea I propose the following design:

                    +-----------------+
                    |                 |     +----------+
        current load| Power scheduler |<----+ cpufreq  |
         +--------->| sched/power.c   +---->| driver   |
         |          |                 |     +----------+
         |          +-------+---------+
         |             ^    |
   +-----+---------+   |    |
   |               |   |    | available capacity
   |  Scheduler    |<--+----+ (e.g. cpu_power)
   |  sched/fair.c |   |
   |               +--+|
   +---------------+  ||
          ^           ||
          |           v|
+---------+--------+  +----------+
| task load metric |  | cpuidle  |
| arch/*           |  | driver   |
+------------------+  +----------+

The intention is that the power scheduler will implement the (unified)
power policy. It gets the current load of the system from the scheduler.
Based on this information it will adjust the compute capacity available
to the scheduler and drive frequency changes such that enough compute
capacity is available to handle the current load. If the total load can
be handled by a subset of cpus, it will reduce the capacity of the
excess cpus to 0 (cpu_power=1). Likewise, if the load increases it will
increase the capacity of one or more idle cpus to allow the scheduler to
spread the load. The power scheduler has knowledge about the power
topology and will guide the scheduler to idle the most suitable cpus by
reducing their capacity. Global idle decisions will be handled by the
power scheduler, so cpuidle can over time be reduced to just a driver
once we have added C-state selection to the power scheduler.
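
To make the capacity adjustment concrete, here is a rough userspace
sketch of the policy described above. It is only an illustration under
stated assumptions: power_sched_update(), CAPACITY_FULL and CAPACITY_OFF
are hypothetical names, every cpu is assumed to have the same nominal
capacity of 1024, and the power topology and frequency are ignored.

```c
#include <stddef.h>

#define CAPACITY_FULL 1024UL /* assumed nominal cpu_power of one cpu */
#define CAPACITY_OFF  1UL    /* "zero" capacity, i.e. cpu_power=1 */

/*
 * Hypothetical helper: given the total load reported by the scheduler
 * (in the same units as cpu_power), leave just enough cpus at full
 * capacity to cover it and reduce the capacity of the excess cpus.
 * Returns the number of cpus left at full capacity.
 */
size_t power_sched_update(unsigned long total_load,
                          unsigned long *cpu_power, size_t nr_cpus)
{
    /* Round up: a partially loaded cpu still has to be available. */
    size_t needed = (total_load + CAPACITY_FULL - 1) / CAPACITY_FULL;
    size_t i;

    if (needed == 0)
        needed = 1;          /* always keep one cpu available */
    if (needed > nr_cpus)
        needed = nr_cpus;    /* cannot enable more cpus than exist */

    for (i = 0; i < nr_cpus; i++)
        cpu_power[i] = (i < needed) ? CAPACITY_FULL : CAPACITY_OFF;

    return needed;
}
```

A real implementation would pick which cpus to keep based on the power
topology rather than simply taking the lowest-numbered ones.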

The scheduler is left to focus on scheduling mechanics and finding the
best possible load balance on the cpu capacities set by the power
scheduler. It will share a detailed view of the current load with the
power scheduler to enable it to make the right capacity adjustments. The
scheduler will need some optimization to cope better with asymmetric
compute capacities. We may want to reduce the capacity of some cpus to
increase their idle time while letting others take the majority of the
load.
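
As a sketch of what balancing on asymmetric capacities could mean, the
following picks the cpu whose load relative to its capacity would be
lowest after placing a task on it. pick_cpu() and its parameters are
hypothetical names for illustration, not existing scheduler code, and
cpu_power is assumed to be at least 1 for every cpu.

```c
#include <stddef.h>

/*
 * Hypothetical helper: return the index of the cpu with the lowest
 * (load + task_load) / cpu_power ratio. The ratios are compared by
 * cross-multiplication to avoid integer division.
 */
size_t pick_cpu(const unsigned long *load, const unsigned long *cpu_power,
                size_t nr_cpus, unsigned long task_load)
{
    size_t best = 0;
    size_t i;

    for (i = 1; i < nr_cpus; i++) {
        unsigned long long lhs =
            (unsigned long long)(load[i] + task_load) * cpu_power[best];
        unsigned long long rhs =
            (unsigned long long)(load[best] + task_load) * cpu_power[i];

        if (lhs < rhs)
            best = i;
    }
    return best;
}
```

With this comparison, a cpu whose capacity has been reduced to
cpu_power=1 looks heavily loaded as soon as any task is considered for
it, so the load naturally stays on the cpus that still have capacity.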

Frequency scaling has a problematic impact on PJT's load metric, which
was pointed out a while ago by Chris Redpath
<https://lkml.org/lkml/2013/4/16/289>. So I agree with Arjan's
suggestion to change the load calculation basis to something which is
frequency invariant, using whatever counters are available on the
specific platform.
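
One simple way to picture frequency invariance, assuming no better
hardware counters are available, is to scale the time a task has run by
the ratio of the current frequency to the maximum frequency. The helper
below is a hypothetical sketch of that idea only; the function name and
parameters are made up for illustration.

```c
#include <stdint.h>

/*
 * Hypothetical helper: convert runtime accrued at cur_khz into the
 * equivalent runtime at max_khz, so that the same amount of work
 * contributes the same load regardless of the frequency it ran at.
 */
uint64_t freq_invariant_delta(uint64_t delta_ns, uint32_t cur_khz,
                              uint32_t max_khz)
{
    if (max_khz == 0)
        return delta_ns;     /* no frequency information: leave unscaled */

    return delta_ns * cur_khz / max_khz;
}
```

For example, 10 ms of runtime at half the maximum frequency would be
accounted as 5 ms, instead of making the task look twice as heavy as it
would at full speed.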

I'm aware that the scheduler and power scheduler decisions may be
inextricably linked, so we may decide to merge them. However, I think it
is worth trying to keep the power scheduling decisions out of the
scheduler until we have proven it infeasible.

We are going to start working on this design and see where it takes us.
We will post any results and suggested patches for folk to comment on.
As a starting point, we are planning to create a power scheduler
(kernel/sched/power.c) similar to a cpufreq governor that does capacity
management, and then evolve the solution from there.