[RFC PATCH V2 00/19] Power Scheduler Design

From: Preeti U Murthy
Date: Mon Aug 11 2014 - 07:31:55 EST


The new power aware scheduling framework is being designed with the goal of
keeping all cpu power management in one place. Today the power management
policies are fragmented between the cpuidle and cpufreq subsystems, which
makes power management inconsistent. On top of this, we were integrating
task packing algorithms into the scheduler, which could potentially worsen
the situation.

The new power aware scheduler design will have all policies, all metrics and
all averaging concerning cpuidle and cpufreq in one place, that being the
scheduler. This patchset lays the foundation for this approach, to help
replace the existing fragmented approach to cpu power savings.

NOTE: This patchset targets only cpuidle. cpufreq can be integrated into
this design along the same lines.

The design is broken down into incremental steps which will enable
easy validation of the power aware scheduler. This is by no means complete
and will require more work to get to a stage where it can beat the
current approach. As I said, this is just the foundation to help us
get started. The subsequent patches can be small, incremental, measured steps.

Ingo had pointed out this approach in http://lwn.net/Articles/552889/ and I
have tried my best at understanding and implementing the initial steps that
he suggested.

1. Start from the dumbest possible state: all CPUs are powered up fully,
there's essentially no idle state selection.

2. Then go for the biggest effect first and add the ability to idle in a
lower power state (with new functions and a low level driver that implements
this for the platform, with no policy embedded into it).

3. Implement the task packing algorithm.

This patchset implements the above three steps and makes the fundamental design
of power aware scheduler clear. It shows how:

1. The design should be non-intrusive to the existing code. It should be
enabled/disabled by a config switch. This way we can continue to work towards
making it better without having to worry about regressing the kernel, and
yet have it in the kernel at the same time; a confidence booster that it is
making headway.
CONFIG_SCHED_POWER is the switch that makes the new code appear when turned on,
and makes it disappear and default to the original code when turned off.
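
A minimal sketch of how such a config switch usually gates code, written as
stand-alone userspace C and not taken from the patchset. Only
CONFIG_SCHED_POWER is a name from this series; legacy_select_state(),
sched_power_select_state() and pick_idle_state() are hypothetical names
used for illustration.

```c
#include <assert.h>

/* Build with -DCONFIG_SCHED_POWER to compile in the new path. */

/* Hypothetical stand-in for the existing, governor-driven selection. */
int legacy_select_state(int nr_states)
{
    assert(nr_states > 0);
    return nr_states - 1;   /* pretend the governor chose the deepest state */
}

#ifdef CONFIG_SCHED_POWER
/* Hypothetical new path: the "dumb" first step never leaves state 0. */
int sched_power_select_state(int nr_states)
{
    assert(nr_states > 0);
    return 0;
}
#else
/* Switch off: the new entry point collapses to the original behaviour. */
static inline int sched_power_select_state(int nr_states)
{
    return legacy_select_state(nr_states);
}
#endif

/* Callers use one entry point regardless of the config setting. */
int pick_idle_state(int nr_states)
{
    return sched_power_select_state(nr_states);
}
```

With the switch off, callers see exactly the old behaviour, which is the
property that allows the new code to sit in the tree without regressing it.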

2. The design should help us test it better. Like Ingo pointed out:

"Important: it's not a problem that the initial code won't outperform the
current kernel's performance. It should outperform the _initial_ 'dumb'
code in the first step. Then the next step should outperform the previous
step, etc.
The quality of this iterative approach will eventually surpass the
combined effect of currently available but non-integrated facilities."

This is precisely what this design does. PATCH[1/19] disables the cpuidle and
cpufreq subsystems altogether if CONFIG_SCHED_POWER is enabled.
This is the dumb code. Our subsequent patches should outperform this.

3. Introduce a low level driver which interfaces the scheduler with C-state
switching. Again, Ingo had pointed this out, saying:
"It should be presented to the scheduler in a platform independent fashion,
but without policy embedded: a low level platform driver interface in essence."

PATCH[2/19] ensures that CPUIDLE governors no longer control
idle state selection. The idle state selection and policies are moved into
kernel/sched/power.c. True, it's the same code from the menu governor; however,
it has been moved into scheduler specific code and no longer functions like
a driver. It's meant to be part of the core kernel. The "low level driver" lives
under drivers/cpuidle/cpuidle.c like before. It registers platform specific
cpuidle drivers and does other low level work that the scheduler needn't
bother about. It has no policies embedded into it whatsoever. Importantly, it
is an entry point to switching C-states and nothing beyond that.
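
The policy/mechanism split described above can be sketched roughly as
follows. This is an illustrative userspace model, not the code moved into
kernel/sched/power.c: the selection rule is a deliberately simplified
"deepest state that fits" heuristic, and struct idle_state,
select_idle_state(), enter_idle_state() and fake_enter() are all
hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical description of one C-state; field names are illustrative. */
struct idle_state {
    unsigned int exit_latency_us;     /* time needed to wake up again */
    unsigned int target_residency_us; /* minimum stay for the state to pay off */
};

/*
 * Policy (scheduler side): pick the deepest state whose target residency
 * fits the predicted idle period and whose exit latency respects the
 * latency limit. States are assumed to be ordered shallow to deep.
 */
int select_idle_state(const struct idle_state *states, size_t nr,
                      unsigned int predicted_idle_us,
                      unsigned int latency_limit_us)
{
    int chosen = 0;
    size_t i;

    assert(states != NULL && nr > 0);
    for (i = 1; i < nr; i++) {
        if (states[i].target_residency_us > predicted_idle_us)
            break;
        if (states[i].exit_latency_us > latency_limit_us)
            break;
        chosen = (int)i;
    }
    return chosen;
}

/* Mechanism (driver side): enter the requested state, decide nothing. */
typedef int (*enter_fn)(int index);

int enter_idle_state(enter_fn enter, int index)
{
    return enter(index);
}

/* Stand-in for a platform callback; it only reports the state entered. */
int fake_enter(int index)
{
    return index;
}
```

The point of the split is that enter_idle_state() never inspects latencies
or predictions; every decision is made before it is called.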

PATCH[3/19] enumerates idle states and their parameters in the scheduler
topology. This lets the scheduler know the cost of entering and exiting each
idle state, which it can make use of going ahead. As an example, this patchset
shows how the platform specific cpuidle driver should help fill the idle state
details into the topology. This fundamental information is missing today in the
scheduler.
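
As a rough illustration of what "idle states in the topology" could look
like, the sketch below has a platform driver hand its state table to a
topology level, which the scheduler side can then query. The layout and all
names here (struct topo_level, topo_set_idle_states(),
topo_max_exit_latency(), MAX_IDLE_STATES) are assumptions made for the
example and do not mirror the structures in the patch.

```c
#include <assert.h>
#include <string.h>

#define MAX_IDLE_STATES 8   /* arbitrary bound chosen for this sketch */

struct idle_state_info {
    unsigned int exit_latency_us;
    unsigned int target_residency_us;
};

/* Hypothetical per-level topology record carrying idle state parameters. */
struct topo_level {
    const char *name;
    int nr_states;
    struct idle_state_info states[MAX_IDLE_STATES];
};

/*
 * Driver side: copy the platform's idle state table into the topology.
 * Returns the number of states recorded, or -1 on bad input.
 */
int topo_set_idle_states(struct topo_level *tl,
                         const struct idle_state_info *src, int nr)
{
    assert(tl != NULL);
    if (src == NULL || nr < 0 || nr > MAX_IDLE_STATES)
        return -1;
    memcpy(tl->states, src, (size_t)nr * sizeof(*src));
    tl->nr_states = nr;
    return nr;
}

/* Scheduler side: worst-case wake-up cost among the states at this level. */
unsigned int topo_max_exit_latency(const struct topo_level *tl)
{
    unsigned int max = 0;
    int i;

    assert(tl != NULL);
    for (i = 0; i < tl->nr_states; i++)
        if (tl->states[i].exit_latency_us > max)
            max = tl->states[i].exit_latency_us;
    return max;
}
```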

These two patches, PATCH[2/19] and PATCH[3/19], are not expected to change the
performance/power savings in any way. They are just the first steps towards the
integrated approach of the power aware scheduler.

The patches PATCH[4/19] to PATCH[18/19] do task packing. This is the series
that Alex Shi had posted long ago: https://lkml.org/lkml/2013/3/30/78
However, this patch series will come into effect only if CONFIG_SCHED_POWER is
enabled. It is this series which is expected to bring about changes in
performance and power savings; not necessarily better than the existing code,
but it should certainly be better than the dumb code.
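
For readers unfamiliar with task packing, the general idea can be sketched
as below: prefer the busiest cpu that still has room for the task, so that
the remaining cpus stay idle longer. This is a generic illustration and not
the algorithm from Alex Shi's series; pick_packing_cpu(), the percentage
based utilization and the limit parameter are all assumptions.

```c
#include <assert.h>
#include <stddef.h>

/*
 * util[i] is the current utilization of cpu i in percent (0..100).
 * Returns the busiest cpu that can take task_util without exceeding
 * limit; if no cpu has room, falls back to the least utilized cpu.
 */
int pick_packing_cpu(const unsigned int *util, size_t nr_cpus,
                     unsigned int task_util, unsigned int limit)
{
    int best = -1;
    int idlest = 0;
    size_t i;

    assert(util != NULL && nr_cpus > 0);
    for (i = 0; i < nr_cpus; i++) {
        if (util[i] < util[idlest])
            idlest = (int)i;
        if (util[i] + task_util > limit)
            continue;               /* no room left on this cpu */
        if (best < 0 || util[i] > util[best])
            best = (int)i;          /* pack onto the busiest candidate */
    }
    return best >= 0 ? best : idlest;
}
```

With utilizations {70, 20, 0, 0}, a limit of 80 and a task needing 20, the
task lands on cpu 1 and not on one of the idle cpus, which can then stay in
a deep idle state.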

Our subsequent efforts should surpass the performance/power savings of the
existing code. This patch series is compile tested only.

V1 of this power efficient scheduling design was posted by Morten after
Ingo posted his suggestions on http://lwn.net/Articles/552889/.
[RFC][PATCH 0/9] sched: Power scheduler design proposal:
https://lkml.org/lkml/2013/7/15/101
However, it decoupled the scheduler into a regular scheduler and a power
scheduler, with the latter controlling the cpus that could be used by the
regular scheduler. We do not need this kind of decoupling. With the foundation
that this patchset lays, it should be relatively easy to make the existing
scheduler power aware.

---

Alex Shi (16):
      sched: add sched balance policies in kernel
      sched: add sysfs interface for sched_balance_policy selection
      sched: log the cpu utilization at rq
      sched: add new sg/sd_lb_stats fields for incoming fork/exec/wake balancing
      sched: move sg/sd_lb_stats struct ahead
      sched: get rq potential maximum utilization
      sched: detect wakeup burst with rq->avg_idle
      sched: add power aware scheduling in fork/exec/wake
      sched: using avg_idle to detect bursty wakeup
      sched: packing transitory tasks in wakeup power balancing
      sched: add power/performance balance allow flag
      sched: pull all tasks from source grp and no balance for prefer_sibling
      sched: add new members of sd_lb_stats
      sched: power aware load balance
      sched: lazy power balance
      sched: don't do power balance on share cpu power domain

Preeti U Murthy (3):
      sched/power: Remove cpu idle state selection and cpu frequency tuning
      sched/power: Move idle state selection into the scheduler
      sched/idle: Enumerate idle states in scheduler topology


 Documentation/ABI/testing/sysfs-devices-system-cpu |   23 +
 arch/powerpc/Kconfig                               |    1
 arch/powerpc/platforms/powernv/Kconfig             |   12
 drivers/cpufreq/Kconfig                            |    2
 drivers/cpuidle/Kconfig                            |   10
 drivers/cpuidle/cpuidle-powernv.c                  |   10
 drivers/cpuidle/cpuidle.c                          |   65 ++
 include/linux/sched.h                              |   16 -
 include/linux/sched/sysctl.h                       |    3
 kernel/Kconfig.sched                               |   11
 kernel/sched/Makefile                              |    1
 kernel/sched/debug.c                               |    3
 kernel/sched/fair.c                                |  632 +++++++++++++++++++-
 kernel/sched/power.c                               |  480 +++++++++++++++
 kernel/sched/sched.h                               |   16 +
 kernel/sysctl.c                                    |    9
 16 files changed, 1234 insertions(+), 60 deletions(-)
 create mode 100644 kernel/Kconfig.sched
 create mode 100644 kernel/sched/power.c
