[PATCH 0/1] RAPL (Running Average Power Limit) driver

From: Jacob Pan
Date: Tue Apr 02 2013 - 18:16:04 EST

The RAPL (Running Average Power Limit) interface provides platform
software with the ability to monitor, control, and get notifications on
SoC power consumption. Since its first appearance on Sandy Bridge, more
features have been added to extend its usage. In RAPL, platforms are
divided into domains for fine-grained control. These domains include
package, DRAM controller, CPU core (power plane 0), graphics uncore
(power plane 1), etc.

The purpose of this driver is to expose RAPL for userspace
consumption. Overall, RAPL fits in the generic thermal layer in
that platform-level power capping and monitoring are mainly used for
thermal management, and the thermal layer provides the abstracted
interface needed to have portable applications.

Specifically, userspace is presented with a per-domain cooling device
that has a sysfs link to the domain's kobject. Although a RAPL domain
provides many parameters for fine tuning, only the long-term power limit
is exposed as the single knob via the cooling device state. The rest of
the parameters are still accessible via the linked kobject. This
simplifies the interface for both simple and advanced use cases.

1. sysfs layout

As an x86 platform driver, the RAPL driver binds to supported CPU ids
during the probing phase. Once domains are discovered, a kobject is
created for each domain. Each kobject is also linked with its cooling
device after the cooling device is registered with the generic thermal
layer.

e.g. the package RAPL domain is registered as cooling device #15, with
the link "device" pointing back to its kobject.

├── cur_state
├── device -> ../../../platform/intel_rapl/rapl_domains/package
├── max_state
├── power
├── subsystem -> ../../../../class/thermal
├── type
└── uevent

In the driver's private sysfs area, domain kobjects are grouped under a
kset, which exposes global data.
├── driver -> ../../../bus/platform/drivers/intel_rapl
├── power
├── rapl_domains
│   ├── package
│   │   └── thermal_cooling -> ../../../../virtual/thermal/cooling_device15
│   ├── power_plane_0
│   │   └── thermal_cooling -> ../../../../virtual/thermal/cooling_device16
│   └── power_plane_1
│       └── thermal_cooling -> ../../../../virtual/thermal/cooling_device18
└── subsystem -> ../../../bus/platform
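
The layout above can be walked from userspace to find which cooling
device backs each RAPL domain. The following is a minimal Python sketch,
not part of the patch. The absolute path in RAPL_DOMAINS_DIR is an
assumption derived from the relative "device" link shown earlier, and the
function name is hypothetical.

```python
import os

# Assumed absolute location of the kset, inferred from the relative links above.
RAPL_DOMAINS_DIR = "/sys/devices/platform/intel_rapl/rapl_domains"


def map_domains_to_cooling_devices(domains_dir=RAPL_DOMAINS_DIR):
    """Return {domain_name: cooling_device_name} by following thermal_cooling links."""
    mapping = {}
    for domain in sorted(os.listdir(domains_dir)):
        link = os.path.join(domains_dir, domain, "thermal_cooling")
        # Entries without a thermal_cooling symlink (e.g. global attributes) are skipped.
        if os.path.islink(link):
            # The link target ends in e.g. ".../thermal/cooling_device15".
            mapping[domain] = os.path.basename(os.readlink(link))
    return mapping


if __name__ == "__main__":
    if os.path.isdir(RAPL_DOMAINS_DIR):
        for domain, cdev in map_domains_to_cooling_devices().items():
            print(f"{domain} -> {cdev}")
```

Only the link names are read, so the sketch needs no write access and
does not depend on any per-domain attribute.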

2. per domain parameters

These are the fine-tuning parameters used only by advanced
power/thermal management applications. Refer to Intel SDM ch14 for
details.

root@chromoly:/sys/class/thermal/cooling_device15/device# grep . *

3. event notifications

The RAPL driver uses eventfd to provide userspace notifications on
selected events. A file node called "event_control" is created for each
RAPL domain. The user writes a control file descriptor, an eventfd
descriptor, and a threshold to the event_control file. The user
application can then use poll/select or a blocking read on the eventfd
to get notifications from the driver. Multiple events are allowed for
each domain, but only a single threshold is accepted.
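
The registration flow described above can be sketched in Python as
follows. This is an illustration under stated assumptions, not the
listener shipped with the patch: the function names are hypothetical,
and the order of the three fields written to event_control (eventfd
first, then control fd, then threshold) follows the event_fd_listener
description in this letter and should be checked against the driver.

```python
import os


def format_registration(efd, cfd, threshold):
    """Build the text written to event_control.

    Field order "efd cfd threshold" is an assumption, not a verified ABI.
    """
    return f"{efd} {cfd} {threshold}"


def wait_for_event(domain_dir, control_name, threshold):
    """Register an eventfd for one control file and block until it fires."""
    cfd = os.open(os.path.join(domain_dir, control_name), os.O_RDONLY)
    efd = os.eventfd(0)  # requires Linux and Python 3.10 or newer
    try:
        with open(os.path.join(domain_dir, "event_control"), "w") as ctl:
            ctl.write(format_registration(efd, cfd, threshold))
        # Blocking read: returns the eventfd counter once it has been signalled.
        return os.eventfd_read(efd)
    finally:
        os.close(efd)
        os.close(cfd)
```

A call such as wait_for_event("/sys/class/thermal/cooling_device18/device",
"power", 5000) would mirror the event_fd_listener invocation in the usage
examples. Both descriptors stay open until the read returns, since the
registration refers to them by number.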

4. Usage Examples (assume the topology in the sysfs layout above)

- set the power limit of the package domain (whole SoC package) to 6 W
root@chromoly:~# echo 6000 > /sys/class/thermal/cooling_device15/cur_state

- set the power limit of the pp1 domain (graphics) to 4 W
root@chromoly:~# echo 4000 > /sys/class/thermal/cooling_device18/cur_state

- check the current power usage in mW of the pp1 domain
root@chromoly:~# cat /sys/class/thermal/cooling_device18/cur_state

- set event notification when power consumption of graphics unit crosses
event_fd_listener /sys/class/thermal/cooling_device18/device/power 5000
(event_fd_listener opens the control file power and creates an eventfd,
then writes efd, cfd, threshold to the event_control file of the given
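
The power limit examples above write a value in milliwatts to cur_state.
A small Python sketch of the same operation follows. The function names
are hypothetical, and the check against max_state assumes that
max_state is also expressed in milliwatts and represents the largest
accepted limit, which this letter does not state explicitly.

```python
import os

THERMAL_CLASS_DIR = "/sys/class/thermal"


def watts_to_milliwatts(watts):
    """Convert watts to the integer milliwatt value written to cur_state."""
    return int(round(watts * 1000))


def set_power_limit(cooling_device, watts, thermal_dir=THERMAL_CLASS_DIR):
    """Write a power limit to <thermal_dir>/<cooling_device>/cur_state.

    Returns the value written, in milliwatts.
    """
    cdev_dir = os.path.join(thermal_dir, cooling_device)
    limit_mw = watts_to_milliwatts(watts)
    # Assumption: max_state is in mW and bounds the accepted limit.
    with open(os.path.join(cdev_dir, "max_state")) as f:
        max_mw = int(f.read().strip())
    if not 0 < limit_mw <= max_mw:
        raise ValueError(f"limit {limit_mw} mW outside (0, {max_mw}] mW")
    with open(os.path.join(cdev_dir, "cur_state"), "w") as f:
        f.write(str(limit_mw))
    return limit_mw
```

For example, set_power_limit("cooling_device15", 6) corresponds to the
first echo command above. The sketch does not read cur_state back to
verify the write, because reading cur_state is described above as
returning the current power usage rather than the limit.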


1. Package power limit events are supported by the legacy thermal
reporting mechanism, which uses the local APIC thermal vector to
generate interrupts when targeted P-states are not honored by the HW/FW.
This is tied to machine check reporting. Without RAPL in use, this
notification is a rare exception. When a RAPL power limit is set
artificially low, this notification could result in unwanted interrupts
for each power limit excursion. Therefore, the RAPL driver attempts to
turn off the power limit notification interrupt when the user sets a
power limit.

2. According to the Intel Software Developer's Manual, the RAPL
interface can report max/min power for certain domains. In reality, HW
often reports 0 for max/min power. The RAPL driver works around this by
using the thermal specification power or the current power limit 1 when
max power information is not available. As a result, the max_state of a
RAPL cooling device can be based on thermal spec power or power limit 1.

3. RAPL is backed by FW. In case of FW failure or plain lack of
support, setting a RAPL power limit could fail silently. I don't have a
good solution for that.

4. Data polling starts only when the following items are set
- power limit
- events
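
The max_state fallback described in note 2 can be illustrated with a
short Python sketch. It is not taken from the driver: the function name
is hypothetical, and the priority between thermal spec power and power
limit 1 is an assumption, since the note only says that either may be
used when the reported max power is 0.

```python
def pick_max_power_mw(reported_max_mw, thermal_spec_mw, power_limit1_mw):
    """Return the first nonzero candidate for max_state, in milliwatts.

    Assumed priority: HW-reported max power, then thermal spec power,
    then the current power limit 1. Returns 0 if none is available.
    """
    for candidate in (reported_max_mw, thermal_spec_mw, power_limit1_mw):
        if candidate > 0:
            return candidate
    return 0
```

With HW reporting 0 for max power, pick_max_power_mw(0, 17000, 6000)
selects the thermal spec value, and pick_max_power_mw(0, 0, 6000) falls
back to power limit 1.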

Jacob Pan (1):
Introduce Intel RAPL cooling device driver

drivers/platform/x86/Kconfig | 8 +
drivers/platform/x86/Makefile | 1 +
drivers/platform/x86/intel_rapl.c | 1323 +++++++++++++++++++++++++++++++++++++
drivers/platform/x86/intel_rapl.h | 249 +++++++
4 files changed, 1581 insertions(+)
create mode 100644 drivers/platform/x86/intel_rapl.c
create mode 100644 drivers/platform/x86/intel_rapl.h

