[RFC PATCH v1 0/3] Scaled statistics using APERF/MPERF in x86

From: Vaidyanathan Srinivasan
Date: Mon May 26 2008 - 10:30:28 EST


The following RFC patch series tries to implement scaled CPU utilisation
statistics using the APERF and MPERF MSRs on x86 platforms.

CPU capacity changes significantly when the CPU's frequency is reduced to save
power. By default, applications that run at such lower CPU frequencies are
still charged real CPU time. Had the applications run at full CPU frequency,
they would have finished the work sooner and would not have been charged for
the extra CPU time.

One solution to this problem is to scale the utime and stime entitlement of the
process according to the current CPU frequency. This technique is used in the
powerpc architecture with the help of hardware registers that accurately capture
the entitlement.

On x86 hardware, APERF and MPERF are MSR registers that can provide feedback on
current CPU frequency. Currently these registers are used to detect current CPU
frequency on each core in a multi-core x86 processor where the frequency of the
entire package is changed.
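
As a rough illustration of the feedback these registers give, the sketch below
derives an average frequency ratio from two counter snapshots. It is not code
from this series: struct perf_sample and freq_ratio_permille() are hypothetical
names, and the snapshots are assumed to have been read from the MSRs by the
caller. It relies on APERF advancing at the actual frequency and MPERF at the
maximum frequency.

```c
#include <stdint.h>

/* Hypothetical snapshot of the two counters, as read from the MSRs. */
struct perf_sample {
	uint64_t aperf;	/* ticks counted at the actual running frequency */
	uint64_t mperf;	/* ticks counted at the maximum frequency */
};

/*
 * Average frequency ratio between two snapshots, in parts per 1000
 * (1000 means the CPU ran at full frequency for the whole interval).
 * Assumes the APERF delta is small enough that delta * 1000 fits in
 * 64 bits.
 */
static uint32_t freq_ratio_permille(struct perf_sample prev,
				    struct perf_sample cur)
{
	/* Unsigned subtraction stays correct across a counter wrap. */
	uint64_t da = cur.aperf - prev.aperf;
	uint64_t dm = cur.mperf - prev.mperf;

	if (dm == 0)
		return 1000;	/* no reference ticks elapsed: do not scale */

	return (uint32_t)((da * 1000) / dm);
}
```

A caller would take one snapshot when a task is switched in and another when it
is switched out, then apply the resulting ratio to the time charged for that
interval.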

This patch series demonstrates the idea of scaling utime and stime based on CPU
frequency. The scaled values are exported through the taskstats delay
accounting infrastructure.

Example:

On a two-socket, two-CPU x86 system:
./getdelays -d -l -m0-3

PID 4172


CPU                 count     real total  virtual total    delay total
                    43873      148009250     3368915732       28751295
IO                  count    delay total
                        0              0
MEM                 count    delay total
                        0              0
                    utime          stime
                    40000         108000
             scaled utime   scaled stime          total
                    26676          72032       98714169

The utime/stime and scaled utime/stime are printed in microseconds, while the
totals are in nanoseconds. The CPU was running at 66% of its maximum frequency.

We can observe that scaled utime/stime values are 66% of their normal
accumulated runtime values, and total is 66% of 'real total'.
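
The arithmetic behind that observation can be sketched as follows. This is an
illustration only, not the code in this series: scale_cputime() and
scaled_percent() are hypothetical helpers, and the counter deltas passed to
scale_cputime() are assumed to come from APERF/MPERF readings taken over the
interval being charged.

```c
#include <stdint.h>

/*
 * Scale a CPU time by the ratio of APERF ticks to MPERF ticks seen
 * over the same interval. The result is in the same unit as 't'.
 */
static uint64_t scale_cputime(uint64_t t, uint64_t aperf_delta,
			      uint64_t mperf_delta)
{
	if (mperf_delta == 0)
		return t;	/* nothing to scale against */

	return t * aperf_delta / mperf_delta;
}

/* Scaled value as a truncated integer percentage of the raw value. */
static unsigned int scaled_percent(uint64_t scaled, uint64_t raw)
{
	if (raw == 0)
		return 0;

	return (unsigned int)(scaled * 100 / raw);
}
```

Applied to the figures above, scaled_percent(26676, 40000),
scaled_percent(72032, 108000) and scaled_percent(98714169, 148009250) all
truncate to 66.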

The following output is for a CPU-intensive job running for 10s:

PID 4134


CPU                 count     real total  virtual total    delay total
                       61    10000625000     9807860434              2
IO                  count    delay total
                        0              0
MEM                 count    delay total
                        0              0
                    utime          stime
                 10000000              0
             scaled utime   scaled stime          total
                  9886696              0     9887313918

The ondemand governor was running and took some time to switch to the maximum
frequency. Hence the scaled values are marginally lower than the elapsed
utime.


Limitations:

* RFC patch to communicate just the idea; the implementation may need rework
* Works only for 32-bit x86 hardware
* The MSRs are read and the APERF/MPERF ratio is calculated at every context
  switch, which is very slow
* Hacked cputime_t task_struct->utime to hold 'jiffies * 1000' values just to
  account for fractional jiffies. Since cputime_t is in jiffies on x86, we
  cannot add fractional jiffies at each context switch. The scaled utime/stime
  data types and units need to be converted to microseconds or nanoseconds.
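
To make the fractional-jiffies point concrete, here is a minimal sketch of the
'jiffies * 1000' idea. The names (struct scaled_acct, account_scaled(),
scaled_to_usecs()) are hypothetical and do not come from the patches; the
frequency ratio is assumed to be expressed in parts per 1000, and the tick rate
is passed in as a parameter.

```c
#include <stdint.h>

/* Mirrors the 'jiffies * 1000' hack: 1000 units make up one jiffy. */
#define SCALED_UNITS_PER_JIFFY 1000u

struct scaled_acct {
	uint64_t utimescaled;	/* in 1/1000ths of a jiffy */
};

/*
 * Charge 'jiffies' ticks that ran at ratio_permille/1000 of the
 * maximum frequency. jiffies * 1000 * ratio / 1000 reduces to
 * jiffies * ratio, so no division happens here and the fraction is
 * kept.
 */
static void account_scaled(struct scaled_acct *a, uint32_t jiffies,
			   uint32_t ratio_permille)
{
	a->utimescaled += (uint64_t)jiffies * ratio_permille;
}

/* Convert the accumulated value to microseconds for a tick rate 'hz'. */
static uint64_t scaled_to_usecs(const struct scaled_acct *a, uint32_t hz)
{
	return a->utimescaled * (1000000u / hz) / SCALED_UNITS_PER_JIFFY;
}
```

With whole-jiffy units, charging one jiffy at a ratio of 666 would truncate to
zero each time (1 * 666 / 1000). With the finer unit, three such charges add up
to 1998 units, just under two jiffies.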


ToDo:

* Compute the scaling ratio per package, and only at each frequency switch
-- Notify all affected CPUs of the frequency change
* Use a more accurate time unit for x86 scaled utime and stime

Signed-off-by: Vaidyanathan Srinivasan <svaidy@xxxxxxxxxxxxxxxxxx>

---

Vaidyanathan Srinivasan (3):
Print scaled utime and stime in getdelays
Make calls to account_scaled_stats
General framework for APERF/MPERF access and accounting


Documentation/accounting/getdelays.c | 13 ++
arch/x86/kernel/cpu/cpufreq/acpi-cpufreq.c | 21 +++
arch/x86/kernel/process_32.c | 8 +
arch/x86/kernel/time_32.c | 171 ++++++++++++++++++++++++++++
include/linux/hardirq.h | 4 +
kernel/delayacct.c | 7 +
kernel/timer.c | 2
kernel/tsacct.c | 10 +-
8 files changed, 225 insertions(+), 11 deletions(-)

--
Vaidyanathan Srinivasan,
Linux Technology Center,
IBM India Systems and Technology Labs.
