Re: [PATCH] perf events: Add stalled cycles generic event - PERF_COUNT_HW_STALLED_CYCLES

From: Arun Sharma
Date: Sun Apr 24 2011 - 02:22:11 EST

Next message: Paul E. McKenney: "Re: linux-next: Tree for April 14 (Call-traces: RCU/ACPI/WQrelated?)"
Previous message: KOSAKI Motohiro: "Re: [patch] m68k, mm: set all online nodes in N_NORMAL_MEMORY"
Next in thread: Ingo Molnar: "Re: [PATCH] perf events: Add stalled cycles generic event -PERF_COUNT_HW_STALLED_CYCLES"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On Sat, Apr 23, 2011 at 10:14:09PM +0200, Ingo Molnar wrote:
>
> The new PERF_COUNT_HW_STALLED_CYCLES event tries to approximate
> cycles the CPU does nothing useful, because it is stalled on a
> cache-miss or some other condition.

Conceptually looks fine. I'd prefer a more precise name such as:
PERF_COUNT_EXECUTION_STALLED_CYCLES (to differentiate from frontend or
retirement stalls).

In the example below:

==> foo.c <==

foo()
{
}

bar()
{
}

==> test.c <==
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

#define FNV_PRIME_32 16777619
#define FNV_OFFSET_32 2166136261U
uint32_t hash1(const char *s)
{
uint32_t hash = FNV_OFFSET_32, i;
for(i = 0; i < 4; i++)
{
hash = hash ^ (s[i]); // xor next byte into the bottom of the hash
hash = hash * FNV_PRIME_32; // Multiply by prime number found to work well
}
return hash;
}

#define FNV_PRIME_WEAK_32 100
#define FNV_OFFSET_WEAK_32 200

uint32_t hash2(const char *s)
{
uint32_t hash = FNV_OFFSET_WEAK_32, i;
for(i = 0; i < 4; i++)
{
hash = hash ^ (s[i]); // xor next byte into the bottom of the hash
hash = hash * FNV_PRIME_WEAK_32; // Multiply by prime number found to work well
}
return hash;
}

int main()
{
int r = random();

while(1) {
r++;
#ifdef HARD
if (hash1((const char *) &r) & 0x500)
#else
if (hash2((const char *) &r) & 0x500)
#endif
foo();
else
bar();
}
}
==> Makefile <==

all:
gcc -O2 test.c foo.c -UHARD -o test.easy
gcc -O2 test.c foo.c -DHARD -o test.hard

# perf stat -e cycles,instructions ./test.hard
^C
Performance counter stats for './test.hard':

3,742,855,848 cycles
4,179,896,309 instructions # 1.117 IPC

1.754804730 seconds time elapsed

# perf stat -e cycles,instructions ./test.easy
^C
Performance counter stats for './test.easy':

3,932,887,528 cycles
8,994,416,316 instructions # 2.287 IPC

1.843832827 seconds time elapsed

i.e. fixing the branch mispredicts could result in
a nearly 2x speedup for the program.

Looking at:

# perf stat -e cycles,instructions,branch-misses,cache-misses,RESOURCE_STALLS:ANY ./test.hard
^C
Performance counter stats for './test.hard':

3,413,713,048 cycles (scaled from 69.93%)
3,772,611,782 instructions # 1.105 IPC (scaled from 80.01%)
51,113,340 branch-misses (scaled from 80.01%)
12,370 cache-misses (scaled from 80.02%)
26,656,983 RESOURCE_STALLS:ANY (scaled from 69.99%)

1.626595305 seconds time elapsed

it's hard to spot the opporunity. On the other hand:

# ./analyze.py
Percent idle: 27%
Retirement Stalls: 82%
Backend Stalls: 0%
Frontend Stalls: 62%
Instruction Starvation: 62%
icache stalls: 0%

does give me a signal about where to look. The script below is
a quick and dirty hack. I haven't really validated it with
many workloads. I'm posting it here anyway hoping that it'd
result in better kernel support for these types of analyses.

Even if we cover this with various generic PERF_COUNT_*STALL events,
we'll still have a need for other events:

* Things that give info about instruction mixes.

Ratio of {loads, stores, floating point, branches, conditional branches}
to total instructions.

* Activity related to micro architecture specific caches

People using -funroll-loops may have a significant performance opportunity.
But it's hard to spot bottlenecks in the instruction decoder.

* Monitoring traffic on Hypertransport/QPI links

Like you observe, most people will not look at these events, so
focusing on getting the common events right makes sense. But I
still like access to all events (either via a mapping file or
a library such as libpfm4). Hiding them in "perf list" sounds
like a reasonable way of keeping complexity out.

-Arun

PS: branch-misses:pp was spot on for the example above.

#!/usr/bin/env python

from optparse import OptionParser
from itertools import izip, chain, repeat
from subprocess import Popen, PIPE
import re, struct

def grouper(n, iterable, padvalue=None):
"grouper(3, 'abcdefg', 'x') --> ('a','b','c'), ('d','e','f'), ('g','x','x')"
return izip(*[chain(iterable, repeat(padvalue, n-1))]*n)

counter_re = re.compile('\s+(?P<count>\d+)\s+(?P<event>\S+)')
def sample(events):
cmd = 'perf stat --no-big-num -a'
ncounters = 4
groups = grouper(ncounters, events)
for g in groups:
# filter padding
g = [ e for e in g if e ]
cmd += ' -e ' + ','.join(g)
cmd += ' -- sleep ' + str(options.time)
process = Popen(cmd, shell=True, stdout=PIPE, stderr=PIPE)
out, err = process.communicate()
ret = process.poll()
if ret: raise "Perf failed: " + err
ret = {}
for line in err.split('\n'):
m = counter_re.match(line)
if not m: continue
ret[m.group('event')] = long(m.group('count'))
return ret

def measure_cycles():
# disable C-states
f = open("/dev/cpu_dma_latency", "wb")
f.write(struct.pack("i", 0))
f.flush()
saved = options.time
options.time = 1 # one sec is sufficient to measure clock
cycles = sample(["cycles"])['cycles']
cycles /= options.time
f.close()
options.time = saved
return cycles

if __name__ == '__main__':
parser = OptionParser()
parser.add_option("-t", "--time", dest="time", default=1,
help="How long to sample events")
parser.add_option("-q", "--quiet",
action="store_false", dest="verbose", default=True,
help="don't print status messages to stdout")

(options, args) = parser.parse_args()
cycles_per_sec = measure_cycles()
c = sample(["cycles", "instructions", "UOPS_ISSUED:ANY:c=1", "UOPS_ISSUED:ANY:c=1:t=1",
"RESOURCE_STALLS:ANY", "UOPS_RETIRED:ANY:c=1:t=1",
"UOPS_EXECUTED:PORT015:t=1", "UOPS_EXECUTED:PORT234_CORE",
"UOPS_ISSUED:ANY:t=1", "UOPS_ISSUED:FUSED:t=1", "UOPS_RETIRED:ANY:t=1",
"L1I:CYCLES_STALLED"])
cycles = c["cycles"] * 1.0
cycles_no_uops_issued = cycles - c["UOPS_ISSUED:ANY:c=1:t=1"]
cycles_no_uops_retired = cycles - c["UOPS_RETIRED:ANY:c=1:t=1"]
backend_stall_cycles = c["RESOURCE_STALLS:ANY"]
icache_stall_cycles = c["L1I:CYCLES_STALLED"]

# Cycle stall accounting
print "Percent idle: %d%%" % ((1 - cycles/(int(options.time) * cycles_per_sec)) * 100)
print "\tRetirement Stalls: %d%%" % ((cycles_no_uops_retired / cycles) * 100)
print "\tBackend Stalls: %d%%" % ((backend_stall_cycles / cycles) * 100)
print "\tFrontend Stalls: %d%%" % ((cycles_no_uops_issued / cycles) * 100)
print "\tInstruction Starvation: %d%%" % (((cycles_no_uops_issued - backend_stall_cycles)/cycles) * 100)
print "\ticache stalls: %d%%" % ((icache_stall_cycles/cycles) * 100)

# Wasted work
uops_executed = c["UOPS_EXECUTED:PORT015:t=1"] + c["UOPS_EXECUTED:PORT234_CORE"]
uops_retired = c["UOPS_RETIRED:ANY:t=1"]
uops_issued = c["UOPS_ISSUED:ANY:t=1"] + c["UOPS_ISSUED:FUSED:t=1"]

print "\tPercentage useless uops: %d%%" % ((uops_executed - uops_retired) * 100.0/uops_retired)
print "\tPercentage useless uops issued: %d%%" % ((uops_issued - uops_retired) * 100.0/uops_retired)
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Next message: Paul E. McKenney: "Re: linux-next: Tree for April 14 (Call-traces: RCU/ACPI/WQrelated?)"
Previous message: KOSAKI Motohiro: "Re: [patch] m68k, mm: set all online nodes in N_NORMAL_MEMORY"
Next in thread: Ingo Molnar: "Re: [PATCH] perf events: Add stalled cycles generic event -PERF_COUNT_HW_STALLED_CYCLES"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]