[RFC][PATCH 0/9] perf trace: support for general-purpose scripting

From: Tom Zanussi
Date: Tue Oct 06 2009 - 02:11:32 EST


Hi,

This patchset defines a generic interface for processing the binary
output of 'perf trace' and making it directly available to
user-defined scripts written in general-purpose scripting languages
such as Perl or Python. It also builds a general-purpose Perl
scripting trace processor on top of the new interface and provides a
few example scripts that exercise the basic functionality.

The main motivation behind it is to provide a more efficient and
powerful alternative to the popular method of parsing the ascii trace
output in order to extract useful information from it. To avoid the
overhead and complexity of all that, this patchset provides a
direct-to-script-interpreter pathway for doing the same thing, but in
a more regularized fashion, one that takes advantage of all the event
meta-info provided by the tracing infrustructure, such as the
event/field info contained in the 'format files' designed for that
purpose.

Anyone who still wants to deal with the trace stream as straight ascii
text can still have the scripts generate it (and in fact the interface
provides a way to have event printing code auto-generated. Also, the
original ascii output mode is preserved when the scripting options
aren't used), but for anyone wanting to do something more interesting
with the data using a rapid prototyping language such as Perl, this
patchset makes it easy to do so. It allows the full power of
general-purpose scripting languages to be applied to the trace stream
to do non-trivial analyses, and suddenly makes available huge
libraries of useful tools and modules (e.g. CPAN for Perl) to be
applied to the problem of creating new and interesting trace
applications.

This patchset only implements a Perl interface, but the intention is
to make it relatively easy to add support for other languages such as
Python, Ruby, etc, etc - all they need to do is follow the example of
the Perl implementation and provide their own implementations of the
trace_scripting_operations (more details below).

It's loosely based on the 'zedtrace' tracer I posted awhile back and
still has some rough edges, but it should be useful enough to try out
and get something useful out of.

Known problems/shortcomings:

Probably the biggest problem right now is the sorting hack I added as
the last patch. It's just meant as a temporary thing, but is there
because tracing scripts in general want to see events in the order
they happened i.e. timestamp order. Considering that all this would
probably be much more useful if the script handling could happen in
essentially real time i.e. in 'live-mode', some work probably needs to
be done to create a sort that can work on small batches of events, as
they come in (and making sure no older ones can sneak in after a batch
has been processed).

For the Perl implemention I decided not to pass all of the available
event fields to the script handlers, only the ones that I thought
would be commonly used. So for example, common_flags and
common_preempt_count aren't passed. The idea would be that these
unpassed fields should still be available to the event handler, but
via a library function available to the script. This would also be
true for any other metadata it might be useful to give scripts access
to. It should be easy to do, but I'll have to do some more looking
into the Perl->C interface.

Even with some of the simple examples provided here, its apparent that
the recording code needs to be able to specify that some filtering be
done in the kernel. Maybe have a higher-level command that invokes
both 'perf record', passing it filters defined in a 'perf trace'
script and piping the binary output to 'perf trace' instead of writing
it to a file.

Anyway, here are some simple examples to get started...

First, you should build and install perf after installing the perl
development library, libperl (e.g. apt-get install liberl-dev in
ubuntu).

Having done that, you can run 'perf record' with a set of tracepoint
events you're interested in e.g.:

root@tropicana:~# perf record -c 1 -f -a -R -e kmem:kmalloc -e kmem:kfree -e irq:softirq_entry -e irq:softirq_exit -e sched:sched_switch
^C[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.485 MB perf.data (~21198 samples) ]

Next, you can use the -p -g options of perf trace to have it generate
a script with one handler function for each event in the trace:

root@tropicana:~# perf trace -p -g

The name of the generated script will be 'perf-trace.pl':

root@tropicana:~# ls -al *.pl
-rw-r--r-- 1 root root 2904 2009-10-04 20:24 perf-trace.pl

Here's a portion of the Perl code generated for this trace, to give
some idea of what a trace script looks like:

# perf trace event handlers, generated by perf trace -p -g

# The common_* event handler fields are the most useful fields common to
# all events. They don't necessarily correspond to the 'common_*' fields
# in the status files. Those fields not available as handler params can
# be retrieved via script functions of the form get_common_*().

use lib "$ENV{'PERF_EXEC_PATH'}/scripts/perl/Perf-Trace-Util/lib";
use lib "./Perf-Trace-Util/lib";
use Perf::Trace::Core;
use Perf::Trace::Util;

sub trace_begin
{
# optional
}

sub trace_end
{
# optional
}

sub irq::softirq_entry
{
my ($event_name, $common_cpu, $common_secs, $common_nsecs, $common_pid,
$common_comm, $vec) = @_;

print_header($event_name, $common_cpu, $common_secs, $common_nsecs,
$common_pid, $common_comm);

printf("vec=%s\n", symbol_str("irq::softirq_entry", "vec", $vec));
}

sub kmem::kfree
{
my ($event_name, $common_cpu, $common_secs, $common_nsecs, $common_pid,
$common_comm, $call_site, $ptr) = @_;

print_header($event_name, $common_cpu, $common_secs, $common_nsecs,
$common_pid, $common_comm);

printf("call_site=%u, ptr=%u\n", $call_site, $ptr);
}

sub kmem::kmalloc
{
my ($event_name, $common_cpu, $common_secs, $common_nsecs, $common_pid,
$common_comm,
$call_site, $ptr, $bytes_req, $bytes_alloc, $gfp_flags) = @_;

print_header($event_name, $common_cpu, $common_secs, $common_nsecs,
$common_pid, $common_comm);

printf("call_site=%u, ptr=%u, bytes_req=%u, bytes_alloc=%u, gfp_flags=%s\n",
$call_site, $ptr, $bytes_req, $bytes_alloc,
flag_str("kmem::kmalloc", "gfp_flags", $gfp_flags));
}
.
.
.

For each event in the trace, the generated event handlers will print
the event name along with all the fields for the event. To have the
script do that, run perf trace with the -p -s options:

root@tropicana:~# perf trace -p -s perf-trace.pl

The output should look something like this:

kmem::kmalloc 1 04751.387886420 7249 perf call_site=18446744071579906539, ptr=18446612133245372416, bytes_req=640, bytes_alloc=1024, gfp_flags=GFP_WAIT | GFP_IO | GFP_ZERO
kmem::kmalloc 1 04751.387947149 7249 perf call_site=18446744071579905993, ptr=18446612134690881536, bytes_req=1112, bytes_alloc=2048, gfp_flags=GFP_WAIT | GFP_IO | GFP_ZERO
irq::softirq_entry 0 04751.388019731 0 swapper vec=TIMER
irq::softirq_exit 0 04751.388025089 0 swapper vec=TIMER
sched::sched_switch 1 04751.422756873 7249 perf prev_comm=perf, prev_pid=7249, prev_prio=120, prev_state=R, next_comm=apache2, next_pid=5659, next_prio=120
sched::sched_switch 1 04751.422783231 5659 apache2 prev_comm=apache2, prev_pid=5659, prev_prio=120, prev_state=S, next_comm=perf, next_pid=7249, next_prio=120
.
.
.

Notice the gfp_flags and vec fields in the previous trace snippet.
Not only does the generated script display the normal string and
numeric fields, but it also automatically figures out that certain
fields, though numeric, should be interpreted as flags or symbols
instead, and generates the code to do that (using the flag and
symbolic enumerations defined in the event's format file).

Running the autogenerated script and displaying the printed events is
only marginally useful; at the least it can be used to see what's
there and verify that things basically work.

One of the other things that might be marginally useful might be to
remove all the handlers from the generated script and replace it with
a single handler for the 'unhandled_trace' event which is available to
every script. Here's a complete script for doing that:

# perf trace event handlers, generated by perf trace -p -g

# The common_* event handler fields are the most useful fields common to
# all events. They don't necessarily correspond to the 'common_*' fields
# in the status files. Those fields not available as handler params can
# be retrieved via script functions of the form get_common_*().

use lib "$ENV{'PERF_EXEC_PATH'}/scripts/perl/Perf-Trace-Util/lib";
use lib "./Perf-Trace-Util/lib";
use Perf::Trace::Core;
use Perf::Trace::Util;

my %unhandled;

sub trace_end
{
print_unhandled();
}

sub trace_unhandled
{
my ($event_name, $common_cpu, $common_secs, $common_nsecs, $common_pid,
$common_comm) = @_;

$unhandled{$event_name}++;
}

sub print_unhandled
{
if ((scalar keys %unhandled) == 0) {
return;
}

print "\nunhandled events:\n\n";

printf("%-40s %10s\n", "event", "count");
printf("%-40s %10s\n", "----------------------------------------",
"-----------");

foreach my $event_name (keys %unhandled) {
printf("%-40s %10d\n", $event_name, $unhandled{$event_name});
}
}

For every unhandled event (every event for this script, since we
removed all the real handlers), the total counts for each are hashed
by event name and printed out at the end of the trace (in the
'trace_end' handler, which can also be defined for each script.
Here's some example output for this script:

root@tropicana:~# perf trace -p -s perf-trace-unhandled.pl
perf trace started with Perl script perf-trace.pl


unhandled events:

event count
---------------------------------------- -----------
kmem::kfree 927
irq::softirq_entry 288
kmem::kmalloc 929
sched::sched_switch 222
irq::softirq_exit 288

perf trace Perl script stopped

The main point of being able to script the trace stream, however, is
to use the power of the scripting language to do more useful analyses.
One simple example would be to display the total r/w activity for all
processes on the system:

root@tropicana:~# perf record -c 1 -f -a -R -e syscalls:sys_enter_read -e syscalls:sys_exit_read -e syscalls:sys_enter_write -e syscalls:sys_exit_write
^C[ perf record: Woken up 0 times to write data ]
[ perf record: Captured and wrote 8.540 MB perf.data (~373127 samples) ]

root@tropicana:~# perf trace -p -s libexec/perf-core/scripts/perl/rw-by-pid.pl
perf trace started with Perl script libexec/perf-core/scripts/perl/rw-by-pid.pl

read counts by pid:

pid comm # reads bytes_requested bytes_read
------ -------------------- ----------- ---------- ----------
8565 perf 1886 1928288 1683584
5520 Xorg 54 1296 1296
6242 gvfsd-trash 2 4096 194
6637 wterm 7 28672 64
6225 gnome-screensav 8 32768 64
6202 nautilus 5 16384 48
6169 gnome-settings- 3 12288 32
6200 gnome-panel 1 4096

failed reads by pid:

pid comm error # # errors
------ -------------------- ------ ----------
6202 nautilus -11 4
6169 gnome-settings- -11 2
6200 gnome-panel -11 1
6637 wterm -11 5
8565 perf 0 144
6225 gnome-screensav -11 6
6242 gvfsd-trash -11 1

write counts by pid:

pid comm # writes bytes_written
------ -------------------- ----------- ----------
8565 perf 122604 20204544
6242 gvfsd-trash 2 93
8566 gvfsd-trash 1 35
6637 wterm 1 1

failed writes by pid:

pid comm error # # errors
------ -------------------- ------ ----------

perf trace Perl script stopped

The above output shows sorted tables of reads and write activity, and
also tracks counts of each different read/write return error seen by
each process.

We can use the results of a higher-level trace analysis like the one
above to create other scripts that drill down to get more detailed
output. For example, the above results show that perf itself was
responsible for a lot of reading and writing. To see more detail
about which files it's reading from/writing to, we can write and run
another script:

root@tropicana:~# perf trace -p -s libexec/perf-core/scripts/perl/rw-by-file.pl
perf trace started with Perl script libexec/perf-core/scripts/perl/rw-by-file.pl

file read counts for perf:

fd # reads bytes_requested
------ ---------- -----------
19 1883 1928192
3 22 90112
30 2 4096
13 2 4096
12 54 1296
16 1 32
17 1 32
15 1 32

file write counts for perf:

fd # writes bytes_written
------ ---------- -----------
3 122604 20204544
1 3 128
4 1 1

unhandled events:

event count
---------------------------------------- -----------
syscalls::sys_exit_read 1965
syscalls::sys_exit_write 122608

perf trace Perl script stopped

>From the above, we see that most of the reads are coming from fd 19
and most of the writes are going to fd 3. At this point, we don't
have any further information about what those files actually are - to
do that we'd need to be able to get it from either the open() system
call tracepoint (preferably, but the filename isn't yet available from
that source) or from reading /proc during the trace.

Actually, we probably don't really event want to see the events from
perf itself, so should also have a way to filter out anything from
perf in the kernel, another todo item.

Another simple example could be to use a script to track wakeup
latencies:

root@tropicana:~# perf record -c 1 -f -a -R -e sched:sched_switch -e sched:sched_wakeup
^C[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.375 MB perf.data (~16379 samples) ]

root@tropicana:~# perf trace -p -s libexec/perf-core/scripts/perl/wakeup-latency.pl

perf trace started with Perl script libexec/perf-core/scripts/perl/wakeup-latency.pl

wakeup_latency stats:

total_wakeups: 323
avg_wakeup_latency (ns): 12844
min_wakeup_latency (ns): 2229
max_wakeup_latency (ns): 360072

perf trace Perl script stopped

This script tracks the time differences between sched_wakeup and
sched_switch events and prints out the totals when the script is done.

One of the nice things about using trace data to do this kind of
analysis is that the individual trace records can be displayed at any
time, so for instance if one was interested in seeing exactly which
events caused the min an max latencies displayed in the output, the
individual events could be viewed by simply uncommenting the
auto-generated print statements.

See the wakeup-latency.pl script for details.

Finally, here's another example that displays workqueue stats for a
trace run:

root@tropicana:~# perf record -c 1 -f -a -R -e workqueue:workqueue_creation -e workqueue:workqueue_destruction -e workqueue:workqueue_execution -e workqueue:workqueue_insertion
^C[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.312 MB perf.data (~13631 samples) ]

root@tropicana:~# perf trace -p -s libexec/perf-core/scripts/perl/workqueue-stats.pl
perf trace started with Perl script libexec/perf-core/scripts/perl/workqueue-stats.pl

workqueue work stats:

cpu ins exec name
--- --- ---- ----
0 99 99 kondemand/0
0 33 33 events/0
0 10 10 ata/0
1 30 30 kblockd/1
1 102 102 ata/1
1 35 35 events/1

workqueue lifecycle stats:

cpu created destroyed name
--- ------- --------- ----

perf trace Perl script stopped

See the workqueue-stats.pl script for details.


Adding support for a new language
---------------------------------

It should be relatively easy to add support for a new language,
especially if the language implementation supports an interface
allowing an interpreter to be 'embedded' inside another program (in
this case the containing program will be 'perf trace') and have
user-defined script-level functions invoked with the parameters
associated with an event. The event and field type information
exported by the event tracing infrastructure (via the event 'format'
files) should be enough to parse and send any piece of trace data to
the user script. The easiest way to see how this can be done would be
to look at the Perl implementation contained in
perf/util/trace-event-perl.c/.h.

The minimum that needs to be done do add support for a new language is
to implement the trace_scripting_operations interface:

start_script()
stop_script()
process_event()
generate_event_handlers()

start_script() is called before any events are processed, and is meant
to give the scripting language support an opportunity to set things up
to receive events e.g. create and initialize an instance of a language
interpreter.

stop_script() is called after all events are processed, and is meant
to give the scripting language support an opportunity to clean up
e.g. destroy the interpreter instance, etc.

process_event() is called once for each event and takes as its main
parameter a pointer to the binary trace event record to be processed.
The implementation is responsible for picking out the binary fields
from the event record and sending them to the script handler function
associated with that event e.g. a function derived from the event name
it's meant to handle e.g. 'sched::sched_switch()'. It's probably not
as much work as you'd think to parse the binary data and map it into a
form usable by a given scripting language - on the 'perf trace' side
you have extensive information about the field data types and on the
scripting language side you probably only have a few data types to
worry about, since dynamically typed languages are typically pretty
loose about typing e.g. for the Perl implementation, only 3
type-related interpreter functions were used, one each for signed,
unsigned, and string variables.

generate_event_handlers() should generate a ready-to-run script for
the current set of events in the trace, preferably with bodies that
print out every fields for every event. Again, look at the Perl
implementation for clues as to how that can be done. This is an
optional, but very useful op.

There are a couple of other things that aren't covered by the
trace_scripting_operations because different implementations might
want to do things differently, but that should be implemented anyway.
One of these is support for 'flag' and 'symbolic' fields e.g. being
able to use more human-readable values such as 'GFP_KERNEL' or
HI/BLOCK_IOPOLL/TASKLET in place of raw flag values. See the Perl
implementation to see how this might be done. Again, it shouldn't be
too painful if you can follow that example.

The final thing would be to provide a library of functions that script
writers could call to get trace meta-information or unpassed fields,
but even the Perl implementation hasn't gotten to that yet, so there's
nothing to look at there...

Thanks,

Tom


Tom Zanussi (9):
tracing/events: Add 'signed' field to format files
perf trace: Add subsystem string to struct event
perf trace: Add string/dynamic cases to format_flags
perf trace: Add trace scripting ops
perf trace: Add Perl scripting support
perf trace: Add scripting op for generating empty event handling
scripts
perf trace: Add FIELD_IS_FLAG/SYMBOLIC cases to format_flags
perf trace: Add perf trace scripting support modules for Perl
perf trace: Add throwaway timestamp sorting

include/trace/ftrace.h | 15 +-
kernel/trace/ring_buffer.c | 15 +-
kernel/trace/trace_events.c | 24 +-
kernel/trace/trace_export.c | 25 +-
kernel/trace/trace_syscalls.c | 20 +-
tools/perf/Makefile | 25 +
tools/perf/builtin-trace.c | 610 +++++++++++++++++++-
tools/perf/scripts/perl/Perf-Trace-Util/Changes | 6 +
tools/perf/scripts/perl/Perf-Trace-Util/MANIFEST | 6 +
.../perf/scripts/perl/Perf-Trace-Util/Makefile.PL | 12 +
tools/perf/scripts/perl/Perf-Trace-Util/README | 30 +
.../perl/Perf-Trace-Util/lib/Perf/Trace/Core.pm | 165 ++++++
.../perl/Perf-Trace-Util/lib/Perf/Trace/Util.pm | 97 +++
.../perl/Perf-Trace-Util/t/Perf-Trace-Util.t | 15 +
tools/perf/scripts/perl/rw-by-file.pl | 95 +++
tools/perf/scripts/perl/rw-by-pid.pl | 151 +++++
tools/perf/scripts/perl/wakeup-latency.pl | 94 +++
tools/perf/scripts/perl/workqueue-stats.pl | 121 ++++
tools/perf/util/trace-event-parse.c | 87 +++-
tools/perf/util/trace-event-perl.c | 514 +++++++++++++++++
tools/perf/util/trace-event-perl.h | 37 ++
tools/perf/util/trace-event.h | 23 +-
22 files changed, 2135 insertions(+), 52 deletions(-)
create mode 100644 tools/perf/scripts/perl/Perf-Trace-Util/Changes
create mode 100644 tools/perf/scripts/perl/Perf-Trace-Util/MANIFEST
create mode 100644 tools/perf/scripts/perl/Perf-Trace-Util/Makefile.PL
create mode 100644 tools/perf/scripts/perl/Perf-Trace-Util/README
create mode 100644 tools/perf/scripts/perl/Perf-Trace-Util/lib/Perf/Trace/Core.pm
create mode 100644 tools/perf/scripts/perl/Perf-Trace-Util/lib/Perf/Trace/Util.pm
create mode 100644 tools/perf/scripts/perl/Perf-Trace-Util/t/Perf-Trace-Util.t
create mode 100644 tools/perf/scripts/perl/rw-by-file.pl
create mode 100644 tools/perf/scripts/perl/rw-by-pid.pl
create mode 100644 tools/perf/scripts/perl/wakeup-latency.pl
create mode 100644 tools/perf/scripts/perl/workqueue-stats.pl
create mode 100644 tools/perf/util/trace-event-perl.c
create mode 100644 tools/perf/util/trace-event-perl.h

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/