Re: [RFC][PATCH v2 06/11] perf: core, export pmus via sysfs
From: Peter Zijlstra
Date: Wed May 19 2010 - 03:15:35 EST
On Tue, 2010-05-18 at 19:48 -0700, Greg KH wrote:
> Again, why do you need/want anything in sysfs in the first place?
> What problem is it going to solve? Who is going to benefit? Why do
> they care? What is this whole thing about?
OK, so all of this is about perf_event. The story starts with CPUs
adding a PMU (Performance Monitor Unit) which allows the user to
count/sample cpu state.
The whole perf_counter subsystem was created to abstract this piece of
hardware and provide a kernel interface to it.
Then we realized that a generalization of the PMU exists in pretty much
everything that generates 'events' of interest and so we started adding
software PMUs that allowed us to do the same for tracepoints etc.
So we ended up with perf_events. A subsystem dedicated to counting
events and event based sampling.
Now the problem this patch set tries to solve: more hardware than the
CPU has such capabilities. There are memory controllers, bus controllers
and devices with similar capabilities.
So we need a way to identify and locate these things, and since sysfs
has the full machine topology in it, the idea was to represent these
things in sysfs as an event_source class.
Since the CPU and memory controllers are (assumed) symmetric on the
system, we get to add things like:
/sys/devices/system/cpu/cpu_event_source/
/sys/devices/system/node/node_event_source/
Devices like GPUs can do:
/sys/devices/pci0000:00/0000:00:01.0/0000:01:00.0/radeon_event_source/
Hooking them into sysfs at the proper device/machine topology location
allows us to quickly locate and identify these 'event_sources'.
Since all hardware wants to keep life interesting, they all work
differently, and PMUs are no exception: they count different things,
have different ways to program them etc. But for each class there is a
useful subset of things that is pretty uniform.
CPU based PMUs can all count things like clock-cycles and instructions;
memory controllers can count things like local/remote memory accesses
etc.
So each class has a number of actual events that are worthy of
abstracting. The idea was to place these events in the event_source,
like:
/sys/devices/system/cpu/cpu_event_source/cycles/
/sys/devices/system/cpu/cpu_event_source/instructions/
And then there are the software event_sources that expose kernel events
(through tracepoints). Currently tracepoints live
in /debug/tracing/events/ (or /sys/kernel/debug/tracing/events/ for
those so inclined), but the above abstraction would suggest we expose
them similarly.
I'm not sure where we'd want them to live. We could add them to:
/sys/kernel/tracepoint_event_source/
and have them live there, but I'm open to alternatives :-)
[ With event_source's being a sysfs-class, we also get a nice flat
collection in /sys/class/event_source/ helping those who get lost
in the device topology, me :-) ]
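To make the "flat collection" point concrete, here is a rough userspace sketch that walks a class directory and visits every entry in it. It assumes the proposed /sys/class/event_source/ layout from this mail; the helper names for_each_event_source() and print_source() are made up for illustration and are not part of the patch set.

```c
#include <dirent.h>
#include <stdio.h>
#include <string.h>

/*
 * Walk a class directory (for example the proposed
 * /sys/class/event_source/) and call cb(name, data) for every entry,
 * skipping "." and "..".
 * Returns the number of entries visited, or -1 if the directory
 * cannot be opened.
 */
int for_each_event_source(const char *class_dir,
                          void (*cb)(const char *name, void *data),
                          void *data)
{
    DIR *dir = opendir(class_dir);
    struct dirent *de;
    int count = 0;

    if (!dir)
        return -1;

    while ((de = readdir(dir)) != NULL) {
        if (!strcmp(de->d_name, ".") || !strcmp(de->d_name, ".."))
            continue;
        if (cb)
            cb(de->d_name, data);
        count++;
    }

    closedir(dir);
    return count;
}

/* Example callback: print one event source name per line. */
void print_source(const char *name, void *data)
{
    (void)data;
    printf("%s\n", name);
}
```

A tool would call for_each_event_source("/sys/class/event_source", print_source, NULL) and treat a return of -1 as "this kernel does not export the class".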
The next issue seems to be the interface between this sysfs
representation and the perf_event syscall: how do we go about creating
an actual perf_event object from this rich sysfs event_source class
object?
The sys_perf_event_open() call takes a struct perf_event_attr pointer
which describes the event and its properties. The current event
classification goes through:
struct perf_event_attr {
	__u32 type;
	__u64 config;
	...
};
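For reference, this is roughly how userspace fills in those two fields today. It is a minimal sketch, assuming a Linux system with <linux/perf_event.h> installed; glibc has no wrapper for the syscall, so the perf_event_open() function below is a local helper around syscall(2), and open_counter() is a name invented for this example.

```c
#include <linux/perf_event.h>
#include <string.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Local helper: glibc does not provide a perf_event_open() wrapper. */
long perf_event_open(struct perf_event_attr *attr, pid_t pid, int cpu,
                     int group_fd, unsigned long flags)
{
    return syscall(SYS_perf_event_open, attr, pid, cpu, group_fd, flags);
}

/*
 * Open a counting event for the calling thread on any CPU.
 * type/config select the event, e.g.
 * PERF_TYPE_HARDWARE / PERF_COUNT_HW_INSTRUCTIONS.
 * Returns an event fd, or -1 with errno set (for example when the
 * caller lacks permission or the event is not supported).
 */
int open_counter(__u32 type, __u64 config)
{
    struct perf_event_attr attr;

    memset(&attr, 0, sizeof(attr));
    attr.size = sizeof(attr);
    attr.type = type;
    attr.config = config;
    attr.exclude_kernel = 1;   /* count user space only */
    attr.exclude_hv = 1;

    return (int)perf_event_open(&attr, 0, -1, -1, 0);
}
```

Whether the open succeeds depends on the machine and on perf_event_paranoid, so callers need to handle -1.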
So my initial idea was to let each event_source have a type_id and let
each of its events have a config field and read those and insert them in
your structure.
So we'd get:
/sys/devices/system/cpu/cpu_event_source/type_id
/sys/devices/system/cpu/cpu_event_source/instructions/config
cat those to get: .type = 0, .config = 1
(PERF_TYPE_HARDWARE:PERF_COUNT_HW_INSTRUCTIONS).
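In C the "cat those" step could look something like the sketch below. It assumes the type_id and <event>/config files proposed above each hold a single integer; struct event_id and the helpers parse_u64(), read_u64_file() and lookup_event() are hypothetical names for this example only. The two values would then be copied into perf_event_attr.type and perf_event_attr.config.

```c
#include <errno.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical holder; maps onto perf_event_attr.type and .config. */
struct event_id {
    uint32_t type;
    uint64_t config;
};

/* Parse one unsigned integer (decimal or 0x-prefixed hex) from text. */
int parse_u64(const char *text, uint64_t *out)
{
    char *end;
    unsigned long long v;

    errno = 0;
    v = strtoull(text, &end, 0);
    if (errno || end == text)
        return -1;
    *out = v;
    return 0;
}

/* Read the first line of a sysfs-style attribute file as a u64. */
int read_u64_file(const char *path, uint64_t *out)
{
    char buf[64];
    FILE *f = fopen(path, "r");

    if (!f)
        return -1;
    if (!fgets(buf, sizeof(buf), f)) {
        fclose(f);
        return -1;
    }
    fclose(f);
    return parse_u64(buf, out);
}

/* Look up <source_dir>/type_id and <source_dir>/<event>/config. */
int lookup_event(const char *source_dir, const char *event,
                 struct event_id *id)
{
    char path[4096];
    uint64_t type, config;

    snprintf(path, sizeof(path), "%s/type_id", source_dir);
    if (read_u64_file(path, &type) || type > UINT32_MAX)
        return -1;

    snprintf(path, sizeof(path), "%s/%s/config", source_dir, event);
    if (read_u64_file(path, &config))
        return -1;

    id->type = (uint32_t)type;
    id->config = config;
    return 0;
}
```

That is two opens, two reads and two closes per event before perf_event_open() is even called.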
Then Ingo objected and said that if we need to open and read those
files, we might as well just open one file and pass the fd along, which
saves some syscalls.
So you'd end up doing:
fd = open("/sys/devices/system/cpu/cpu_event_source/instructions/config", O_RDONLY);
attr->type = fd | PERF_TYPE_FD;
event_fd = perf_event_open(attr, ... );
close(fd);
From that one fd we can find to which 'event_source' it belongs and what
particular config we need to use.
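One way the fd | PERF_TYPE_FD encoding could be packed and unpacked is sketched below. The value of PERF_TYPE_FD is not specified anywhere in this mail, so the top bit of the 32-bit type field is purely an assumption here, and the three helper names are invented for the example.

```c
#include <stdint.h>

/*
 * ASSUMPTION: PERF_TYPE_FD is a flag in the top bit of the __u32 type
 * field. The real value (if any) is not given in this thread.
 */
#define PERF_TYPE_FD (UINT32_C(1) << 31)

/* Userspace side: pack an open sysfs fd into attr->type. */
uint32_t type_from_fd(int fd)
{
    return (uint32_t)fd | PERF_TYPE_FD;
}

/* Receiving side: does this type carry an fd rather than a type id? */
int type_is_fd(uint32_t type)
{
    return (type & PERF_TYPE_FD) != 0;
}

/* Receiving side: recover the fd by clearing the flag bit. */
int fd_from_type(uint32_t type)
{
    return (int)(type & ~PERF_TYPE_FD);
}
```

With a flag bit like this the existing small type values (PERF_TYPE_HARDWARE is 0, and so on) stay distinguishable from fd-carrying ones, since valid fds are non-negative and never set the top bit themselves.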
Plenty of opinions to be had on that I guess.
Anyway, this was the what, why and how of it.