[RFC PATCH 00/17] perf: Detached events
From: Alexander Shishkin
Date: Tue Sep 05 2017 - 09:30:46 EST
Hi,
I'm going to keep this short.
Objective: include perf data (specifically, AUX/Intel PT) in process core
dumps.
Obstacles and how this patchset deals with them:
(1) Need to be able to have perf events running without consumer (perf
record) running in the background.
Detached events: a new flag to the perf syscall makes a 'detached' event,
which continues to exist after its file descriptor is released. Not all
detached events are per-thread AUX events: this tries to take into account
the need for system-wide persistent events too.
(2) Need to be able to kill those events, so they need to be accessible
after they are created.
Event files: detached events exist as files in tracefs (at the moment) and
can be opened/mmapped/read/removed.
(3) Ring buffer contents from these events need to end up in the core dump
file.
Injecting perf ring buffer into the target task's address space.
(4) Inheritance will have to allocate ring buffers for such events for this
feature to be useful.
A parentless detached event is created (with a ring buffer) upon
inheritance; there is no output redirection, so each event has its own ring
buffer.
(5) A side effect of (4) is that we can't use GFP_KERNEL pages for such ring
buffers or else we'll have to fail inherit_event() (and, therefore, the user's
fork()) when they exhaust their mlock limit.
Using shmemfs-backed pages for such a ring buffer and only pinning them
while the corresponding target task is running. Other times these pages can
be swapped out.
(6) Ring buffer memory accounting needs to take this new arrangement into
account: one user can use up at most NR_CPUS * buffer_size memory at any
given point in time.
Only account the first such event and undo the accounting when the last
event is gone.
(7) We'll also need to supply all the things that the [PT] decoder normally
finds out via sysfs attributes, like clock ratios, capabilities, etc., so that
they also find their way into the core dump file.
"PMU info" structure is appended to the user page.
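
To make (5) and (6) a bit more concrete, here is a minimal userspace sketch of
the idea: pin a buffer only while its task runs, charge the user for the first
event only, and undo the charge when the last event is gone. This is not the
code from the series. All names below (struct user_acct, struct shmem_rb,
shmem_rb_account() and friends) are made up for illustration, the limit is a
plain number standing in for the mlock limit, and what exactly gets charged is
left to the caller as an assumption.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-user state; a stand-in for what the series tracks per user. */
struct user_acct {
	unsigned long locked_pages;  /* pages currently charged to this user */
	unsigned long limit_pages;   /* stand-in for the user's mlock limit */
	unsigned long charged_pages; /* what the first event charged */
	unsigned int  nr_events;     /* live shmem-backed events of this user */
	unsigned long pinned_pages;  /* pages pinned right now */
};

/* Hypothetical shmem-backed ring buffer of one per-thread event. */
struct shmem_rb {
	unsigned long nr_pages;
	bool pinned;
};

/* (6): charge only for the user's first event; later events share the charge. */
bool shmem_rb_account(struct user_acct *u, unsigned long charge)
{
	if (u->nr_events == 0) {
		if (u->locked_pages + charge > u->limit_pages)
			return false; /* first event would exceed the limit */
		u->locked_pages += charge;
		u->charged_pages = charge;
	}
	u->nr_events++;
	return true;
}

/* (6): undo the charge when the user's last event goes away. */
void shmem_rb_unaccount(struct user_acct *u)
{
	assert(u->nr_events > 0);
	if (--u->nr_events == 0) {
		u->locked_pages -= u->charged_pages;
		u->charged_pages = 0;
	}
}

/* (5): pin the buffer when its target task is scheduled in. */
void shmem_rb_sched_in(struct user_acct *u, struct shmem_rb *rb)
{
	if (!rb->pinned) {
		rb->pinned = true;
		u->pinned_pages += rb->nr_pages;
	}
}

/* (5): unpin on schedule-out; the pages may be swapped out from here on. */
void shmem_rb_sched_out(struct user_acct *u, struct shmem_rb *rb)
{
	if (rb->pinned) {
		rb->pinned = false;
		u->pinned_pages -= rb->nr_pages;
	}
}
```

If the caller passes NR_CPUS * buffer_size as the charge (an assumption about
what the series charges), a single charge covers the worst case from (6),
since no more than NR_CPUS tasks can be running, and therefore pinned, at once.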
I've also hacked the perf tool to support all this; all these things can be
found at [1]. I'm not posting the tooling patches, though, as they are
thoroughly ugly and proof-of-concept. In short, perf record will create
detached events with '--detached' and afterwards will open detached events
via their path in tracefs.
[1] https://git.kernel.org/pub/scm/linux/kernel/git/ash/linux.git/log/?h=perf-detached-shmem-wip
Alexander Shishkin (17):
perf: Allow mmapping only user page
perf: Factor out mlock accounting
tracefs: De-globalize instances' callbacks
tracefs: Add ->unlink callback to tracefs_dir_ops
perf: Introduce detached events
perf: Add buffers to the detached events
perf: Add pmu_info to user page
perf: Allow inheritance for detached events
perf: Use shmemfs pages for userspace-only per-thread detached events
perf: Implement pinning and scheduling for SHMEM events
perf: Implement mlock accounting for shmem ring buffers
perf: Track pinned events per user
perf: Re-inject shmem buffers after exec
perf: Add ioctl(REATTACH) for detached events
perf: Allow controlled non-root access to detached events
perf/x86/intel/pt: Add PMU info
perf/x86/intel/bts: Add PMU info
arch/x86/events/intel/bts.c | 20 +-
arch/x86/events/intel/pt.c | 23 +-
arch/x86/events/intel/pt.h | 11 +
fs/tracefs/inode.c | 71 +++-
include/linux/perf_event.h | 33 ++
include/linux/sched/user.h | 6 +
include/linux/tracefs.h | 3 +-
include/uapi/linux/perf_event.h | 15 +
kernel/events/core.c | 526 +++++++++++++++++++++++------
kernel/events/internal.h | 27 +-
kernel/events/ring_buffer.c | 730 ++++++++++++++++++++++++++++++++++++--
kernel/trace/trace.c | 8 +-
kernel/user.c | 1 +
13 files changed, 1315 insertions(+), 159 deletions(-)
--
2.14.1