[PATCH v3 0/3] perf: Add AUX data sampling

From: Alexander Shishkin
Date: Fri Oct 25 2019 - 10:08:54 EST

Hi Peter,

Here's another version of the AUX sampling, addressing all the comments
from the previous one [5]: fixed group leader refcount leak if both
aux_source and aux_sample_size are set, changed aux_sample_size to u32
in the ABI and removed the pointless sample init bit. Also dropped 4/4
from this series, will send separately. This one has a context dependency
on the attr.__reserved_2 fix [6], as it adds one more reserved bit.

Changes since version one [3]: it addresses the issues of NMI-safety
and sampling hardware events. The former is addressed by adding a new
PMU callback, the latter by making use of grouping. It also depends
on the AUX output stop fix [4] to work correctly. I decided to post
them separately, because [4] is also a candidate for perf/urgent.

This series introduces AUX data sampling for perf events, which in
case of our instruction/branch tracing PMUs like Intel PT, BTS, CS
ETM means execution flow history leading up to a perf event's overflow.

In the case of Intel PT, this can be used as an alternative to LBR, with
virtually as many branches per sample as you like. It doesn't support
some of the LBR features (branch prediction indication, basic block
level timing, etc. [1]) and it can't be exposed as BRANCH_STACK, because
that would require decoding the PT stream in kernel space, which is not
practical. Instead, we deliver the PT data to userspace as is, for
offline processing. The PT decoder already supports presenting PT as
virtual LBR.
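
To illustrate what "as is" means for a consumer, here is a minimal
userspace sketch of pulling the raw AUX bytes out of one sample. It assumes
the AUX part of a sample is a u64 size followed by that many bytes of
trace data, which is how the mainline uapi header documents
PERF_SAMPLE_AUX; this cover letter does not spell the layout out, so treat
it as an assumption. struct aux_chunk and parse_aux_chunk() are
hypothetical names used only for this example, not part of the series or
of the perf tool.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical view of the AUX part of one sample; not a kernel type. */
struct aux_chunk {
	const uint8_t *data;	/* raw trace bytes, e.g. Intel PT packets */
	uint64_t size;		/* number of bytes available at data */
};

/*
 * Parse the AUX part of a sample record starting at p, with avail bytes
 * left in the record.
 *
 * Assumed layout (host byte order): { u64 size; char data[size]; }
 *
 * Returns the number of bytes consumed, or 0 if the record is too short
 * to hold the size field or the payload it announces.
 */
size_t parse_aux_chunk(const uint8_t *p, size_t avail, struct aux_chunk *out)
{
	uint64_t size;

	if (avail < sizeof(size))
		return 0;

	/* memcpy avoids an unaligned 64-bit load from the record */
	memcpy(&size, p, sizeof(size));

	if (size > avail - sizeof(size))
		return 0;

	out->data = p + sizeof(size);
	out->size = size;

	return sizeof(size) + (size_t)size;
}
```

The bytes at out->data are not interpreted here; they would be handed to
a PT decoder unchanged, which is the point of not decoding in the kernel.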

AUX sampling is different from the snapshot mode in that it doesn't
require instrumentation (for when to take a snapshot) and is better
for generic data collection, when you don't yet know what you are
looking for. It's also useful for automated data collection, for
example, for feedback-driven compiler optimizations.

It's also different from the "full trace mode" in that it produces
much less data and, consequently, takes up less I/O bandwidth and
storage space, and takes less time to decode.

The bulk of code is in 1/3, which adds the user interface bits and
the code to measure and copy out AUX data. 3/3 adds PT side support
for sampling. The former 4/4, dropped from this series, is not strictly
related, but improves PT's snapshot mode by implementing simpler buffer
management that would also benefit the sampling.

The tooling support is ready, although I'm not including it here to
save bandwidth. Adrian or I will post it separately. Meanwhile,
it can be found here [2], updated to reflect the ABI change.

[1] https://marc.info/?l=linux-kernel&m=147467007714928&w=2
[2] https://git.kernel.org/cgit/linux/kernel/git/ash/linux.git/log/?h=perf-aux-sampling
[3] https://marc.info/?l=linux-kernel&m=152878999928771
[4] https://marc.info/?l=linux-kernel&m=157172999231707
[5] https://marc.info/?l=linux-kernel&m=157173832302445
[6] https://marc.info/?l=linux-kernel&m=157200581818800

Alexander Shishkin (3):
  perf: Allow using AUX data in perf samples
  perf/x86/intel/pt: Factor out starting the trace
  perf/x86/intel/pt: Add sampling support

 arch/x86/events/intel/pt.c      |  76 ++++++++++++--
 include/linux/perf_event.h      |  19 ++++
 include/uapi/linux/perf_event.h |  10 +-
 kernel/events/core.c            | 172 +++++++++++++++++++++++++++++++-
 kernel/events/internal.h        |   1 +
 kernel/events/ring_buffer.c     |  36 +++++++
 6 files changed, 303 insertions(+), 11 deletions(-)