Re: [PATCH 01/13] perf_events: add generic taken branch sampling support (v3)

From: Stephane Eranian
Date: Fri Jan 27 2012 - 05:06:11 EST


On Fri, Jan 27, 2012 at 5:46 AM, Anshuman Khandual
<khandual@xxxxxxxxxxxxxxxxxx> wrote:
> On Monday 09 January 2012 10:19 PM, Stephane Eranian wrote:
>> This patch adds the ability to sample taken branches to the
>> perf_event interface.
>>
>> The ability to capture taken branches is very useful for all
>> sorts of analysis. For instance, basic block profiling, call
>> counts, statistical call graph.
>>
>> This new capability requires hardware assist and as such may
>> not be available on all HW platforms. On Intel X86, it is
>> implemented on top of the Last Branch Record (LBR) facility.
>>
>> To enable taken branches sampling, the PERF_SAMPLE_BRANCH_STACK
>> bit must be set in attr->sample_type.
>>
>> Sampled taken branches may be filtered by type and/or priv
>> levels.
>>
>> The patch adds a new field, called branch_sample_type, to the
>> perf_event_attr structure. It contains a bitmask of filters
>> to apply to the sampled taken branches.
>>
>> Filters may be implemented in HW. If the HW filter does not exist
>> or is not good enough, some arch may also implement a SW filter.
>>
>> The following generic filters are currently defined:
>> - PERF_SAMPLE_BRANCH_USER
>>   only branches whose targets are at the user level
>>
>> - PERF_SAMPLE_BRANCH_KERNEL
>>   only branches whose targets are at the kernel level
>>
>> - PERF_SAMPLE_BRANCH_ANY
>>   any type of branches (subject to priv levels filters)
>>
>> - PERF_SAMPLE_BRANCH_ANY_CALL
>>   any call branches (may incl. syscall on some arch)
>>
>> - PERF_SAMPLE_BRANCH_ANY_RETURN
>>   any return branches (may incl. syscall returns on some arch)
>>
>> - PERF_SAMPLE_BRANCH_IND_CALL
>>   indirect call branches
>>
>> Obviously filters may be combined. The priv level bits are optional.
>> If not provided, the priv level of the associated event is used. It
>> is possible to collect branches at a priv level different from the
>> associated event.
>>
>> The number of taken branch records present in each sample may vary based
>> on the HW, the type of sampled branches, and the executed code. Therefore
>> each sample records the number of taken branches it contains.
>>
>> Signed-off-by: Stephane Eranian <eranian@xxxxxxxxxx>
> Reviewed-by: Anshuman Khandual <khandual@xxxxxxxxxxxxxxxxxx>
>> ---
>>  arch/x86/kernel/cpu/perf_event_intel_lbr.c |   21 +++++---
>>  include/linux/perf_event.h                 |   66 ++++++++++++++++++++++++++--
>>  kernel/events/core.c                       |   58 ++++++++++++++++++++++++
>>  3 files changed, 133 insertions(+), 12 deletions(-)
>>
>> diff --git a/arch/x86/kernel/cpu/perf_event_intel_lbr.c b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
>> index 3fab3de..c3f8100 100644
>> --- a/arch/x86/kernel/cpu/perf_event_intel_lbr.c
>> +++ b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
>> @@ -144,9 +144,11 @@ static void intel_pmu_lbr_read_32(struct cpu_hw_events *cpuc)
>>
>>                 rdmsrl(x86_pmu.lbr_from + lbr_idx, msr_lastbranch.lbr);
>>
>> -               cpuc->lbr_entries[i].from  = msr_lastbranch.from;
>> -               cpuc->lbr_entries[i].to    = msr_lastbranch.to;
>> -               cpuc->lbr_entries[i].flags = 0;
>> +               cpuc->lbr_entries[i].from       = msr_lastbranch.from;
>> +               cpuc->lbr_entries[i].to         = msr_lastbranch.to;
>> +               cpuc->lbr_entries[i].mispred    = 0;
>> +               cpuc->lbr_entries[i].predicted  = 0;
>> +               cpuc->lbr_entries[i].reserved   = 0;
>>         }
>>         cpuc->lbr_stack.nr = i;
>>  }
>> @@ -167,19 +169,22 @@ static void intel_pmu_lbr_read_64(struct cpu_hw_events *cpuc)
>>
>>         for (i = 0; i < x86_pmu.lbr_nr; i++) {
>>                 unsigned long lbr_idx = (tos - i) & mask;
>> -               u64 from, to, flags = 0;
>> +               u64 from, to, mis = 0, pred = 0;
>>
>>                 rdmsrl(x86_pmu.lbr_from + lbr_idx, from);
>>                 rdmsrl(x86_pmu.lbr_to   + lbr_idx, to);
>>
>>                 if (lbr_format == LBR_FORMAT_EIP_FLAGS) {
>> -                       flags = !!(from & LBR_FROM_FLAG_MISPRED);
>> +                       mis = !!(from & LBR_FROM_FLAG_MISPRED);
>> +                       pred = !mis;
>>                         from = (u64)((((s64)from) << 1) >> 1);
>>                 }
>>
>> -               cpuc->lbr_entries[i].from  = from;
>> -               cpuc->lbr_entries[i].to    = to;
>> -               cpuc->lbr_entries[i].flags = flags;
>> +               cpuc->lbr_entries[i].from       = from;
>> +               cpuc->lbr_entries[i].to         = to;
>> +               cpuc->lbr_entries[i].mispred    = mis;
>> +               cpuc->lbr_entries[i].predicted  = pred;
>> +               cpuc->lbr_entries[i].reserved   = 0;
>>         }
>>         cpuc->lbr_stack.nr = i;
>>  }
>> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
>> index 0b91db2..17751b1 100644
>> --- a/include/linux/perf_event.h
>> +++ b/include/linux/perf_event.h
>> @@ -129,11 +129,38 @@ enum perf_event_sample_format {
>>         PERF_SAMPLE_PERIOD                      = 1U << 8,
>>         PERF_SAMPLE_STREAM_ID                   = 1U << 9,
>>         PERF_SAMPLE_RAW                         = 1U << 10,
>> +       PERF_SAMPLE_BRANCH_STACK                = 1U << 11,
>>
>> -       PERF_SAMPLE_MAX = 1U << 11,             /* non-ABI */
>> +       PERF_SAMPLE_MAX = 1U << 12,             /* non-ABI */
>>  };
>>
>>  /*
>> + * values to program into branch_sample_type when PERF_SAMPLE_BRANCH_STACK is set
>> + *
>> + * If the user does not pass priv level information via branch_sample_type,
>> + * the kernel uses the event's priv level. Branch and event priv levels do
>> + * not have to match. Branch priv level is checked for permissions.
>> + *
>> + * The branch types can be combined, however BRANCH_ANY covers all types
>> + * of branches and therefore it supersedes all the other types.
>> + */
>> +enum perf_branch_sample_type {
>> +       PERF_SAMPLE_BRANCH_USER         = 1U << 0, /* user level branches */
>> +       PERF_SAMPLE_BRANCH_KERNEL       = 1U << 1, /* kernel level branches */
>> +
>> +       PERF_SAMPLE_BRANCH_ANY          = 1U << 2, /* any branch types */
>> +       PERF_SAMPLE_BRANCH_ANY_CALL     = 1U << 3, /* any call branch */
>> +       PERF_SAMPLE_BRANCH_ANY_RETURN   = 1U << 4, /* any return branch */
>> +       PERF_SAMPLE_BRANCH_IND_CALL     = 1U << 5, /* indirect calls */
>> +
>> +       PERF_SAMPLE_BRANCH_MAX          = 1U << 6, /* non-ABI */
>> +};
>> +
>> +#define PERF_SAMPLE_BRANCH_PLM_ALL \
>> +       (PERF_SAMPLE_BRANCH_USER|\
>> +        PERF_SAMPLE_BRANCH_KERNEL)
>> +
>> +/*
>>   * The format of the data returned by read() on a perf event fd,
>>   * as specified by attr.read_format:
>>   *
>> @@ -240,6 +267,7 @@ struct perf_event_attr {
>>                 __u64           bp_len;
>>                 __u64           config2; /* extension of config1 */
>>         };
>> +       __u64   branch_sample_type; /* enum perf_branch_sample_type */
>>  };
>>
>>  /*
>> @@ -458,6 +486,8 @@ enum perf_event_type {
>>          *
>>          *      { u32                   size;
>>          *        char                  data[size];}&& PERF_SAMPLE_RAW
>> +        *
>> +        *      { u64 nr; { u64 from, to, flags } lbr[nr];} && PERF_SAMPLE_BRANCH_STACK
>>          * };
>>          */
>>         PERF_RECORD_SAMPLE                      = 9,
>> @@ -530,12 +560,31 @@ struct perf_raw_record {
>>         void                            *data;
>>  };
>>
>> +/*
>> + * single taken branch record layout:
>> + *
>> + *      from: source instruction (may not always be a branch insn)
>> + *        to: branch target
>> + *   mispred: branch target was mispredicted
>> + * predicted: branch target was predicted
>> + *
>> + * support for mispred, predicted is optional. In case it
>> + * is not supported mispred = predicted = 0.
>> + */
> So the user-level perf tools would check for ((mispred == 0) && (predicted == 0))
> in a sample and report that it's not supported by the HW PMU? The point here is
> that if it's not supported we should say "No HW support" rather than displaying
> mispred = 0 and predicted = 0 (as this could be misleading).
>>  struct perf_branch_entry {
>> -       __u64                           from;
>> -       __u64                           to;
>> -       __u64                           flags;
>> +       __u64   from;
>> +       __u64   to;
>> +       __u64   mispred:1,  /* target mispredicted */
>> +               predicted:1,/* target predicted */
>> +               reserved:62;
>>  };
>>
>> +/*
>> + * branch stack layout:
>> + *  nr: number of taken branches stored in entries[]
>> + *
>> + * Note that nr can vary from sample to sample
>> + */
>>  struct perf_branch_stack {
>>         __u64                           nr;
>>         struct perf_branch_entry        entries[0];
>> @@ -566,7 +615,9 @@ struct hw_perf_event {
>>                         unsigned long   event_base;
>>                         int             idx;
>>                         int             last_cpu;
>> +
>>                         struct hw_perf_event_extra extra_reg;
>> +                       struct hw_perf_event_extra branch_reg;
>>                 };
>>                 struct { /* software */
>>                         struct hrtimer  hrtimer;
>> @@ -1003,12 +1054,14 @@ struct perf_sample_data {
>>         u64                             period;
>>         struct perf_callchain_entry     *callchain;
>>         struct perf_raw_record          *raw;
>> +       struct perf_branch_stack        *br_stack;
>>  };
>>
>>  static inline void perf_sample_data_init(struct perf_sample_data *data, u64 addr)
>>  {
>>         data->addr = addr;
>>         data->raw  = NULL;
>> +       data->br_stack = NULL;
>>  }
>>
>>  extern void perf_output_sample(struct perf_output_handle *handle,
>> @@ -1147,6 +1200,11 @@ extern void perf_bp_event(struct perf_event *event, void *data);
>>  # define perf_instruction_pointer(regs)      instruction_pointer(regs)
>>  #endif
>>
>> +static inline bool has_branch_stack(struct perf_event *event)
>> +{
>> +       return event->attr.sample_type & PERF_SAMPLE_BRANCH_STACK;
>> +}
>> +
>>  extern int perf_output_begin(struct perf_output_handle *handle,
>>                               struct perf_event *event, unsigned int size);
>>  extern void perf_output_end(struct perf_output_handle *handle);
>> diff --git a/kernel/events/core.c b/kernel/events/core.c
>> index 91fb68a..ed39225 100644
>> --- a/kernel/events/core.c
>> +++ b/kernel/events/core.c
>> @@ -3877,6 +3877,24 @@ void perf_output_sample(struct perf_output_handle *handle,
>>                         }
>>                 }
>>         }
>> +
>> +       if (sample_type & PERF_SAMPLE_BRANCH_STACK) {
>> +               if (data->br_stack) {
>> +                       size_t size;
>> +
>> +                       size = data->br_stack->nr
>> +                            * sizeof(struct perf_branch_entry);
>> +
>> +                       perf_output_put(handle, data->br_stack->nr);
>> +                       perf_output_copy(handle, data->br_stack->entries, size);
>> +               } else {
>> +                       /*
>> +                        * we always store at least the value of nr
>> +                        */
>> +                       u64 nr = 0;
>> +                       perf_output_put(handle, nr);
>> +               }
>> +       }
>>  }
>>
>>  void perf_prepare_sample(struct perf_event_header *header,
>> @@ -3919,6 +3937,15 @@ void perf_prepare_sample(struct perf_event_header *header,
>>                 WARN_ON_ONCE(size & (sizeof(u64)-1));
>>                 header->size += size;
>>         }
>> +
>> +       if (sample_type & PERF_SAMPLE_BRANCH_STACK) {
>> +               int size = sizeof(u64); /* nr */
>> +               if (data->br_stack) {
>> +                       size += data->br_stack->nr
>> +                             * sizeof(struct perf_branch_entry);
>> +               }
>> +               header->size += size;
>> +       }
>>  }
>>
>>  static void perf_event_output(struct perf_event *event,
>> @@ -5898,6 +5925,37 @@ static int perf_copy_attr(struct perf_event_attr __user *uattr,
>>         if (attr->read_format & ~(PERF_FORMAT_MAX-1))
>>                 return -EINVAL;
>>
>> +       if (attr->sample_type & PERF_SAMPLE_BRANCH_STACK) {
>> +               u64 mask = attr->branch_sample_type;
>> +
>> +               /* only using defined bits */
>> +               if (mask & ~(PERF_SAMPLE_BRANCH_MAX-1))
>> +                       return -EINVAL;
>> +
>> +               /* at least one branch bit must be set */
>> +               if (!(mask & ~PERF_SAMPLE_BRANCH_PLM_ALL))
>> +                       return -EINVAL;
>> +
>> +               /* kernel level capture */
>> +               if ((mask & PERF_SAMPLE_BRANCH_KERNEL)
>> +                   && perf_paranoid_kernel() && !capable(CAP_SYS_ADMIN))
>> +                       return -EACCES;
>> +
>> +               /* propagate priv level, when not set for branch */
>> +               if (!(mask & PERF_SAMPLE_BRANCH_PLM_ALL)) {
>> +
>> +                       /* exclude_kernel checked on syscall entry */
>> +                       if (!attr->exclude_kernel)
>> +                               mask |= PERF_SAMPLE_BRANCH_KERNEL;
>> +
>> +                       if (!attr->exclude_user)
>> +                               mask |= PERF_SAMPLE_BRANCH_USER;
> Why are we not taking attr->exclude_hv into account? Shouldn't we define
> PERF_SAMPLE_BRANCH_HV for hypervisor-level branches?

Yes, we can add this, though I don't have any system to test it on.
I will post a patch to add this priv level.
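
As a rough illustration of how the check could look with a third priv
level, here is a user-space model of the branch_sample_type validation
and propagation logic from the hunk above. This is only a sketch, not
the follow-up patch: the BR_* names, struct attr_model,
check_branch_mask() and the may_trace_kernel flag are made up for the
example, and the bit chosen for the hypervisor level (1U << 6) is an
assumption.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Bit values mirror the patch; BR_HV is hypothetical (not in this patch). */
enum {
	BR_USER     = 1U << 0,
	BR_KERNEL   = 1U << 1,
	BR_ANY      = 1U << 2,
	BR_ANY_CALL = 1U << 3,
	BR_ANY_RET  = 1U << 4,
	BR_IND_CALL = 1U << 5,
	BR_HV       = 1U << 6,	/* assumption: next free bit */
	BR_MAX      = 1U << 7,	/* first undefined bit */
};

#define BR_PLM_ALL (BR_USER | BR_KERNEL | BR_HV)

/* Stand-in for the few perf_event_attr fields used here. */
struct attr_model {
	uint64_t branch_sample_type;
	bool exclude_user;
	bool exclude_kernel;
	bool exclude_hv;
};

/*
 * may_trace_kernel models the kernel's
 * !perf_paranoid_kernel() || capable(CAP_SYS_ADMIN) test.
 * Returns 0 on success or a negative errno value.
 */
static int check_branch_mask(struct attr_model *attr, bool may_trace_kernel)
{
	uint64_t mask;

	assert(attr != NULL);
	mask = attr->branch_sample_type;

	/* only defined bits may be used */
	if (mask & ~(uint64_t)(BR_MAX - 1))
		return -EINVAL;

	/* at least one branch type bit must be set */
	if (!(mask & ~(uint64_t)BR_PLM_ALL))
		return -EINVAL;

	/* explicit kernel level capture needs permission */
	if ((mask & BR_KERNEL) && !may_trace_kernel)
		return -EACCES;

	/* no priv level given: inherit the event's priv levels */
	if (!(mask & BR_PLM_ALL)) {
		if (!attr->exclude_kernel)
			mask |= BR_KERNEL;
		if (!attr->exclude_user)
			mask |= BR_USER;
		if (!attr->exclude_hv)
			mask |= BR_HV;
		attr->branch_sample_type = mask;
	}
	return 0;
}
```

With branch_sample_type = BR_ANY_CALL and exclude_kernel = exclude_hv = 1,
the model returns 0 and leaves BR_ANY_CALL | BR_USER in the attr, i.e.
the branch filter follows the event's priv levels.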

>> +                       /*
>> +                        * adjust user setting (for HW filter setup)
>> +                        */
>> +                       attr->branch_sample_type = mask;
>> +               }
>> +       }
>>  out:
>>         return ret;
>>
>
>
> --
> Linux Technology Centre
> IBM Systems and Technology Group
> Bangalore India
>
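
On the mispred/predicted question above, a tool-side decoder could map
the two bits to three states along these lines. This is a minimal
sketch under assumptions, not code from perf: the names branch_rec,
pred_state, pred_state_of() and decode_branch_stack() are invented for
the example, and it assumes the compiler places mispred in bit 0 and
predicted in bit 1 of the third u64 (bitfield layout is implementation
defined; this is the usual layout on little-endian x86). The payload
layout (u64 nr followed by nr entries) follows perf_output_sample() in
the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum pred_state { PRED_UNSUPPORTED, PRED_HIT, PRED_MISS };

/* One record as it appears in the sample: from, to, then the flag word. */
struct branch_rec {
	uint64_t from;
	uint64_t to;
	uint64_t flags;	/* assumed: bit 0 = mispred, bit 1 = predicted */
};

/* mispred = predicted = 0 means the HW does not report prediction info. */
static enum pred_state pred_state_of(uint64_t flags)
{
	bool mis  = flags & 1;
	bool pred = (flags >> 1) & 1;

	if (mis)
		return PRED_MISS;
	if (pred)
		return PRED_HIT;
	return PRED_UNSUPPORTED;
}

/*
 * Decode the PERF_SAMPLE_BRANCH_STACK part of a sample from buf.
 * Returns the number of records copied to out, or -1 if the buffer
 * is too short for the advertised nr or out cannot hold nr records.
 */
static long decode_branch_stack(const void *buf, size_t len,
				struct branch_rec *out, size_t max_out)
{
	uint64_t nr;

	assert(buf != NULL && out != NULL);

	if (len < sizeof(nr))
		return -1;
	memcpy(&nr, buf, sizeof(nr));

	/* nr varies from sample to sample, so bound it by what is there */
	if (nr > (len - sizeof(nr)) / sizeof(*out) || nr > max_out)
		return -1;

	memcpy(out, (const char *)buf + sizeof(nr), nr * sizeof(*out));
	return (long)nr;
}
```

A tool could then print something like "No HW support" whenever
pred_state_of() yields PRED_UNSUPPORTED instead of showing two zero
flags.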