[PATCH 10/12] x86, perf: Add Top Down events to Intel Core

From: Andi Kleen
Date: Tue Jan 19 2016 - 21:29:56 EST

Next message: Andi Kleen: "[PATCH 08/12] perf, tools, stat: Add extra output of counter values with -v"
Previous message: Andi Kleen: "[PATCH 12/12] x86, perf: Use new ht_on flag in HT leak workaround"
In reply to: Andi Kleen: "[PATCH 12/12] x86, perf: Use new ht_on flag in HT leak workaround"
Next in thread: Andi Kleen: "[PATCH 08/12] perf, tools, stat: Add extra output of counter values with -v"

From: Andi Kleen <ak@xxxxxxxxxxxxxxx>

Add declarations for the events needed for TopDown to the
Intel big core CPUs starting with Sandy Bridge. We need
to report different values depending on whether
HyperThreading is on or off.

The only thing this patch does is to export some events
in sysfs.

TopDown level 1 uses a set of abstracted events which
are generic to out-of-order CPU cores (although some
CPUs may not implement all of them):

  topdown-total-slots       Available slots in the pipeline
  topdown-slots-issued      Slots issued into the pipeline
  topdown-slots-retired     Slots successfully retired
  topdown-fetch-bubbles     Pipeline gaps in the frontend
  topdown-recovery-bubbles  Pipeline gaps during recovery
                            from misspeculation

A slot is a single operation in the CPU pipeline.

These events then allow computing four useful metrics:
FrontendBound, BackendBound, Retiring, BadSpeculation.

The formulas to compute the metrics are generic; they
only change based on the availability of the abstracted
input values.
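
As an illustration of how the four metrics follow from the five
events, here is a minimal sketch in C. It assumes the level 1
breakdown as I understand it from the Yasin paper referenced in this
message, with all inputs already scaled to slots. The names td_inputs,
td_level1 and td_compute_level1 are hypothetical and are not part of
the kernel or of perf.

```c
#include <stdint.h>

/*
 * Hypothetical container for the scaled TopDown level 1 inputs.
 * Field names mirror the topdown-* events; all values are in slots.
 */
struct td_inputs {
	uint64_t total_slots;
	uint64_t slots_issued;
	uint64_t slots_retired;
	uint64_t fetch_bubbles;
	uint64_t recovery_bubbles;
};

/* The four level 1 metrics, each a fraction of total slots. */
struct td_level1 {
	double frontend_bound;
	double bad_speculation;
	double retiring;
	double backend_bound;
};

/*
 * Assumed level 1 formulas:
 *   FrontendBound  = fetch_bubbles / total_slots
 *   Retiring       = slots_retired / total_slots
 *   BadSpeculation = (slots_issued - slots_retired + recovery_bubbles)
 *                    / total_slots
 *   BackendBound   = 1 - (FrontendBound + BadSpeculation + Retiring)
 * Returns all zeros when total_slots is 0.
 */
static struct td_level1 td_compute_level1(const struct td_inputs *in)
{
	struct td_level1 out = { 0.0, 0.0, 0.0, 0.0 };
	double total = (double)in->total_slots;

	if (in->total_slots == 0)
		return out;

	out.frontend_bound = (double)in->fetch_bubbles / total;
	out.retiring = (double)in->slots_retired / total;
	/* Convert before subtracting so issued < retired cannot wrap. */
	out.bad_speculation = ((double)in->slots_issued -
			       (double)in->slots_retired +
			       (double)in->recovery_bubbles) / total;
	out.backend_bound = 1.0 - (out.frontend_bound +
				   out.bad_speculation + out.retiring);
	return out;
}
```

BackendBound is derived as the remainder, so it needs no event of its
own. A real tool would probably clamp small negative values caused by
counter skew.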

The kernel declares the events supported by the current
CPU and their scaling factors (such as the pipeline width)
and perf stat then computes the formulas based on the
available metrics. This is similar to how existing
perf metrics, such as TSC metrics or IPC, are implemented.
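
To make the sysfs side concrete, here is a small sketch of how a
userspace tool could read one of the exported strings, for example
the event encoding or its .scale value. It assumes the aliases show up
under /sys/bus/event_source/devices/cpu/events/ like other PMU event
aliases. The helper td_read_attr is hypothetical and this is not how
perf itself parses aliases.

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: read the first line of <events_dir>/<name>
 * into buf and strip the trailing newline.
 * Returns 0 on success, -1 on any error.
 */
static int td_read_attr(const char *events_dir, const char *name,
			char *buf, size_t len)
{
	char path[512];
	FILE *f;
	int n;

	if (len == 0)
		return -1;

	n = snprintf(path, sizeof(path), "%s/%s", events_dir, name);
	if (n < 0 || (size_t)n >= sizeof(path))
		return -1;

	f = fopen(path, "r");
	if (!f)
		return -1;

	if (!fgets(buf, (int)len, f)) {
		fclose(f);
		return -1;
	}
	fclose(f);

	/* sysfs strings end in a newline; drop it. */
	buf[strcspn(buf, "\n")] = '\0';
	return 0;
}
```

With this patch applied, calling td_read_attr() with the events
directory above and the name "topdown-total-slots.scale" would be
expected to yield "4" with HT off and "2" with HT on.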

This keeps all CPU pipeline-specific knowledge in the
kernel driver, but still avoids the need for larger scale perf
interface changes.

For HyperThreading the any bit is needed to get accurate
values when both threads are executing. This implies that
the events can only be collected as root or with
perf_event_paranoid=-1 for now.

HyperThreading also requires averaging events from both
threads together (the CPU cannot measure them independently).

In perf stat this is already done by the per core mode. The
new .aggr-per-core attribute is added to the events, which
then forces perf stat to enable --per-core.
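
The scale factors follow from simple arithmetic, sketched below for
the cycle-based events. It assumes the 4-wide pipeline stated in the
code comment in this patch. The function names and the
TD_PIPELINE_WIDTH macro are made up for illustration.

```c
#include <stdint.h>

/* Pipeline width assumed by the patch comment (4-wide). */
#define TD_PIPELINE_WIDTH 4

/* HT off: one thread's cycle count times the width (scale "4"). */
static uint64_t td_slots_noht(uint64_t cycles)
{
	return cycles * TD_PIPELINE_WIDTH;
}

/*
 * HT on: both threads count with any=1, so each sees the core's
 * cycles. Per-core aggregation sums the two counts, and averaging
 * plus scaling gives (t0 + t1) / 2 * 4 == (t0 + t1) * 2 (scale "2").
 */
static uint64_t td_slots_ht(uint64_t count_t0, uint64_t count_t1)
{
	return (count_t0 + count_t1) * (TD_PIPELINE_WIDTH / 2);
}
```

Multiplying the sum by 2 instead of dividing first avoids losing the
low bit of an odd sum. A core that ran 1000 cycles yields 4000 slots
in both modes.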

The basic scheme is based on the following paper:

  Yasin, "A Top Down Method for Performance analysis and
  Counter architecture", ISPASS14
  (pdf available via google)

v2: Rework scaling. Fix formulas for HyperThreading.
v3: Rename agg-per-core to aggr-per-core.
    Always set aggr-per-core to one to get same output for HT off.
v4: Distinguish between forced and advisory aggr-per-core.

Signed-off-by: Andi Kleen <ak@xxxxxxxxxxxxxxx>
---
arch/x86/kernel/cpu/perf_event_intel.c | 74 ++++++++++++++++++++++++++++++++++
1 file changed, 74 insertions(+)

diff --git a/arch/x86/kernel/cpu/perf_event_intel.c b/arch/x86/kernel/cpu/perf_event_intel.c
index a667078..5a562b8 100644
--- a/arch/x86/kernel/cpu/perf_event_intel.c
+++ b/arch/x86/kernel/cpu/perf_event_intel.c
@@ -230,9 +230,65 @@ struct attribute *nhm_events_attrs[] = {
NULL,
};

+/*
+ * TopDown events for Core.
+ *
+ * All events are counted in slots; a slot is a single operation
+ * position in a 4 wide pipeline. Some events are already reported in
+ * slots; for cycle events we multiply by the pipeline width (4).
+ *
+ * With Hyper Threading on, TopDown metrics are either summed or averaged
+ * between the threads of a core: (count_t0 + count_t1).
+ *
+ * For the average case the metric is always scaled to pipeline width,
+ * so we use factor 2 ((count_t0 + count_t1) / 2 * 4)
+ *
+ * We tell perf to aggregate per core by setting the .aggr-per-core
+ * attribute for the alias to 1 or 2. 2 means it has to be per
+ * core, while 1 means it is optional (but on by default for consistency)
+ */
+
+EVENT_ATTR_STR_HT(topdown-total-slots, td_total_slots,
+ "event=0x3c,umask=0x0", /* cpu_clk_unhalted.thread */
+ "event=0x3c,umask=0x0,any=1"); /* cpu_clk_unhalted.thread_any */
+EVENT_ATTR_STR_HT(topdown-total-slots.scale, td_total_slots_scale, "4", "2");
+EVENT_ATTR_STR_HT(topdown-total-slots.aggr-per-core, td_total_slots_pc,
+ "1", "2");
+EVENT_ATTR_STR(topdown-slots-issued, td_slots_issued,
+ "event=0xe,umask=0x1"); /* uops_issued.any */
+EVENT_ATTR_STR_HT(topdown-slots-issued.aggr-per-core, td_slots_issued_pc,
+ "1", "2");
+EVENT_ATTR_STR(topdown-slots-retired, td_slots_retired,
+ "event=0xc2,umask=0x2"); /* uops_retired.retire_slots */
+EVENT_ATTR_STR_HT(topdown-slots-retired.aggr-per-core,
+ td_slots_retired_pc, "1", "2");
+EVENT_ATTR_STR(topdown-fetch-bubbles, td_fetch_bubbles,
+ "event=0x9c,umask=0x1"); /* idq_uops_not_delivered_core */
+EVENT_ATTR_STR_HT(topdown-fetch-bubbles.aggr-per-core,
+ td_fetch_bubbles_pc, "1", "2");
+EVENT_ATTR_STR_HT(topdown-recovery-bubbles, td_recovery_bubbles,
+ "event=0xd,umask=0x3,cmask=1", /* int_misc.recovery_cycles */
+ "event=0xd,umask=0x3,cmask=1,any=1"); /* int_misc.recovery_cycles_any */
+EVENT_ATTR_STR_HT(topdown-recovery-bubbles.scale, td_recovery_bubbles_scale,
+ "4", "2");
+EVENT_ATTR_STR_HT(topdown-recovery-bubbles.aggr-per-core,
+ td_recovery_bubbles_pc, "1", "2");
+
struct attribute *snb_events_attrs[] = {
EVENT_PTR(mem_ld_snb),
EVENT_PTR(mem_st_snb),
+ EVENT_PTR(td_slots_issued),
+ EVENT_PTR(td_slots_issued_pc),
+ EVENT_PTR(td_slots_retired),
+ EVENT_PTR(td_slots_retired_pc),
+ EVENT_PTR(td_fetch_bubbles),
+ EVENT_PTR(td_fetch_bubbles_pc),
+ EVENT_PTR(td_total_slots),
+ EVENT_PTR(td_total_slots_scale),
+ EVENT_PTR(td_total_slots_pc),
+ EVENT_PTR(td_recovery_bubbles),
+ EVENT_PTR(td_recovery_bubbles_scale),
+ EVENT_PTR(td_recovery_bubbles_pc),
NULL,
};

@@ -3283,6 +3339,18 @@ static struct attribute *hsw_events_attrs[] = {
EVENT_PTR(cycles_ct),
EVENT_PTR(mem_ld_hsw),
EVENT_PTR(mem_st_hsw),
+ EVENT_PTR(td_slots_issued),
+ EVENT_PTR(td_slots_issued_pc),
+ EVENT_PTR(td_slots_retired),
+ EVENT_PTR(td_slots_retired_pc),
+ EVENT_PTR(td_fetch_bubbles),
+ EVENT_PTR(td_fetch_bubbles_pc),
+ EVENT_PTR(td_total_slots),
+ EVENT_PTR(td_total_slots_scale),
+ EVENT_PTR(td_total_slots_pc),
+ EVENT_PTR(td_recovery_bubbles),
+ EVENT_PTR(td_recovery_bubbles_scale),
+ EVENT_PTR(td_recovery_bubbles_pc),
NULL
};

@@ -3622,6 +3690,12 @@ __init int intel_pmu_init(void)
memcpy(hw_cache_extra_regs, skl_hw_cache_extra_regs, sizeof(hw_cache_extra_regs));
intel_pmu_lbr_init_skl();

+ /* INT_MISC.RECOVERY_CYCLES has umask 1 in Skylake */
+ event_attr_td_recovery_bubbles.event_str_noht =
+ "event=0xd,umask=0x1,cmask=1";
+ event_attr_td_recovery_bubbles.event_str_ht =
+ "event=0xd,umask=0x1,cmask=1,any=1";
+
x86_pmu.event_constraints = intel_skl_event_constraints;
x86_pmu.pebs_constraints = intel_skl_pebs_event_constraints;
x86_pmu.extra_regs = intel_skl_extra_regs;
--
2.4.3
