[PATCH 1/4] x86, perf: Use a new PMU ack sequence on Skylake

From: Andi Kleen
Date: Thu Oct 15 2015 - 19:38:44 EST

Next message: Andi Kleen: "[PATCH 2/4] x86, perf: Factor out BTS enable/disable functions"
Previous message: Doug Anderson: "Re: [PATCH v2] usb: dwc2: host: Fix use after free w/ simultaneous irqs"
In reply to: Andi Kleen: "[PATCH 4/4] x86, perf: Use INST_RETIRED.PREC_DIST for cycles:pp on Skylake"
Next in thread: Andi Kleen: "[PATCH 2/4] x86, perf: Factor out BTS enable/disable functions"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

From: Andi Kleen <ak@xxxxxxxxxxxxxxx>

The SKL PMU code had a problem with LBR freezing. When a counter
overflows already in the PMI handler, the LBR would be frozen
early and not be unfrozen until the next PMI. This means we would
get stale LBR information.

Depending on the workload this could happen for a few percent
of the PMIs for cycles in adaptive frequency mode, because the frequency
algorithm regularly goes down to very low periods.

This patch implements a new PMU ack sequence that avoids this problem.
The new sequence is:

- (counters are disabled with GLOBAL_CTRL)
There should be no further increments of the counters by later instructions; and
thus no additional PMI (and thus no additional freezing).

- ack the APIC

Clear the APIC PMI LVT entry so that any later interrupt is delivered and is
not lost due to the PMI LVT entry being masked. A lost PMI interrupt could lead to
LBRs staying frozen without entering the PMI handler

- Ack the PMU counters. This unfreezes the LBRs on Skylake (but not
on earlier CPUs which rely on DEBUGCTL writes for this)

- Reenable counters

The WRMSR will start the counters counting again (and will be ordered after the
APIC LVT PMI entry write since WRMSR is architecturally serializing). Since the
APIC PMI LVT is unmasked, any PMI which is caused by these perfmon counters
will trigger an NMI (but the delivery may be delayed until after the next
IRET)

One side effect is that the old retry loop is not possible anymore,
as the counters stay unacked for the majority of the PMI handler,
but that is not a big loss, as "profiling" the PMI was always
a bit dubious. For the old ack sequence it is still supported.

In principle the sequence should work on other CPUs too, but
since I only tested on Skylake it is only enabled there.

Signed-off-by: Andi Kleen <ak@xxxxxxxxxxxxxxx>
---
arch/x86/kernel/cpu/perf_event.h | 1 +
arch/x86/kernel/cpu/perf_event_intel.c | 35 ++++++++++++++++++++++++++--------
2 files changed, 28 insertions(+), 8 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event.h b/arch/x86/kernel/cpu/perf_event.h
index 499f533..fcf01c7 100644
--- a/arch/x86/kernel/cpu/perf_event.h
+++ b/arch/x86/kernel/cpu/perf_event.h
@@ -551,6 +551,7 @@ struct x86_pmu {
struct x86_pmu_quirk *quirks;
int perfctr_second_write;
bool late_ack;
+ bool status_ack_after_apic;
unsigned (*limit_period)(struct perf_event *event, unsigned l);

/*
diff --git a/arch/x86/kernel/cpu/perf_event_intel.c b/arch/x86/kernel/cpu/perf_event_intel.c
index f63360b..69a545e 100644
--- a/arch/x86/kernel/cpu/perf_event_intel.c
+++ b/arch/x86/kernel/cpu/perf_event_intel.c
@@ -1789,6 +1789,7 @@ static int intel_pmu_handle_irq(struct pt_regs *regs)
struct cpu_hw_events *cpuc;
int bit, loops;
u64 status;
+ u64 orig_status;
int handled;

cpuc = this_cpu_ptr(&cpu_hw_events);
@@ -1803,13 +1804,16 @@ static int intel_pmu_handle_irq(struct pt_regs *regs)
handled = intel_pmu_drain_bts_buffer();
handled += intel_bts_interrupt();
status = intel_pmu_get_status();
+ orig_status = status;
if (!status)
goto done;

loops = 0;
again:
intel_pmu_lbr_read();
- intel_pmu_ack_status(status);
+ if (!x86_pmu.status_ack_after_apic)
+ __intel_pmu_enable_all(0, true);
+
if (++loops > 100) {
static bool warned = false;
if (!warned) {
@@ -1877,15 +1881,20 @@ again:
x86_pmu_stop(event, 0);
}

- /*
- * Repeat if there is more work to be done:
- */
- status = intel_pmu_get_status();
- if (status)
- goto again;
+
+ if (!x86_pmu.status_ack_after_apic) {
+ /*
+ * Repeat if there is more work to be done:
+ */
+ status = intel_pmu_get_status();
+ if (status)
+ goto again;
+ }

done:
- __intel_pmu_enable_all(0, true);
+ if (!x86_pmu.status_ack_after_apic)
+ __intel_pmu_enable_all(0, true);
+
/*
* Only unmask the NMI after the overflow counters
* have been reset. This avoids spurious NMIs on
@@ -1893,6 +1902,15 @@ done:
*/
if (x86_pmu.late_ack)
apic_write(APIC_LVTPC, APIC_DM_NMI);
+
+ /*
+ * Ack the PMU late. This avoids bogus freezing
+ * on Skylake CPUs.
+ */
+ if (x86_pmu.status_ack_after_apic) {
+ intel_pmu_ack_status(orig_status);
+ __intel_pmu_enable_all(0, true);
+ }
return handled;
}

@@ -3514,6 +3532,7 @@ __init int intel_pmu_init(void)
case 78: /* 14nm Skylake Mobile */
case 94: /* 14nm Skylake Desktop */
x86_pmu.late_ack = true;
+ x86_pmu.status_ack_after_apic = true;
memcpy(hw_cache_event_ids, skl_hw_cache_event_ids, sizeof(hw_cache_event_ids));
memcpy(hw_cache_extra_regs, skl_hw_cache_extra_regs, sizeof(hw_cache_extra_regs));
intel_pmu_lbr_init_skl();
--
2.4.3

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Next message: Andi Kleen: "[PATCH 2/4] x86, perf: Factor out BTS enable/disable functions"
Previous message: Doug Anderson: "Re: [PATCH v2] usb: dwc2: host: Fix use after free w/ simultaneous irqs"
In reply to: Andi Kleen: "[PATCH 4/4] x86, perf: Use INST_RETIRED.PREC_DIST for cycles:pp on Skylake"
Next in thread: Andi Kleen: "[PATCH 2/4] x86, perf: Factor out BTS enable/disable functions"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]