Re: [RFC PATCH 2/2] x86/perf/amd: Resolve NMI latency issues when multiple PMCs are active
From: Peter Zijlstra
Date: Fri Mar 15 2019 - 08:03:25 EST
On Mon, Mar 11, 2019 at 04:48:51PM +0000, Lendacky, Thomas wrote:
> @@ -467,6 +470,45 @@ static void amd_pmu_wait_on_overflow(int idx, u64 config)
> 	}
> }
>
> +/*
> + * Because of NMI latency, if multiple PMC counters are active we need to take
> + * into account that multiple PMC overflows can generate multiple NMIs but be
> + * handled by a single invocation of the NMI handler (think PMC overflow while
> + * in the NMI handler). This could result in subsequent unknown NMI messages
> + * being issued.
> + *
> + * Attempt to mitigate this by using the number of active PMCs to determine
> + * whether to return NMI_HANDLED if the perf NMI handler did not handle/reset
> + * any PMCs. The per-CPU perf_nmi_counter variable is set to a minimum of one
> + * less than the number of active PMCs or 2. The value of 2 is used in case the
> + * NMI does not arrive at the APIC in time to be collapsed into an already
> + * pending NMI.

LAPIC I really do hope?!

> + */
> +static int amd_pmu_mitigate_nmi_latency(unsigned int active, int handled)
> +{
> +	/* If multiple counters are not active return original handled count */
> +	if (active <= 1)
> +		return handled;

Should we not reset perf_nmi_counter in this case?

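To make the question concrete, here is a minimal userspace model (not kernel code) of the quoted helper with a reset added on the `active <= 1` path. The plain global standing in for the per-CPU perf_nmi_counter, the min_uint() helper and the model_ names are assumptions made for illustration; only the control flow follows the quoted patch.

```c
/* Userspace model only; this is not kernel code. */
#define NMI_DONE	0
#define NMI_HANDLED	1

/* Stand-in for the per-CPU perf_nmi_counter (a plain global in this model). */
static unsigned int model_nmi_counter;

/* Hypothetical helper standing in for the kernel's min_t(unsigned int, ...). */
static unsigned int min_uint(unsigned int a, unsigned int b)
{
	return a < b ? a : b;
}

static int model_mitigate_nmi_latency(unsigned int active, int handled)
{
	if (active <= 1) {
		/*
		 * Assumed change, not in the quoted patch: drop any budget
		 * left over from a period when several PMCs were active.
		 */
		model_nmi_counter = 0;
		return handled;
	}

	/* A counter was handled: record how many late NMIs may still arrive. */
	if (handled) {
		model_nmi_counter = min_uint(2, active - 1);
		return handled;
	}

	/* Nothing handled and no budget left: let it be reported as unknown. */
	if (!model_nmi_counter)
		return NMI_DONE;

	/* Nothing handled, but a late NMI was expected: claim it. */
	model_nmi_counter--;
	return NMI_HANDLED;
}
```

In this model, a handled NMI with three PMCs active sets the budget to 2. If the number of active PMCs then drops to one, the reset clears whatever budget is left, so a later unhandled NMI returns NMI_DONE instead of being claimed against a stale count.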
> +
> +	/*
> +	 * If a counter was handled, record the number of possible remaining
> +	 * NMIs that can occur.
> +	 */
> +	if (handled) {
> +		this_cpu_write(perf_nmi_counter,
> +			       min_t(unsigned int, 2, active - 1));
> +
> +		return handled;
> +	}
> +
> +	if (!this_cpu_read(perf_nmi_counter))
> +		return NMI_DONE;
> +
> +	this_cpu_dec(perf_nmi_counter);
> +
> +	return NMI_HANDLED;
> +}
> +
> static struct event_constraint *
> amd_get_event_constraints(struct cpu_hw_events *cpuc, int idx,
> 			  struct perf_event *event)
> @@ -689,6 +731,7 @@ static __initconst const struct x86_pmu amd_pmu = {
>
> 	.amd_nb_constraints	= 1,
> 	.wait_on_overflow	= amd_pmu_wait_on_overflow,
> +	.mitigate_nmi_latency	= amd_pmu_mitigate_nmi_latency,
> };

Again, you could just add an amd_pmu_handle_irq() wrapper and avoid the
extra callback.
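A rough userspace model of that shape, purely for illustration: an AMD-specific handler wraps the generic one and applies the mitigation itself, so the PMU description needs only its handle_irq hook. Every model_ name below is hypothetical, and the globals stand in for per-CPU state; the mitigation logic is copied from the quoted patch.

```c
#include <stddef.h>

/* Userspace model only; this is not kernel code. */
#define NMI_DONE	0
#define NMI_HANDLED	1

struct pt_regs;				/* opaque in this model */

static unsigned int model_active;	/* number of active PMCs (assumed input) */
static int model_handled;		/* what the generic handler will report */
static unsigned int model_nmi_counter;	/* stand-in for perf_nmi_counter */

/* Hypothetical stand-in for the generic x86 PMU interrupt handler. */
static int model_generic_handle_irq(struct pt_regs *regs)
{
	(void)regs;
	return model_handled;
}

/* AMD wrapper: run the generic handler, then apply the latency mitigation. */
static int model_amd_handle_irq(struct pt_regs *regs)
{
	int handled = model_generic_handle_irq(regs);

	if (model_active <= 1)
		return handled;

	if (handled) {
		/* Equivalent of min_t(unsigned int, 2, active - 1). */
		model_nmi_counter = model_active - 1 < 2 ? model_active - 1 : 2;
		return handled;
	}

	if (!model_nmi_counter)
		return NMI_DONE;

	model_nmi_counter--;
	return NMI_HANDLED;
}

/* The PMU description then only needs the existing handle_irq hook. */
struct model_pmu {
	int (*handle_irq)(struct pt_regs *regs);
};

static const struct model_pmu model_amd_pmu = {
	.handle_irq	= model_amd_handle_irq,
};
```

Callers keep dispatching through handle_irq exactly as before; the vendor-specific behaviour lives in the wrapper instead of behind a second function pointer.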

Anyway, we already have code to deal with spurious NMIs from AMD; see
commit:

  63e6be6d98e1 ("perf, x86: Catch spurious interrupts after disabling counters")

That looks to be doing very much the same thing. Why then do you still
need this on top?
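
For reference, a minimal userspace model of the mechanism in that commit as I read it: each counter gets a "running" bit when it is started, and an NMI that finds the counter no longer active consumes that bit and counts as handled. The struct layout, the fixed counter count and all model_ names are assumptions for illustration, not the kernel's actual definitions.

```c
/* Userspace model only; this is not kernel code. */
#define MODEL_NUM_COUNTERS	4	/* assumed counter count for the model */

struct model_cpu_events {
	unsigned long active_mask;	/* counters currently enabled */
	unsigned long running;		/* counters started since last claimed */
};

static void model_start(struct model_cpu_events *c, int idx)
{
	c->active_mask |= 1UL << idx;
	c->running |= 1UL << idx;
}

static void model_stop(struct model_cpu_events *c, int idx)
{
	/* Only the active bit is cleared; the running bit stays set. */
	c->active_mask &= ~(1UL << idx);
}

/* 'overflowed' is a bitmask of active counters that actually overflowed. */
static int model_handle_irq(struct model_cpu_events *c, unsigned long overflowed)
{
	int handled = 0;
	int idx;

	for (idx = 0; idx < MODEL_NUM_COUNTERS; idx++) {
		unsigned long bit = 1UL << idx;

		if (!(c->active_mask & bit)) {
			/* Counter was stopped: claim one in-flight NMI for it. */
			if (c->running & bit) {
				c->running &= ~bit;
				handled++;
			}
			continue;
		}

		if (overflowed & bit)
			handled++;
	}

	return handled;
}
```

In this model a counter that is started and then stopped lets exactly one later NMI be claimed as handled; a second one with no overflow is reported as unhandled.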