Re: [RFC PATCH 2/2] x86/perf/amd: Resolve NMI latency issues when multiple PMCs are active

From: Peter Zijlstra
Date: Fri Mar 15 2019 - 08:03:25 EST

Next message: Jeremy Sowden: "[PATCH net] vti4: removed duplicate log message."
Previous message: Heikki Krogerus: "Re: [PATCH v2 4/6] usb: roles: add API to get usb_role_switch by node"
In reply to: Lendacky, Thomas: "[RFC PATCH 2/2] x86/perf/amd: Resolve NMI latency issues when multiple PMCs are active"
Next in thread: Lendacky, Thomas: "Re: [RFC PATCH 2/2] x86/perf/amd: Resolve NMI latency issues when multiple PMCs are active"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On Mon, Mar 11, 2019 at 04:48:51PM +0000, Lendacky, Thomas wrote:
> @@ -467,6 +470,45 @@ static void amd_pmu_wait_on_overflow(int idx, u64 config)
> }
> }
>
> +/*
> + * Because of NMI latency, if multiple PMC counters are active we need to take
> + * into account that multiple PMC overflows can generate multiple NMIs but be
> + * handled by a single invocation of the NMI handler (think PMC overflow while
> + * in the NMI handler). This could result in subsequent unknown NMI messages
> + * being issued.
> + *
> + * Attempt to mitigate this by using the number of active PMCs to determine
> + * whether to return NMI_HANDLED if the perf NMI handler did not handle/reset
> + * any PMCs. The per-CPU perf_nmi_counter variable is set to a minimum of one
> + * less than the number of active PMCs or 2. The value of 2 is used in case the
> + * NMI does not arrive at the APIC in time to be collapsed into an already
> + * pending NMI.

LAPIC I really do hope?!

> + */
> +static int amd_pmu_mitigate_nmi_latency(unsigned int active, int handled)
> +{
> + /* If multiple counters are not active return original handled count */
> + if (active <= 1)
> + return handled;

Should we not reset perf_nmi_counter in this case?

> +
> + /*
> + * If a counter was handled, record the number of possible remaining
> + * NMIs that can occur.
> + */
> + if (handled) {
> + this_cpu_write(perf_nmi_counter,
> + min_t(unsigned int, 2, active - 1));
> +
> + return handled;
> + }
> +
> + if (!this_cpu_read(perf_nmi_counter))
> + return NMI_DONE;
> +
> + this_cpu_dec(perf_nmi_counter);
> +
> + return NMI_HANDLED;
> +}
> +
> static struct event_constraint *
> amd_get_event_constraints(struct cpu_hw_events *cpuc, int idx,
> struct perf_event *event)
> @@ -689,6 +731,7 @@ static __initconst const struct x86_pmu amd_pmu = {
>
> .amd_nb_constraints = 1,
> .wait_on_overflow = amd_pmu_wait_on_overflow,
> + .mitigate_nmi_latency = amd_pmu_mitigate_nmi_latency,
> };

Again, you could just do amd_pmu_handle_irq() and avoid an extra
callback.

Anyway, we already had code to deal with spurious NMIs from AMD; see
commit:

63e6be6d98e1 ("perf, x86: Catch spurious interrupts after disabling counters")

And that looks to be doing something very much the same. Why then do you
still need this on top?

Next message: Jeremy Sowden: "[PATCH net] vti4: removed duplicate log message."
Previous message: Heikki Krogerus: "Re: [PATCH v2 4/6] usb: roles: add API to get usb_role_switch by node"
In reply to: Lendacky, Thomas: "[RFC PATCH 2/2] x86/perf/amd: Resolve NMI latency issues when multiple PMCs are active"
Next in thread: Lendacky, Thomas: "Re: [RFC PATCH 2/2] x86/perf/amd: Resolve NMI latency issues when multiple PMCs are active"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]