Re: [RFC PATCH v3 0/3] x86/perf/amd: AMD PMC counters and NMI latency

From: Lendacky, Thomas
Date: Tue Apr 02 2019 - 09:09:22 EST


On 4/2/19 8:03 AM, Peter Zijlstra wrote:
> On Mon, Apr 01, 2019 at 09:46:33PM +0000, Lendacky, Thomas wrote:
>> This patch series addresses issues with increased NMI latency in newer
>> AMD processors that can result in unknown NMI messages when PMC counters
>> are active.
>>
>> The following fixes are included in this series:
>>
>> - Resolve a race condition when disabling an overflowed PMC counter,
>> specifically when updating the PMC counter with a new value.
>> - Resolve handling of active PMC counter overflows in the perf NMI
>> handler and when to report that the NMI is not related to a PMC.
>> - Remove earlier workaround for spurious NMIs by re-ordering the
>> PMC stop sequence to disable the PMC first and then remove the PMC
>> bit from the active_mask bitmap. As part of disabling the PMC, the
>> code will wait for an overflow to be reset.
>>
>> The last patch re-works the order of when the PMC is removed from the
>> active_mask. There was a comment from a long time ago about having
>> to clear the bit in active_mask before disabling the counter because
>> the perf NMI handler could re-enable the PMC again. Looking at the
>> handler today, I don't see that as possible, hence the reordering. The
>> question will be whether the Intel PMC support will now have issues.
>> There is still support for using x86_pmu_handle_irq() in the Intel
>> core.c file. Did Intel have any issues with spurious NMIs in the past?
>> Peter Z, any thoughts on this?
>
> I can't remember :/ I suppose we'll see if anything pops up after these
> here patches. At least then we get a chance to properly document things.
>
>> Also, I couldn't completely get rid of the "running" bit because it
>> is used by arch/x86/events/intel/p4.c. An old commit message seems to
>> indicate that the P4 code suffered from the spurious interrupts:
>> 03e22198d237 ("perf, x86: Handle in flight NMIs on P4 platform").
>> So maybe that partially answers my previous question...
>
> Yeah, the P4 code is magic, and I don't have any such machines left, nor
> do I think Cyrill, who wrote much of that, does.
>
> I have vague memories of the P4 thing crashing with Vince's perf_fuzzer,
> but maybe I'm wrong.
>
> Ideally we'd find a willing victim to maintain that thing, or possibly
> just delete it, dunno if anybody still cares.
>
>
> Anyway, I like these patches, but I cannot apply them since you sent them
> base64 encoded and my script chokes on that.

Hmmm, I'm using stgit, and either it or our mail system is causing the
base64 encoding. It happened once before; let me re-send using straight
git. I'll remove the RFC on the re-send, too.

Thanks,
Tom