Re: [RFC PATCH 1/2] x86/perf/amd: Resolve race condition when disabling PMC
From: Lendacky, Thomas
Date: Fri Mar 15 2019 - 10:06:57 EST
On 3/15/19 5:50 AM, Peter Zijlstra wrote:
> On Mon, Mar 11, 2019 at 04:48:44PM +0000, Lendacky, Thomas wrote:
>> On AMD processors, the detection of an overflowed counter in the NMI
>> handler relies on the current value of the counter. So, for example, to
>> check for overflow on a 48 bit counter, bit 47 is checked to see if it
>> is 1 (not overflowed) or 0 (overflowed).
>>
>> There is currently a race condition present when disabling and then
>> updating the PMC. Increased NMI latency in newer AMD processors makes this
>> race condition more pronounced.
>
> Increased NMI latency also makes the results less useful :/ What amount
> of skid are we talking about, and is there anything AMD is going to do
> about this?
I haven't looked into the amount of skid, but, yes, the hardware team is
looking at this.
>
>> If the counter value has overflowed, it is
>> possible to update the PMC value before the NMI handler can run.
>
> Arguably the WRMSR should sync against the PMI. That is the beahviour
> one would expect.
>
> Isn't that something you can fix in ucode? And could you very please
> tell the hardware people this is disguisting?
Currently, there's nothing they've found that can be done in ucode for
this. But, yes, they know it's a problem and they're looking at what they
can do.
>
>> The updated PMC value is not an overflowed value, so when the perf NMI
>> handler does run, it will not find an overflowed counter. This may
>> appear as an unknown NMI resulting in either a panic or a series of
>> messages, depending on how the kernel is configured.
>>
>> To eliminate this race condition, the PMC value must be checked after
>> disabling the counter in x86_pmu_disable_all(), and, if overflowed, must
>> wait for the NMI handler to reset the value before continuing. Add a new,
>> optional, callable function that can be used to test for and resolve this
>> condition.
>>
>> Cc: <stable@xxxxxxxxxxxxxxx> # 4.14.x-
>> Signed-off-by: Tom Lendacky <thomas.lendacky@xxxxxxx>
>
>> +static void amd_pmu_wait_on_overflow(int idx, u64 config)
>> +{
>> + unsigned int i;
>> + u64 counter;
>> +
>> + /*
>> + * We shouldn't be calling this from NMI context, but add a safeguard
>> + * here to return, since if we're in NMI context we can't wait for an
>> + * NMI to reset an overflowed counter value.
>> + */
>> + if (in_nmi())
>> + return;
>> +
>> + /*
>> + * If the interrupt isn't enabled then we won't get the NMI that will
>> + * reset the overflow condition, so return.
>> + */
>> + if (!(config & ARCH_PERFMON_EVENTSEL_INT))
>> + return;
>> +
>> + /*
>> + * Wait for the counter to be reset if it has overflowed. This loop
>> + * should exit very, very quickly, but just in case, don't wait
>> + * forever...
>> + */
>> + for (i = 0; i < OVERFLOW_WAIT_COUNT; i++) {
>> + rdmsrl(x86_pmu_event_addr(idx), counter);
>> + if (counter & (1ULL << (x86_pmu.cntval_bits - 1)))
>> + break;
>> +
>> + /* Might be in IRQ context, so can't sleep */
>> + udelay(1);
>> + }
>> +}
>
> Argh.. that's horrible, as I'm sure you fully appreciate :/
Yeah...
>
>> diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
>> index b684f0294f35..f1d2f70000cd 100644
>> --- a/arch/x86/events/core.c
>> +++ b/arch/x86/events/core.c
>> @@ -606,6 +606,9 @@ void x86_pmu_disable_all(void)
>> continue;
>> val &= ~ARCH_PERFMON_EVENTSEL_ENABLE;
>> wrmsrl(x86_pmu_config_addr(idx), val);
>> +
>> + if (x86_pmu.wait_on_overflow)
>> + x86_pmu.wait_on_overflow(idx, val);
>> }
>> }
>
> One alternative is adding amd_pmu_disable_all() to amd/core.c and using
> that. Then you can also change the loop to do the wait after all the
> WRMSRs, if that helps with latency.
I thought about that for both this and the next patch. But since it would
be duplicating all the code I went with the added callbacks. It might be
worth it for this patch to have an AMD disable_all callback since it's
not a lot of code to duplicate compared to handle_irq and I like the idea
of doing the overflow check after all the WRMSRs.
Thanks for the review, Peter.
Tom
>