Re: [PATCH] perf/x86: fix event counter update issue
From: Peter Zijlstra
Date: Tue Nov 29 2016 - 04:25:37 EST
So caveat that I'm ill and cannot think much..
On Mon, Nov 28, 2016 at 11:26:46AM -0800, kan.liang@xxxxxxxxx wrote:
> Here, all the possible failure cases are listed.
> Terms:
> - new: current PMU counter value which read from rdpmcl.
> - prev: previous counter value which is stored in &hwc->prev_count.
> - in PMI/not in PMI: the event update happens in PMI handler or not.
> Current code to calculate delta:
> delta = (new << shift) - (prev << shift);
> delta >>= shift;
>
> Case A: Not in PMI. new > prev. But delta is negative.
> That's the failure case of Test 2.
> delta is s64 type. new and prev are u64 type. If the new is big
> enough, after doing left shift and sub, the bit 64 of the result may
> still be 1.
> After converting to s64, the sign flag will be set. Since delta is
> s64 type, arithmetic right shift is applied, which copy the sign flag
> into empty bit positions on the upper end of delta.
> It can be fixed by adding the max count value.
>
> Here is the real data for test2 on KNL.
> new = aea96e1927
> prev = ffffff0000000001
> delta = aea96e1927000000 - 1000000 = aea96e1926000000
> aea96e1926000000 >> 24 = ffffffaea96e1926 << negative delta
How can this happen? IIRC the thing increments, we program a negative
value, and when it passes 0 we generate a PMI.
And note that we _ALWAYS_ set the IN bits, even for !sampling events.
Also note we set max_period to (1<<31) - 1, so we should never exceed 31
bits.
> Case B: In PMI. new > prev. delta is positive.
> That's the failure case of Test 3.
> The PMI is triggered by overflow. But there is latency between
> overflow and PMI handler. So new has small amount.
> Current calculation lose the max count value.
That doesn't make sense, per the 31bit limit.
> Case C: In PMI. new < prev. delta is negative.
> The PMU counter may be start from a big value. E.g. the fixed period
> is small.
> It can be fixed by adding the max count value.
Doesn't make sense, how can this happen?