Re: [PATCH] perf/x86: fix event counter update issue

From: Stephane Eranian
Date: Mon Nov 28 2016 - 14:41:24 EST


On Mon, Nov 28, 2016 at 11:26 AM, <kan.liang@xxxxxxxxx> wrote:
>
> From: Kan Liang <kan.liang@xxxxxxxxx>
>
> Fixes: ec3232bdf851 ("perf_counter: x86: More accurate counter update")
>
> The bug can easily be reproduced with perf stat and a loop program on
> KNL/SLM. The loop program is simply "for (i = 0; i < loops; i++);",
> with "loops" set to a different value for each test.
> Test 1:
> perf stat -x, -C1 -e cycles -- taskset 0x2 ./loop 60000000000
> 451739576564,,cycles,313694125185,100.00
> Test 2:
> perf stat -x, -C1 -e cycles -- taskset 0x2 ./loop 100000000000
> 18446743727079256064,,cycles,313694125185,100.00
> Test 3:
> perf stat -x, -C1 -e cycles -- taskset 0x2 ./loop 200000000000
> 406240041973,,cycles,313694125185,100.00
> Only test 1 produces the correct result. The result of test 2 is far
> too big; the result of test 3 is too small.
>
> The bug shows up more frequently on KNL/SLM due to their narrower
> counters, but it should be a generic issue on all platforms.
>
> The issue is caused by x86_perf_event_update() miscalculating the
> counter increment (delta) in some cases.
>
> All the possible failure cases are listed here.
> Terms:
> - new: the current PMU counter value, as read via rdpmcl().
> - prev: the previous counter value, stored in &hwc->prev_count.
> - in PMI / not in PMI: whether the event update happens inside the
>   PMI handler.
> Current code to calculate delta:
> delta = (new << shift) - (prev << shift);
> delta >>= shift;
>
> Case A: Not in PMI. new > prev, but delta is negative.
> This is the failure case of Test 2.
> delta is of type s64; new and prev are u64. If new is big enough,
> bit 63 of the result may still be set after the left shift and the
> subtraction, so the value is negative when interpreted as s64. Since
> delta is s64, an arithmetic right shift is applied, which copies the
> sign bit into the vacated upper bit positions of delta.
> It can be fixed by adding the max count value (cntval_mask + 1).
>
> Here is the real data for test2 on KNL.
> new = aea96e1927
> prev = ffffff0000000001


How can new be so large here?
I mean, between prev and new the counter wrapped around, and it took
that many occurrences (of cycles) to handle the overflow?

>
> delta = aea96e1927000000 - 1000000 = aea96e1926000000
> aea96e1926000000 >> 24 = ffffffaea96e1926 << negative delta
>
> Case B: In PMI. new > prev. delta is positive.
> This is the failure case of Test 3.
> The PMI is triggered by an overflow, but there is latency between
> the overflow and the PMI handler, so new holds only a small value.
> The current calculation loses the max count value.
>
> Case C: In PMI. new < prev. delta is negative.
> The PMU counter may start from a big value, e.g. when the fixed
> period is small.
> It can be fixed by adding the max count value.
>
I am not sure I understand why PMI should matter here. What matters is
prev vs. current and the wrap-around.
Please explain.
Thanks.

>
> There is one more case in which delta could be wrong: an event update
> that happens in the gap between the overflow and the PMI, i.e. not in
> the PMI, with new > prev and delta positive. The current calculation
> may lose the max count value. But this case rarely happens, so it is
> not handled here.
>
> Reported-by: Lukasz Odzioba <lukasz.odzioba@xxxxxxxxx>
> Signed-off-by: Kan Liang <kan.liang@xxxxxxxxx>
> Tested-by: Lukasz Odzioba <lukasz.odzioba@xxxxxxxxx>
> ---
> arch/x86/events/core.c | 23 +++++++++++++++++++----
> arch/x86/events/intel/core.c | 4 ++--
> arch/x86/events/intel/p4.c | 2 +-
> arch/x86/events/perf_event.h | 2 +-
> 4 files changed, 23 insertions(+), 8 deletions(-)
>
> diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
> index 6c3b0ef..c351ac4 100644
> --- a/arch/x86/events/core.c
> +++ b/arch/x86/events/core.c
> @@ -63,7 +63,7 @@ u64 __read_mostly hw_cache_extra_regs
> * Can only be executed on the CPU where the event is active.
> * Returns the delta events processed.
> */
> -u64 x86_perf_event_update(struct perf_event *event)
> +u64 x86_perf_event_update(struct perf_event *event, bool pmi)
> {
> struct hw_perf_event *hwc = &event->hw;
> int shift = 64 - x86_pmu.cntval_bits;
> @@ -100,6 +100,21 @@ u64 x86_perf_event_update(struct perf_event *event)
> delta = (new_raw_count << shift) - (prev_raw_count << shift);
> delta >>= shift;
>
> + /*
> + * Correct the delta for special cases caused by a late PMI handler.
> + * Case 1: The PMU counter may start from a big value.
> + * In the PMI, new < prev and delta is negative.
> + * Case 2: new is big enough to affect the sign bit.
> + * delta becomes negative after the arithmetic right shift.
> + * Case 3: In the PMI, new > prev.
> + * new - prev loses the max count value.
> + *
> + * An event update may also happen in the gap between the overflow
> + * and the PMI, but that rarely happens, so it is not handled here.
> + */
> + if (((delta > 0) && pmi) || (delta < 0))
> + delta += x86_pmu.cntval_mask + 1;
> +
> local64_add(delta, &event->count);
> local64_sub(delta, &hwc->period_left);
>
> @@ -1342,7 +1357,7 @@ void x86_pmu_stop(struct perf_event *event, int flags)
> * Drain the remaining delta count out of a event
> * that we are disabling:
> */
> - x86_perf_event_update(event);
> + x86_perf_event_update(event, (flags == 0));
> hwc->state |= PERF_HES_UPTODATE;
> }
> }
> @@ -1446,7 +1461,7 @@ int x86_pmu_handle_irq(struct pt_regs *regs)
>
> event = cpuc->events[idx];
>
> - val = x86_perf_event_update(event);
> + val = x86_perf_event_update(event, true);
> if (val & (1ULL << (x86_pmu.cntval_bits - 1)))
> continue;
>
> @@ -1867,7 +1882,7 @@ early_initcall(init_hw_perf_events);
>
> static inline void x86_pmu_read(struct perf_event *event)
> {
> - x86_perf_event_update(event);
> + x86_perf_event_update(event, false);
> }
>
> /*
> diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
> index a74a2db..69d65e6 100644
> --- a/arch/x86/events/intel/core.c
> +++ b/arch/x86/events/intel/core.c
> @@ -1830,7 +1830,7 @@ static void intel_pmu_nhm_workaround(void)
> for (i = 0; i < 4; i++) {
> event = cpuc->events[i];
> if (event)
> - x86_perf_event_update(event);
> + x86_perf_event_update(event, false);
> }
>
> for (i = 0; i < 4; i++) {
> @@ -2002,7 +2002,7 @@ static void intel_pmu_add_event(struct perf_event *event)
> */
> int intel_pmu_save_and_restart(struct perf_event *event)
> {
> - x86_perf_event_update(event);
> + x86_perf_event_update(event, true);
> /*
> * For a checkpointed counter always reset back to 0. This
> * avoids a situation where the counter overflows, aborts the
> diff --git a/arch/x86/events/intel/p4.c b/arch/x86/events/intel/p4.c
> index eb05335..8117de8 100644
> --- a/arch/x86/events/intel/p4.c
> +++ b/arch/x86/events/intel/p4.c
> @@ -1024,7 +1024,7 @@ static int p4_pmu_handle_irq(struct pt_regs *regs)
> /* it might be unflagged overflow */
> overflow = p4_pmu_clear_cccr_ovf(hwc);
>
> - val = x86_perf_event_update(event);
> + val = x86_perf_event_update(event, true);
> if (!overflow && (val & (1ULL << (x86_pmu.cntval_bits - 1))))
> continue;
>
> diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
> index c6b25ac..09c9db2 100644
> --- a/arch/x86/events/perf_event.h
> +++ b/arch/x86/events/perf_event.h
> @@ -712,7 +712,7 @@ extern u64 __read_mostly hw_cache_extra_regs
> [PERF_COUNT_HW_CACHE_OP_MAX]
> [PERF_COUNT_HW_CACHE_RESULT_MAX];
>
> -u64 x86_perf_event_update(struct perf_event *event);
> +u64 x86_perf_event_update(struct perf_event *event, bool pmi);
>
> static inline unsigned int x86_pmu_config_addr(int index)
> {
> --
> 2.4.3
>