Re: [PATCH] perf/x86: fix event counter update issue

From: Stephane Eranian
Date: Mon Nov 28 2016 - 15:18:52 EST


On Mon, Nov 28, 2016 at 11:59 AM, Liang, Kan <kan.liang@xxxxxxxxx> wrote:
>
>
>> >
>> > Here, all the possible failure cases are listed.
>> > Terms:
>> > - new: the current PMU counter value, read via rdpmcl().
>> > - prev: the previous counter value, stored in &hwc->prev_count.
>> > - in PMI/not in PMI: whether or not the event update happens in the PMI handler.
>> > Current code to calculate delta:
>> > delta = (new << shift) - (prev << shift);
>> > delta >>= shift;
>> >
>> > Case A: Not in PMI. new > prev. But delta is negative.
>> > That's the failure case of Test 2.
>> > delta is of type s64; new and prev are u64. If new is big enough,
>> > bit 63 of the result may still be 1 after the left shifts and the
>> > subtraction.
>> > The result is then negative when converted to s64, and since delta
>> > is s64, the right shift is arithmetic: it copies the sign bit into
>> > the empty bit positions on the upper end of delta.
>> > It can be fixed by adding the max count value.
>> >
>> > Here is the real data for Test 2 on KNL.
>> > new = aea96e1927
>> > prev = ffffff0000000001
>>
>>
>> How can new be so large here?
>> I mean, between prev and new the counter wrapped around. And it took
>> that many occurrences (of cycles) to handle the overflow?
>
> There is no overflow in this case.
> The counter is 40 bits wide. The highest value which can be counted
> without overflow is 0xfffffffffe
>
>>
>> >
>> > delta = aea96e1927000000 - 1000000 = aea96e1926000000
>> > aea96e1926000000 >> 24 = ffffffaea96e1926 <-- negative delta
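
To make Case A concrete, here is a minimal user-space sketch (not part
of the patch; it just replays the numbers above, assuming the 40-bit
KNL counter width) showing both the sign extension and the effect of
adding the max count value:

#include <stdio.h>
#include <stdint.h>

int main(void)
{
	uint64_t prev = 0xffffff0000000001ULL;	/* stored prev_count */
	uint64_t new  = 0x000000aea96e1927ULL;	/* value read via rdpmcl() */
	int shift = 64 - 40;			/* 40-bit counters on KNL */
	int64_t delta;

	/* bit 63 of (new << shift) is set, so the subtraction result
	 * is negative as s64 and the right shift sign-extends it */
	delta = (new << shift) - (prev << shift);
	delta >>= shift;
	printf("broken delta = %016llx\n", (unsigned long long)delta);
	/* prints ffffffaea96e1926 */

	/* adding one full counter range (cntval_mask + 1 = 2^40)
	 * recovers the true delta */
	delta += 1LL << 40;
	printf("fixed delta  = %016llx\n", (unsigned long long)delta);
	/* prints 000000aea96e1926 */
	return 0;
}
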
>> >
>> > Case B: In PMI. new > prev. delta is positive.
>> > That's the failure case of Test 3.
>> > The PMI is triggered by an overflow, but there is latency between
>> > the overflow and the PMI handler, so new holds only a small value.
>> > The current calculation loses the max count value.
>> >
>> > Case C: In PMI. new < prev. delta is negative.
>> > The PMU counter may start from a big value, e.g. when the fixed
>> > period is small.
>> > It can be fixed by adding the max count value.
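
Cases A and C collapse into a single check. A minimal sketch of the
corrected computation (a hypothetical helper, not the patch itself;
cntval_bits stands in for x86_pmu.cntval_bits):

/* Covers Cases A and C: whenever the shifted subtraction goes
 * negative, either the counter wrapped or the sign bit leaked in,
 * so add one full counter range (cntval_mask + 1).  Case B still
 * needs to know whether we are in the overflow PMI. */
static int64_t compute_delta(uint64_t new, uint64_t prev, int cntval_bits)
{
	int shift = 64 - cntval_bits;
	int64_t delta;

	delta = ((int64_t)(new << shift) - (int64_t)(prev << shift)) >> shift;
	if (delta < 0)
		delta += 1LL << cntval_bits;

	return delta;
}
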
>> >
>> I am not sure I understand why PMI should matter here. What matters is
>> prev vs. current and the wrap-around.
>> Please explain.
>> Thanks.
>
> Right, PMI shouldn't matter here.
> We should add the max count value whenever delta is negative, whether
> or not we are in the PMI.
>
That sounds right, but then why do you have that boolean (bool pmi) in
your patch?
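
(For illustration, a user-space sketch with hypothetical numbers, not
from Lukasz's report, of why a bare "delta < 0" test cannot cover
Case B: the wrapped delta can still come out positive there.)

#include <stdio.h>
#include <stdint.h>

int main(void)
{
	/* a 40-bit counter programmed to 1 overflows at 2^40; the PMI
	 * latency lets it count up to 0x100 before the handler reads it */
	uint64_t prev = 0x0000000001ULL;
	uint64_t new  = 0x0000000100ULL;
	int shift = 64 - 40;
	int64_t delta;

	delta = ((int64_t)(new << shift) - (int64_t)(prev << shift)) >> shift;
	printf("computed delta = 0x%llx\n", (unsigned long long)delta);
	/* prints 0xff: positive, so "delta < 0" never fires, yet the
	 * true count is 2^40 - 1 + 0x100.  Only the knowledge that the
	 * read happens in the overflow PMI says to add cntval_mask + 1,
	 * which is what the bool pmi would be for. */
	return 0;
}
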

> To fix this issue, I once had a list which included all the possible cases.
> "non-PMI, new < prev, delta is negative" was one of them, but it rarely
> happens, so I removed it and forgot to update the description of Case C.
>
> Thanks,
> Kan
>
>>
>> >
>> > There is still a case in which delta could be wrong: when the
>> > event update happens in the gap between the overflow and the PMI.
>> > It's not in the PMI, new > prev, and delta is positive. The current
>> > calculation may lose the max count value.
>> > But this case rarely happens, so it is not handled here.
>> >
>> > Reported-by: Lukasz Odzioba <lukasz.odzioba@xxxxxxxxx>
>> > Signed-off-by: Kan Liang <kan.liang@xxxxxxxxx>
>> > Tested-by: Lukasz Odzioba <lukasz.odzioba@xxxxxxxxx>
>> > ---
>> > arch/x86/events/core.c | 23 +++++++++++++++++++----
>> > arch/x86/events/intel/core.c | 4 ++--
>> > arch/x86/events/intel/p4.c | 2 +-
>> > arch/x86/events/perf_event.h | 2 +-
>> > 4 files changed, 23 insertions(+), 8 deletions(-)
>> >
>> > diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
>> > index 6c3b0ef..c351ac4 100644
>> > --- a/arch/x86/events/core.c
>> > +++ b/arch/x86/events/core.c
>> > @@ -63,7 +63,7 @@ u64 __read_mostly hw_cache_extra_regs
>> > * Can only be executed on the CPU where the event is active.
>> > * Returns the delta events processed.
>> > */
>> > -u64 x86_perf_event_update(struct perf_event *event)
>> > +u64 x86_perf_event_update(struct perf_event *event, bool pmi)
>> > {
>> > struct hw_perf_event *hwc = &event->hw;
>> > int shift = 64 - x86_pmu.cntval_bits;
>> > @@ -100,6 +100,21 @@ u64 x86_perf_event_update(struct perf_event *event)
>> > delta = (new_raw_count << shift) - (prev_raw_count << shift);
>> > delta >>= shift;
>> >
>> > + /*
>> > + * Correct delta for special cases caused by a late PMI handler.
>> > + * Case 1: the PMU counter may start from a big value.
>> > + * In the PMI, new < prev, so delta is negative.
>> > + * Case 2: new is big enough to set the sign bit.
>> > + * delta will be negative after the arithmetic right shift.
>> > + * Case 3: in the PMI, new > prev.
>> > + * new - prev loses the max count value.
>> > + *
>> > + * An event update may also happen in the gap between the
>> > + * overflow and the PMI, but that rare case is not handled here.
>> > + */
>> > + if (((delta > 0) && pmi) || (delta < 0))
>> > + delta += x86_pmu.cntval_mask + 1;
>> > +
>> > local64_add(delta, &event->count);
>> > local64_sub(delta, &hwc->period_left);
>> >
>> > @@ -1342,7 +1357,7 @@ void x86_pmu_stop(struct perf_event *event, int flags)
>> > * Drain the remaining delta count out of an event
>> > * that we are disabling:
>> > */
>> > - x86_perf_event_update(event);
>> > + x86_perf_event_update(event, (flags == 0));
>> > hwc->state |= PERF_HES_UPTODATE;
>> > }
>> > }
>> > @@ -1446,7 +1461,7 @@ int x86_pmu_handle_irq(struct pt_regs *regs)
>> >
>> > event = cpuc->events[idx];
>> >
>> > - val = x86_perf_event_update(event);
>> > + val = x86_perf_event_update(event, true);
>> > if (val & (1ULL << (x86_pmu.cntval_bits - 1)))
>> > continue;
>> >
>> > @@ -1867,7 +1882,7 @@ early_initcall(init_hw_perf_events);
>> >
>> > static inline void x86_pmu_read(struct perf_event *event)
>> > {
>> > - x86_perf_event_update(event);
>> > + x86_perf_event_update(event, false);
>> > }
>> >
>> > /*
>> > diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
>> > index a74a2db..69d65e6 100644
>> > --- a/arch/x86/events/intel/core.c
>> > +++ b/arch/x86/events/intel/core.c
>> > @@ -1830,7 +1830,7 @@ static void intel_pmu_nhm_workaround(void)
>> > for (i = 0; i < 4; i++) {
>> > event = cpuc->events[i];
>> > if (event)
>> > - x86_perf_event_update(event);
>> > + x86_perf_event_update(event, false);
>> > }
>> >
>> > for (i = 0; i < 4; i++) {
>> > @@ -2002,7 +2002,7 @@ static void intel_pmu_add_event(struct perf_event *event)
>> > */
>> > int intel_pmu_save_and_restart(struct perf_event *event)
>> > {
>> > - x86_perf_event_update(event);
>> > + x86_perf_event_update(event, true);
>> > /*
>> > * For a checkpointed counter always reset back to 0. This
>> > * avoids a situation where the counter overflows, aborts the
>> > diff --git a/arch/x86/events/intel/p4.c b/arch/x86/events/intel/p4.c
>> > index eb05335..8117de8 100644
>> > --- a/arch/x86/events/intel/p4.c
>> > +++ b/arch/x86/events/intel/p4.c
>> > @@ -1024,7 +1024,7 @@ static int p4_pmu_handle_irq(struct pt_regs *regs)
>> > /* it might be unflagged overflow */
>> > overflow = p4_pmu_clear_cccr_ovf(hwc);
>> >
>> > - val = x86_perf_event_update(event);
>> > + val = x86_perf_event_update(event, true);
>> > if (!overflow && (val & (1ULL << (x86_pmu.cntval_bits - 1))))
>> > continue;
>> >
>> > diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
>> > index c6b25ac..09c9db2 100644
>> > --- a/arch/x86/events/perf_event.h
>> > +++ b/arch/x86/events/perf_event.h
>> > @@ -712,7 +712,7 @@ extern u64 __read_mostly hw_cache_extra_regs
>> > [PERF_COUNT_HW_CACHE_OP_MAX]
>> > [PERF_COUNT_HW_CACHE_RESULT_MAX];
>> >
>> > -u64 x86_perf_event_update(struct perf_event *event);
>> > +u64 x86_perf_event_update(struct perf_event *event, bool pmi);
>> >
>> > static inline unsigned int x86_pmu_config_addr(int index)
>> > {
>> > --
>> > 2.4.3
>> >