[PATCH V3 1/5] perf/x86/intel: fix event update for auto-reload

From: kan.liang
Date: Mon Jan 29 2018 - 11:30:45 EST


From: Kan Liang <kan.liang@xxxxxxxxxxxxxxx>

There is a bug when reading event->count via mmap with large PEBS
enabled. Here is an example:
#./read_count
0x71f0
0x122c0
0x1000000001c54
0x100000001257d
0x200000000bdc5
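
(For reference, a self-monitoring reader along the lines of read_count
typically follows the lock/index/offset protocol of
perf_event_mmap_page. The sketch below is illustrative only; the helper
name is made up, and the event is assumed to be already opened and
mmap'ed with rdpmc allowed.)

#include <stdint.h>
#include <linux/perf_event.h>
#include <x86intrin.h>

/* Illustrative only: read the current count of an mmap'ed event.
 * 'pc' points to the first page of the event's mmap area. */
static uint64_t mmap_read_count(volatile struct perf_event_mmap_page *pc)
{
	uint32_t seq, idx;
	uint64_t count;

	do {
		seq = pc->lock;
		__sync_synchronize();

		idx = pc->index;
		count = pc->offset;
		if (pc->cap_user_rdpmc && idx) {
			/* rdpmc takes index - 1; sign-extend the raw
			 * value to 64 bit using pmc_width */
			uint64_t pmc = __rdpmc(idx - 1);

			pmc <<= 64 - pc->pmc_width;
			pmc >>= 64 - pc->pmc_width;
			count += pmc;
		}

		__sync_synchronize();
	} while (pc->lock != seq);

	return count;
}

Both this path and a plain read() on the event fd report values derived
from event->count, which is why the missed auto-reload periods show up
in the output above.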

In fixed period mode, the auto-reload mechanism could be enabled for
PEBS events, but the calculation of event->count does not take the
auto-reload values into account. Anyone who reads event->count will get
the wrong result, e.g. x86_pmu_read(). The calculation of
hwc->period_left is wrong as well, which will impact the calculation of
the period for the first record when PEBS delivers multiple records.

The issue was introduced with the enabling of the auto-reload mechanism
in commit 851559e35fd5 ("perf/x86/intel: Use the PEBS auto reload
mechanism when possible").

Introduce intel_pmu_save_and_restart_reload() to calculate
event->count for the auto-reload case only.

There is a small gap between 'PEBS hardware is armed' and 'the NMI is
handled'. Because of the gap, the first record also needs to be specially
handled.
The formula to calculate the increments of event->count is as below:

  increments = the period for the first record +
               (reload_times - 1) * reload_val +
               the gap
- The 'period for the first record' is the period left over from the
  last PMI, which can be obtained from the previous event value.
- For the second and later records, the period is exactly the reload
  value; simply add (reload_times - 1) * reload_val.
- Because of auto-reload, the start point of counting is always
  (-reload_val), so the calculation of 'the gap' needs to be corrected
  by adding reload_val.
period_left needs the same adjustment as well. A worked example of the
arithmetic follows below.
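
(A minimal userspace sketch of the formula; the 48-bit counter width
and all values are assumptions chosen for illustration.)

#include <stdint.h>
#include <stdio.h>

int main(void)
{
	const int shift = 64 - 48;		/* assumed 48-bit counter */
	uint64_t reload_val = 0x10000;		/* fixed sample period */
	uint64_t prev_raw_count = -0x2345ULL;	/* period left from last PMI */
	uint64_t new_raw_count = -0xf000ULL;	/* raw count at drain_pebs() */
	int reload_times = 3;			/* records in the PEBS buffer */
	uint64_t delta, increments;

	/*
	 * delta = period for the first record + the gap
	 *       = (-prev_raw_count) + (new_raw_count + reload_val),
	 * using the usual shift trick to handle counters narrower
	 * than 64 bit.
	 */
	delta = (reload_val << shift) + (new_raw_count << shift) -
		(prev_raw_count << shift);
	delta >>= shift;

	/* each full reload in between contributes exactly reload_val */
	increments = delta + reload_val * (reload_times - 1);
	printf("increments = %#llx\n", (unsigned long long)increments);

	return 0;
}

This prints increments = 0x23345: 0x2345 left over from the last PMI,
two full 0x10000 periods, and a gap of 0x1000.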

There is nothing to do in x86_perf_event_set_period(), because the
period is fixed and period_left has already been adjusted.

Fixes: 851559e35fd5 ("perf/x86/intel: Use the PEBS auto reload mechanism when possible")
Signed-off-by: Kan Liang <kan.liang@xxxxxxxxxxxxxxx>
---
arch/x86/events/intel/ds.c | 69 ++++++++++++++++++++++++++++++++++++++++++++--
1 file changed, 67 insertions(+), 2 deletions(-)

diff --git a/arch/x86/events/intel/ds.c b/arch/x86/events/intel/ds.c
index 8156e47..6533426 100644
--- a/arch/x86/events/intel/ds.c
+++ b/arch/x86/events/intel/ds.c
@@ -1303,17 +1303,82 @@ get_next_pebs_record_by_bit(void *base, void *top, int bit)
return NULL;
}

+/*
+ * Specific intel_pmu_save_and_restart() for auto-reload.
+ * It is only called from drain_pebs().
+ */
+static int intel_pmu_save_and_restart_reload(struct perf_event *event,
+ int reload_times)
+{
+ struct hw_perf_event *hwc = &event->hw;
+ int shift = 64 - x86_pmu.cntval_bits;
+ u64 reload_val = hwc->sample_period;
+ u64 prev_raw_count, new_raw_count;
+ u64 delta;
+
+ WARN_ON((reload_times == 0) || (reload_val == 0));
+
+ /*
+ * drain_pebs() only happens when the PMU is disabled.
+ * It does not need to specially handle the previous event value.
+ * The hwc->prev_count will be updated in x86_perf_event_set_period().
+ */
+ prev_raw_count = local64_read(&hwc->prev_count);
+ rdpmcl(hwc->event_base_rdpmc, new_raw_count);
+
+ /*
+ * Now we have the new raw value and have updated the prev
+ * timestamp already. We can now calculate the elapsed delta
+ * (event-)time and add that to the generic event.
+ *
+ * Careful, not all hw sign-extends above the physical width
+ * of the count.
+ *
+ * There is a small gap between 'PEBS hardware is armed' and 'the NMI
+ * is handled'. Because of the gap, the first record also needs to be
+ * specially handled.
+ * The formula to calculate the increments of event->count is as below.
+ * The increments = the period for first record +
+ * (reload_times - 1) * reload_val +
+ * the gap
+ * 'The period for the first record' can be obtained from -prev_raw_count.
+ *
+ * 'The gap' = new_raw_count + reload_val. Because the start point of
+ * counting is always -reload_val for auto-reload.
+ *
+ * The period_left needs to do the same adjustment by adding
+ * reload_val.
+ */
+ delta = (reload_val << shift) + (new_raw_count << shift) -
+ (prev_raw_count << shift);
+ delta >>= shift;
+
+ local64_add(reload_val * (reload_times - 1), &event->count);
+ local64_add(delta, &event->count);
+ local64_sub(delta, &hwc->period_left);
+
+ return x86_perf_event_set_period(event);
+}
+
static void __intel_pmu_pebs_event(struct perf_event *event,
struct pt_regs *iregs,
void *base, void *top,
int bit, int count)
{
+ struct hw_perf_event *hwc = &event->hw;
struct perf_sample_data data;
struct pt_regs regs;
void *at = get_next_pebs_record_by_bit(base, top, bit);

- if (!intel_pmu_save_and_restart(event) &&
- !(event->hw.flags & PERF_X86_EVENT_AUTO_RELOAD))
+ if (hwc->flags & PERF_X86_EVENT_AUTO_RELOAD) {
+ /*
+ * Now, auto-reload is only enabled in fixed period mode.
+ * The reload value is always hwc->sample_period.
+ * This may need to change if auto-reload is enabled in
+ * freq mode later.
+ */
+ intel_pmu_save_and_restart_reload(event, count);
+ } else if (!intel_pmu_save_and_restart(event))
return;

while (count > 1) {
--
2.7.4