Re: [RESEND PATCH V2 1/4] perf/x86/intel: fix event update for auto-reload

From: Liang, Kan
Date: Wed Jan 24 2018 - 10:45:11 EST




On 1/24/2018 7:26 AM, Peter Zijlstra wrote:
On Mon, Jan 08, 2018 at 07:15:13AM -0800, kan.liang@xxxxxxxxx wrote:

The formula to calculate event->count is as below:

event->count = period left from last time +
               (reload_times - 1) * reload_val +
               latency of PMI handler

prev_count is the last observed hardware counter value. Just as in the
non-auto-reload case, its absolute value is the period of the first
record. It should not be updated on each reload, because software does
not 'observe' the hardware counter on each auto-reload.

For the second and later records, the period is exactly the reload
value. We only need to add (reload_times - 1) * reload_val to
event->count.

The calculation of the PMI handler latency is a little different from
the non-auto-reload case. Because the starting point is -reload_value,
it needs to be adjusted by adding reload_value.
period_left needs the same adjustment.

What's this about the PMI latency, we don't care about that in any other
situation, right? Sure the PMI takes a bit of time, but we're not
correcting for that anywhere, so why start now?

The latency is the gap between the PEBS hardware being armed and the
NMI being handled.

Sorry for the misleading description.
I will rewrite the description in V3.



There is nothing to do in x86_perf_event_set_period(), because the
period is fixed and period_left is already adjusted.

Fixes tag is missing.

Will add it in V3.


Signed-off-by: Kan Liang <kan.liang@xxxxxxxxxxxxxxx>
---
arch/x86/events/intel/ds.c | 69 ++++++++++++++++++++++++++++++++++++++++++++--
1 file changed, 67 insertions(+), 2 deletions(-)

diff --git a/arch/x86/events/intel/ds.c b/arch/x86/events/intel/ds.c
index 3674a4b..cc1f373 100644
--- a/arch/x86/events/intel/ds.c
+++ b/arch/x86/events/intel/ds.c
@@ -1251,17 +1251,82 @@ get_next_pebs_record_by_bit(void *base, void *top, int bit)
 		return NULL;
 }
+/*
+ * Specific intel_pmu_save_and_restart() for auto-reload.
+ */
+static int intel_pmu_save_and_restart_reload(struct perf_event *event,
+					     u64 reload_val,
+					     int reload_times)
+{
+	struct hw_perf_event *hwc = &event->hw;
+	int shift = 64 - x86_pmu.cntval_bits;
+	u64 prev_raw_count, new_raw_count;
+	u64 delta;
+
+	if ((reload_times == 0) || (reload_val == 0))
+		return intel_pmu_save_and_restart(event);

Like Jiri, I find this confusing at best. If we need to call that one,
you shouldn't have called this function to begin with.

At best, have a WARN here or something.


Will add a WARN in V3.

+
+	/*
+	 * Careful: an NMI might modify the previous event value.
+	 *
+	 * Our tactic to handle this is to first atomically read and
+	 * exchange a new raw count - then add that new-prev delta
+	 * count to the generic event atomically:
+	 */

For now this seems to only get called from *drain_pebs* which afaict
only happens when we've disabled the PMU (either from sched_task or
PMI).

Now, you want to put this in the pmu::read() path, and that does not
disable the PMU, but I don't think we can drain the PEBS buffer while
it's active; that's too full of races, so even there you'll have to
disable stuff.

So I don't think this is correct/desired for this case.


Could we do something like the below in the pmu::read() path? (not tested yet)

if (pebs_needs_sched_cb(cpuc)) {
	perf_pmu_disable(event->pmu);
	intel_pmu_drain_pebs_buffer();
	x86_perf_event_update(event);
	perf_pmu_enable(event->pmu);
	return;
}

x86_perf_event_update() is there to handle the !reload_times case you mention below. Because the PMU is disabled, nothing changes for the other cases.


+again:
+	prev_raw_count = local64_read(&hwc->prev_count);
+	rdpmcl(hwc->event_base_rdpmc, new_raw_count);
+
+	if (local64_cmpxchg(&hwc->prev_count, prev_raw_count,
+			    new_raw_count) != prev_raw_count)
+		goto again;
+
+	/*
+	 * Now we have the new raw value and have updated the prev
+	 * timestamp already. We can now calculate the elapsed delta
+	 * (event-)time and add that to the generic event.
+	 *
+	 * Careful, not all hw sign-extends above the physical width
+	 * of the count.
+	 *
+	 * event->count = period left from last time +
+	 *		  (reload_times - 1) * reload_val +
+	 *		  latency of PMI handler
+	 *
+	 * The period left from last time can be taken from -prev_count.
+	 * The starting point of counting is always -reload_val.
+	 * So the real latency of the PMI handler is reload_val + new_raw_count.
+	 */

That is very confused, the PMI latency is utterly unrelated to anything
you do here.

Will fix the confusing comments in V3.


+	delta = (reload_val << shift) + (new_raw_count << shift) -
+		(prev_raw_count << shift);
+	delta >>= shift;
+
+	local64_add(reload_val * (reload_times - 1), &event->count);
+	local64_add(delta, &event->count);

And this is still wrong I think. Consider the case where !reload_times.

We can easily call pmu::read() twice in one period. In that case we
should increment count with (new - prev).

Only once we get a new sample and are known to have wrapped, do we need
to consider that wrap.

The code as above should fix this issue.


+	local64_sub(delta, &hwc->period_left);
+
+	return x86_perf_event_set_period(event);
+}
+
 static void __intel_pmu_pebs_event(struct perf_event *event,
 				   struct pt_regs *iregs,
 				   void *base, void *top,
 				   int bit, int count)
 {
+	struct hw_perf_event *hwc = &event->hw;
 	struct perf_sample_data data;
 	struct pt_regs regs;
 	void *at = get_next_pebs_record_by_bit(base, top, bit);
-	if (!intel_pmu_save_and_restart(event) &&
-	    !(event->hw.flags & PERF_X86_EVENT_AUTO_RELOAD))
+	if (hwc->flags & PERF_X86_EVENT_AUTO_RELOAD) {
+		/*
+		 * Now, auto-reload is only enabled in fixed period mode.
+		 * The reload value is always hwc->sample_period.
+		 * May need to change it, if auto-reload is enabled in
+		 * freq mode later.
+		 */
+		intel_pmu_save_and_restart_reload(event, hwc->sample_period,
+						  count);

Since you pass in @event, hwc->sample_period is already available to it,
no need to pass that in as well.

OK. I will change it.

Thanks,
Kan



+	} else if (!intel_pmu_save_and_restart(event))
 		return;
 	while (count > 1) {