Re: [RESEND PATCH V2 1/4] perf/x86/intel: fix event update for auto-reload
From: Peter Zijlstra
Date: Wed Jan 24 2018 - 07:26:32 EST
On Mon, Jan 08, 2018 at 07:15:13AM -0800, kan.liang@xxxxxxxxx wrote:
> The formula to calculate the event->count is as below:
> event->count = period left from last time +
> (reload_times - 1) * reload_val +
> latency of PMI handler
>
> prev_count is the last observed hardware counter value. As in the
> non-auto-reload case, its absolute value is the period of the first
> record. It should not be updated on each reload, because the hardware
> counter is not 'observed' on each auto-reload.
>
> For the second and later records, the period is exactly the reload
> value, so we simply add (reload_times - 1) * reload_val to
> event->count.
>
> The calculation of the PMI handler latency is a little different from
> the non-auto-reload case, because the counting start point is
> -reload_value; it needs to be adjusted by adding reload_value.
> period_left needs the same adjustment.
What's this about the PMI latency? We don't care about that in any other
situation, right? Sure, the PMI takes a bit of time, but we're not
correcting for that anywhere else, so why start now?
> There is nothing to do in x86_perf_event_set_period(), because the
> period is fixed and period_left is already adjusted.
Fixes tag is missing.
> Signed-off-by: Kan Liang <kan.liang@xxxxxxxxxxxxxxx>
> ---
> arch/x86/events/intel/ds.c | 69 ++++++++++++++++++++++++++++++++++++++++++++--
> 1 file changed, 67 insertions(+), 2 deletions(-)
>
> diff --git a/arch/x86/events/intel/ds.c b/arch/x86/events/intel/ds.c
> index 3674a4b..cc1f373 100644
> --- a/arch/x86/events/intel/ds.c
> +++ b/arch/x86/events/intel/ds.c
> @@ -1251,17 +1251,82 @@ get_next_pebs_record_by_bit(void *base, void *top, int bit)
> return NULL;
> }
>
> +/*
> + * Specific intel_pmu_save_and_restart() for auto-reload.
> + */
> +static int intel_pmu_save_and_restart_reload(struct perf_event *event,
> + u64 reload_val,
> + int reload_times)
> +{
> + struct hw_perf_event *hwc = &event->hw;
> + int shift = 64 - x86_pmu.cntval_bits;
> + u64 prev_raw_count, new_raw_count;
> + u64 delta;
> +
> + if ((reload_times == 0) || (reload_val == 0))
> + return intel_pmu_save_and_restart(event);
Like Jiri, I find this confusing at best. If we need to call that one,
you shouldn't have called this function to begin with.
At best, have a WARN here or something.
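A minimal sketch of that suggestion (the WARN_ON_ONCE guard is my
illustration, not code from the patch; callers would then be required
to only invoke this for auto-reload events):

	if (WARN_ON_ONCE(!reload_times || !reload_val))
		return 0;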
> +
> + /*
> + * Careful: an NMI might modify the previous event value.
> + *
> + * Our tactic to handle this is to first atomically read and
> + * exchange a new raw count - then add that new-prev delta
> + * count to the generic event atomically:
> + */
For now this seems to only get called from *drain_pebs* which afaict
only happens when we've disabled the PMU (either from sched_task or
PMI).
Now, you want to put this in the pmu::read() path, and that does not
disable the PMU, but I don't think we can drain the PEBS buffer while
it's active; that's too full of races, so even there you'll have to
disable stuff.
So I don't think this is correct/desired for this case.
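Something like the below is what I'd expect for the read path; a sketch
only, the helper name is made up, but intel_pmu_drain_pebs_buffer()
already exists in ds.c:

	static void intel_pmu_auto_reload_read(struct perf_event *event)
	{
		/* Draining PEBS while the PMU is live is racy; stop it first. */
		perf_pmu_disable(event->pmu);
		intel_pmu_drain_pebs_buffer();
		perf_pmu_enable(event->pmu);
	}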
> +again:
> + prev_raw_count = local64_read(&hwc->prev_count);
> + rdpmcl(hwc->event_base_rdpmc, new_raw_count);
> +
> + if (local64_cmpxchg(&hwc->prev_count, prev_raw_count,
> + new_raw_count) != prev_raw_count)
> + goto again;
> +
> + /*
> + * Now we have the new raw value and have updated the prev
> + * timestamp already. We can now calculate the elapsed delta
> + * (event-)time and add that to the generic event.
> + *
> + * Careful, not all hw sign-extends above the physical width
> + * of the count.
> + *
> + * event->count = period left from last time +
> + * (reload_times - 1) * reload_val +
> + * latency of PMI handler
> + *
> + * The period left from last time can be obtained from -prev_count.
> + * The starting point of counting is always -reload_val.
> + * So the real latency of PMI handler is reload_val + new_raw_count.
> + */
That is very confused; the PMI latency is utterly unrelated to anything
you do here.
> + delta = (reload_val << shift) + (new_raw_count << shift) -
> + (prev_raw_count << shift);
> + delta >>= shift;
> +
> + local64_add(reload_val * (reload_times - 1), &event->count);
> + local64_add(delta, &event->count);
And this is still wrong, I think. Consider the case where !reload_times.
We can easily call pmu::read() twice in one period. In that case we
should increment count with (new - prev).
Only once we get a new sample and are known to have wrapped do we need
to account for that wrap.
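That is, something like the following accounting (a sketch; @n is the
number of overflows actually observed since the last read, so n == 0
for a second read within the same period):

	/*
	 * The counter runs over [-period, 0]; between two reads with
	 * n wraps in between, the elapsed count is:
	 *
	 *	new - old + n * period
	 */
	new = ((s64)(new_raw_count << shift) >> shift);
	old = ((s64)(prev_raw_count << shift) >> shift);
	local64_add(new - old + n * period, &event->count);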
> + local64_sub(delta, &hwc->period_left);
> +
> + return x86_perf_event_set_period(event);
> +}
> +
> static void __intel_pmu_pebs_event(struct perf_event *event,
> struct pt_regs *iregs,
> void *base, void *top,
> int bit, int count)
> {
> + struct hw_perf_event *hwc = &event->hw;
> struct perf_sample_data data;
> struct pt_regs regs;
> void *at = get_next_pebs_record_by_bit(base, top, bit);
>
> - if (!intel_pmu_save_and_restart(event) &&
> - !(event->hw.flags & PERF_X86_EVENT_AUTO_RELOAD))
> + if (hwc->flags & PERF_X86_EVENT_AUTO_RELOAD) {
> + /*
> + * For now, auto-reload is only enabled in fixed period mode,
> + * so the reload value is always hwc->sample_period.
> + * This may need to change if auto-reload is later enabled in
> + * freq mode.
> + */
> + intel_pmu_save_and_restart_reload(event, hwc->sample_period,
> + count);
Since you pass in @event, hwc->sample_period is already available to it;
there's no need to pass that in as well.
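That is, a reduced signature along these lines (illustrative):

	static int intel_pmu_save_and_restart_reload(struct perf_event *event,
						     int reload_times)
	{
		u64 reload_val = event->hw.sample_period;
		/* ... as before ... */
	}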
> + } else if (!intel_pmu_save_and_restart(event))
> return;
>
> while (count > 1) {