Re: broken cycle counts from perf record in frequency mode [Was: Re: deducing CPU clock rate over time from cycle samples]

From: Milian Wolff
Date: Tue Sep 05 2017 - 08:26:46 EST


On Tuesday, September 5, 2017 5:40:58 AM CEST Andi Kleen wrote:
> > The cycle value gets associated with a sample via its period value, which
> > is used by `perf report` in the analysis. If I get a single "broken"
> > sample with
>
> I always thought it just used the number of samples?

No, that is not the case. It uses the cycle period as "weight" by default.
Note also the recent patch set for `perf annotate` by Taeung Song, which adds
the ability to switch between sample-count and period fractions. I'm actually
not sure whether that exists for `perf report` yet.

In situations where the cycle period values associated with the samples are
correct, this weighting is genuinely useful: I have seen perf data files where
tons of samples were recorded around syscall entry/exit, but most of those
samples only had tiny cycle values associated with them. Counting samples
alone makes the syscalls look expensive, while in reality far more cycles were
executed in userspace. This discrepancy shows up when comparing `perf report`
values against the FlameGraph visualization created by the standard
`stackcollapse-perf.pl` script: the script only counts samples, whereas
`perf report` weighs each sample by its period. Hotspot originally also only
counted samples, but I have since adapted it to match perf's period-based
behavior.
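The difference between the two attribution schemes can be sketched with a few
lines of Python. The symbol names and numbers below are invented for
illustration; they mimic the scenario described above, where many
tiny-period syscall samples sit next to fewer large-period userspace samples:

```python
# Illustrative sketch (invented numbers): attribute cost by raw sample
# count (as stackcollapse-perf.pl does) vs. by summed cycle period
# (as perf report does by default).
samples = (
    [("syscall_entry", 1_000)] * 900           # many samples, tiny periods
    + [("userspace_hot_loop", 500_000)] * 100  # few samples, large periods
)

def by_sample_count(samples):
    # Fraction of total samples per symbol.
    counts = {}
    for sym, _period in samples:
        counts[sym] = counts.get(sym, 0) + 1
    total = sum(counts.values())
    return {sym: n / total for sym, n in counts.items()}

def by_period(samples):
    # Fraction of total cycles (summed periods) per symbol.
    cycles = {}
    for sym, period in samples:
        cycles[sym] = cycles.get(sym, 0) + period
    total = sum(cycles.values())
    return {sym: c / total for sym, c in cycles.items()}

print(by_sample_count(samples))  # syscall_entry looks ~90% hot
print(by_period(samples))        # but it holds under 2% of the cycles
```

With these numbers, the sample-count view attributes 90% of the cost to the
syscall path, while the period-weighted view attributes under 2% to it, which
is the distortion described above.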

> > a cycle count of, say 1E14 and then a million other samples, each with
> > "sane" cycle counts of let's say 1E5, then the one broken sample will
> > hold 50% of the total amount of measured cycles. So perf report will show
> > that the function where the broken sample points to will have a cost of
> > 50%.
>
> I don't think I've seen such a situation. Did you?

I have seen this situation; it is what this current revival of the thread is
all about. If I had not run into it, I would not claim it is such a serious
problem.

> BTW I'm not arguing against fixing it, but typically I just
> recommend to avoid frequency mode. The fast sampling at the beginning
> has caused a range of low level PMU bugs and it is hard to reason about
> because of its complex behavior. Also it has no protection against
> synchronizing with repeated patterns in the execution, which
> can cause bad shadowing effects. If you use the Intel
> event aliases they have all sensible periods set by default.

I think we both agree that the frequency mode as-is should not be the default.
But it is, and in my opinion that is a serious problem. We need to find a
sensible default for the event period and use that mode by default. Nowadays I
always add something like `-c 3000000`, which gives me roughly 1k samples per
second on a ~3GHz machine. It's just a ballpark figure and is of course
influenced by frequency scaling, but it's good enough for me. A similar
approach could be used to derive a period automatically from the maximum CPU
clock rate. Of course, that would only work for cycles, and not for
instructions or any of the other fancy event counters. But since the frequency
mode is probably similarly broken there, it should not be the default either.
Better to ask the user for explicit values than to do something automatic that
can lead to broken results.
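That automatic-period idea could be sketched roughly as below. The helper
names are hypothetical; the sysfs path is the standard Linux cpufreq location
for the maximum CPU frequency (reported in kHz), with a fallback guess when it
is unavailable:

```python
# Sketch of the proposed heuristic: derive a fixed `cycles` period from
# the CPU's maximum clock rate so sampling lands near a target rate.

def default_cycles_period(max_hz, target_samples_per_sec=1000):
    # Period in cycles between samples: clock rate / desired sample rate.
    return max_hz // target_samples_per_sec

def read_max_cpu_hz(fallback_hz=3_000_000_000):
    # Standard Linux cpufreq sysfs file; the value is in kHz.
    try:
        with open("/sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_max_freq") as f:
            return int(f.read()) * 1000
    except OSError:
        return fallback_hz  # assume ~3 GHz if cpufreq is unavailable

print(default_cycles_period(3_000_000_000))  # -> 3000000, i.e. `-c 3000000`
```

On a 3 GHz machine this reproduces the `-c 3000000` ballpark mentioned above;
actual sample rates will still drift with frequency scaling, which is exactly
why it is only a heuristic.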

Cheers

--
Milian Wolff | milian.wolff@xxxxxxxx | Senior Software Engineer
KDAB (Deutschland) GmbH&Co KG, a KDAB Group company
Tel: +49-30-521325470
KDAB - The Qt Experts
