Re: [GIT PULL] perf tools changes for v6.10

From: Ian Rogers
Date: Sat May 25 2024 - 10:09:39 EST


On Fri, May 24, 2024 at 9:20 PM Linus Torvalds
<torvalds@xxxxxxxxxxxxxxxxxxxx> wrote:
>
> On Fri, 24 May 2024 at 20:48, Ian Rogers <irogers@xxxxxxxxxx> wrote:
> >
> > Thanks for the report. TL;DR try putting the PMU name with the event
> > name, so "cycles:pp" becomes "armv8_pmuv3_0/cycles/pp", where
> > armv8_pmuv3_0 is the name of the PMU from /sys/devices.
>
> What? No.
>
> If that is the new rule, then I'm just going to revert. I'm not going
> to use some random different PMU names across architectures when all I
> want is just "cycles".

The issue is that perf previously would do two things, if you asked
for a 'cycles' event then as the name was a legacy one it would encode
the type in the perf_event_attr as PERF_TYPE_HARDWARE and the config
as PERF_COUNT_HW_CPU_CYCLES job done. With BIG.little/hybrid/.. those
events now need opening on multiple PMUs otherwise you end up only
sampling on either the big or the little core. On Intel with hybrid,
cycles had to become cpu_core/cycles/ and cpu_atom/cycles/.

If we look at an event name that isn't a legacy one, like
inst_retired.any, then when no PMU is specified perf's behavior is to
try to open this event on every PMU it can. On Intel this event will
be found on the 'cpu' PMU, or on a hybrid Intel on cpu_core and
cpu_atom. Were the event 'data_read' then it could be found on the
uncore_imc_free_running_0 and uncore_imc_free_running_1 PMUs. My point
here is that wildcarding an event to every PMU that supports it is
arguably a feature as it avoids users needing to remember a specific
PMU name. PMU names are often duplicated and it would be laborious to
name them all.

> In fact, the "cycles:pp" is a complete red herring. Just doing a simple
>
> $ perf record sleep 1
>
> with no explicit expression at all, results in that same
>
> Error:
> cycles:P: PMU Hardware doesn't support sampling/overflow-interrupts.
> Try 'perf stat'
>
> because perf *ITSELF* uses the sane format by default.
>
> So ABSOLUTELY NO. We're not changing the rules to "You have to know
> some idiotic per-architecture JSON magic".
>
> I don't want any excuses like this. No "You are holding it wrong".
> This is a BUG. Treat it as such.
>
> And yes, "perf record -vv" shows that it apparently picked some insane
> arm_dsu_0 event. Which just shows that trying to parse the JSON rules
> instead of just having sane defaults is clearly not working.
>
> So here's the deal: I will revert your commit tomorrow unless you show
> that you are taking it seriously and have a sane fix.
>
> Because no, "You are holding it wrong" is not the answer. Never was,
> never will be. Things used to work, they don't work any more. That
> shit is not acceptable.

I wasn't trying to say you were holding it wrong, I was trying to give
you practical advice that would solve your problem whilst we determine
what the right fix is.

The question is what to do about events when no PMU is specified, here
are the alternatives I see:

1) make legacy event names special and only open them on core PMUs,
2) make sure PMUs, like arm_dsu_0 in your case, don't advertise events
matching legacy names like cycles,
3) make perf record only scan PMUs that support sampling,
4) in the case that we're using cycles:P as a default event, only open
that on core PMUs (on Intel make cycles:P ->
cpu_core/cycles/P,cpu_atom/cycles/P).

A revert achieves (1) and I consider it a regression as now the
behavior of 'cycles' doesn't match 'inst_retired.any' and the user
somehow needs to know that certain event names are special.
Changing the driver (2) is a workaround, but it feels like a burden on
the driver writers.
In order to determine whether a PMU supports sampling (3) we'd need to
try probe using perf_event_open. The perf_event_open may fail due to
things like host/guest permissions, etc. so we need fallbacks. We do
something similar here:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/perf/util/print-events.c#n242
(4) is the simplest change and I think it lowers the surprise from the
behavior change in the patch. It means that were you to do "perf
record -e cycles ..." you would still need to specify a PMU on Linus'
special ARM box. We may lack permissions to read sysfs and so then it
is hard to determine the PMU name, but the code was likely to fail
then anyway.

I'll write a patch to do (4), hopefully this qualifies as being
serious, but it would be useful if the interested parties could work
out what they think the behavior should be.

Thanks,
Ian

> Linus
>