Re: [PATCH v1] perf evlist: Force adding default events only to core PMUs

From: Ian Rogers
Date: Tue May 28 2024 - 13:40:13 EST


On Tue, May 28, 2024 at 10:01 AM Linus Torvalds
<torvalds@xxxxxxxxxxxxxxxxxxxx> wrote:
>
> On Mon, 27 May 2024 at 22:37, Ian Rogers <irogers@xxxxxxxxxx> wrote:
> >
> > If you do:
> >
> > $ perf stat -e cycles ...
>
> You always talk about "perf stat", because you want to ignore a big
> part of the issueL
>
> > The issue is about legacy events.
>
> No.
>
> The issue isn't "perf stat".
>
> That works. Even with the broken version.
>
> > I have a patch that's WIP for this, but I also think we could
> > also agree that when >1 PMU advertises an event, perf's behavior when
> > matching should be to open all such events. You avoid this by
> > specifying a PMU name.
>
> Christ. You're still ignoring the elephant in the room.
>
> Stop using "perf stat" as an example. It's bogus. You're ignoring the issue.
>
> Lookie here:
>
> $ perf stat make
> ...
> Performance counter stats for 'make':
>
> 5,262.78 msec task-clock #
> 1.195 CPUs utilized
> 46,891 context-switches #
> 8.910 K/sec
> 6 cpu-migrations #
> 1.140 /sec
> 198,402 page-faults #
> 37.699 K/sec
> 10,238,763,227 cycles #
> 1.946 GHz
> 16,280,644,955 instructions # 1.59
> insn per cycle
> 3,252,743,314 branches #
> 618.066 M/sec
> 83,702,386 branch-misses #
> 2.57% of all branches
>
> 4.405792739 seconds time elapsed
>
> 4.287784000 seconds user
> 0.921132000 seconds sys
>
> but then
>
> $ perf record make
> Error:
> cycles:P: PMU Hardware doesn't support
> sampling/overflow-interrupts. Try 'perf stat'
>
> because that broken thing (a) picked a different cycles than before
> and (b) your argument that it should pick both IS WRONG BECAUSE ONE OF
> THEM DOESN'T EVEN WORK.

By default it picked "cycles:P" with this patch it now picks "<core
pmu>/cycles/P". As your Tested-by attested this does work.

> Why is this so hard to just accept? Why do you keep mis-stating the problem?

I'm not. What your saying is that the arm_dsu_0 PMU advertising cycles
doesn't work if you use it for perf record. I agree. This change fixed
the issue. What to do if someone writes "perf record -e cycles" well
there are two advertised cycles events isn't it reasonable perf record
try to open them both?

> How hard is it to realize that I DO NOT WANT "perf stat"? The perf
> error message is bogus crap. If I ask for a "perf record", it
> shouldn't pick the wrong PMU that can't do it, and then say "do you
> want to do something else"?

It's not about perf record having some kind of intelligence and
picking a PMU. perf record either opens an event on the PMU you ask
for it it wild cards across them all. For legacy events they used to
be handled differently and I'm trying to fix this. In part so I can
fix the PMU behavior on the Apple M1 and later CPUs that fail to
implement legacy events properly in their PMU driver. You've said you
used to use Apple CPUs for ARM testing, I'm trying to fix a problem
that will help you.

> I don't care a whit for "legacy events". I care about the fact that
> you changed the code to pick the WRONG event, and then you are blaming
> anything but that.

Sure, I added "if user == linus then pick wrong PMU". The code was
reviewed by IBM and Intel. ARM were on the CC list. The change baked
on linux-next for a good long while. All of this points to my problem
that I'm often fixing problems for ARM with a complete lack of
testing/reviewing/acking... by them.

> If perf would go "Oh, this one doesn't support 'record', let's see if
> another one does", it would all be good.
>
> If perf would go "Oh, let's prioritize core events", it would all be good.

But as I've pointed out it wouldn't because then you'd break the
behavior of doing things like gathering memory bandwidth from uncore
PMUs. An example:
```
$ sudo perf stat -e data_read -a sleep 1

Performance counter stats for 'system wide':

4,447.10 MiB data_read

1.001748581 seconds time elapsed
```
By making the wildcard only work for core PMUs the best case is you'd make this:
```
$ sudo perf stat -e 'imc_free_running/data_read/' -a sleep 1

Performance counter stats for 'system wide':

4,454.56 MiB imc_free_running/data_read/

1.001865448 seconds time elapsed
```
But that wouldn't work again as ARM decided to mess up the naming
convention, my unmerged fix for that is here:
https://lore.kernel.org/lkml/20240515060114.3268149-1-irogers@xxxxxxxxxx/
it says Marvell but Marvell followed ARM's lead.

> But no. Your patch actively picked a bad event, and then you try to
> blame some "legacy" thing.
>
> Yes, the legacy thing picked the right event, but it's not even about
> legacy. You could have picked the right event any number of other
> ways.
>
> It's about "it damn well worked when you didn't go out of your way to
> pick the wrong event".
>
> In other words, this isn't about "legacy" and "new".

I'm not clear what you think is "new". All the events are being picked
in your case from sysfs, the way this has all worked is years if not
decades old. What is new is that because of an event name the behavior
should be uniform, motivated initially by fixing your other ARM test
platform of Apple.

> This is about "right" and "wrong". The old code picked right - for
> whatever reasons. The new code picked wrong - also for whatever
> reasons.
>
> Don't try to make it be anything else. Just admit that the new code
> picked the wrong PMU, instead of trying to make excuses for it.

I agree it picked the wrong PMU for default events. This was a problem
on no systems that anybody was bothering to test with. Having been
made aware of the issue I fixed it in this patch, you're welcome.

What is still not clear from this is what should the behavior be of:

$ perf record -e cycles ...

Should it wildcard all events and open them on all PMUs potentially
failing? Well this has always been perf's behavior were the event:

$ perf record -e inst_retired.any ...

where inst_retired.any could be an event advertised on an accelerator
or device where sampling doesn't work. If inst_retired.any doesn't
work for you as an example, pick another event that does. A GPU has
instructions and cycles so the likelihood of naming conflicts is high.

We can make perf record ignore opening events on PMUs that don't
support sampling, it's an invasive and non-trivial change not suited
to landing in 6.10. It is also a behavior change, see this thread for
how popular those are.

So 6.10 is now in a mess. We likely fail tests, reverting this change
has a bunch of consequences and presumably I'm expected to dig
through, figure those out and then provide fixes. Thanks!

For 6.11 I currently suggest we revert the revert and apply this
patch. This would also I think be the best thing to do for 6.10. I
appreciate my opinions are worth much less than others. I don't see
why the priority should be to fix things on an ARM system that nobody
is actively testing on rather than say Apple devices fixed by the
reverted change, RISC-V system, etc. Anyway...

Thanks,
Ian

> Linus
>