Re: [PATCH v6 1/2] perf pmus: Sort/merge/aggregate PMUs like mrvl_ddr_pmu

From: Robin Murphy
Date: Wed Jun 12 2024 - 10:00:32 EST


On 12/06/2024 1:32 pm, Ian Rogers wrote:
On Wed, Jun 12, 2024 at 4:19 AM Aishwarya TCV <aishwarya.tcv@xxxxxxx> wrote:



On 15/05/2024 07:01, Ian Rogers wrote:
The mrvl_ddr_pmu is uncore and has a hexadecimal address suffix while
the previous PMU sorting/merging code assumes uncore PMU names start
with uncore_ and have a decimal suffix. Because of the previous
assumption it isn't possible to wildcard the mrvl_ddr_pmu.

Modify pmu_name_len_no_suffix but also remove the suffix number out
argument, this is because we don't know if a suffix number of say 100
is in hexadecimal or decimal. As the only use of the suffix number is
in comparisons, it is safe there to compare the values as hexadecimal.
Modify perf_pmu__match_ignoring_suffix so that hexadecimal suffixes
are ignored.

Only allow hexadecimal suffixes to be greater than length 2 (ie 3 or
more) so that S390's cpum_cf PMU doesn't lose its suffix.

Change the return type of pmu_name_len_no_suffix to size_t to
workaround GCC incorrectly determining the result could be negative.

Signed-off-by: Ian Rogers <irogers@xxxxxxxxxx>
---
tools/perf/util/pmu.c | 33 +++++++++++++--------
tools/perf/util/pmus.c | 67 ++++++++++++++++++++++++------------------
tools/perf/util/pmus.h | 7 ++++-
3 files changed, 65 insertions(+), 42 deletions(-)


Hi Ian,

Perf test "perf_all_PMU_test" is failing when run against
next-master(next-20240612) kernel with Arm64 on JUNO in our CI. It looks
like it is failing when run on JUNO alone. Verified by running on other
boards like RB5 and Ampere_altra and confirming that it does not fail on
these boards. Suspecting that the suffixed 'armv8_pmuv3_0' naming could
be the reason of test failure.

Reverting the change (3241d46f5f54) seems to fix it.

This works fine on Linux version v6.10-rc3

Failure log
------------
110: perf all PMU test:
--- start ---
test child forked, pid 8279
Testing armv8_pmuv3/br_immed_retired/
Event 'armv8_pmuv3/br_immed_retired/' not printed in:
# Running 'internals/synthesize' benchmark:
Computing performance of single threaded perf event synthesis by
synthesizing events on the perf process itself:
Average synthesis took: 1169.431 usec (+- 0.144 usec)
Average num. events: 35.000 (+- 0.000)
Average time per event 33.412 usec
Average data synthesis took: 1225.698 usec (+- 0.102 usec)
Average num. events: 119.000 (+- 0.000)
Average time per event 10.300 usec

Performance counter stats for 'perf bench internals synthesize':

3263664785 armv8_pmuv3_0/br_immed_retired/


25.472854464 seconds time elapsed

8.004791000 seconds user
17.060209000 seconds sys
---- end(-1) ----
110: perf all PMU test :
FAILED!

Hi Aishwarya,

Thanks for reporting an issue. The test should be pretty self
explanatory: it is doing a `perf stat -e
armv8_pmuv3/br_immed_retired/` and then looking for that in the
output. The event armv8_pmuv3/br_immed_retired/ comes from running
perf list. As you can see in the output the event did work, so perf
stat is working so nothing is actually broken here. What isn't working
is the perf stat output matching the command line event and this is
because of the unnecessary suffix on ARM's PMU name.

We have a problem that ARM have buggy PMU drivers, either from
introducing new naming conventions or by just being broken:
https://lore.kernel.org/lkml/CAP-5=fWNDkOpnYF=5v1aQkVDrDWsmw+zYX1pjS8hoiYVgZsRGA@xxxxxxxxxxxxxx/
I've also asked that ARM step up their testing, for example in the
event parsing testing the PMU is hardcoded to the x86 PMU name of
'cpu':
https://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git/tree/tools/perf/tests/parse-events.c?h=perf-tools-next#n2317
On a cortex A53, then PMU is named 'armv8_cortex_a53':
```
$ ls /sys/devices/armv8_cortex_a53/
caps cpus events format perf_event_mux_interval_ms power
subsystem type uevent
```
This name appears better, so what's up with ARM's core PMU name?

With Devicetree, we are able to derive a descriptive PMU name from the compatible string provided by the DT. Under ACPI, however, all we get told is whether each CPU has a usable PMU or not, so the best we can do is work out how many different CPU microarchitectures we have overall and create a PMU instance for each type. We still don't know *what* each one is, just that they're different, hence ending up with a common name plus a suffix which we can increment for disambiguation if and when we do see something new - userspace can still piece together the "cpus" lists and MIDRs to figure out what's what, we just can't do much in the kernel itself.

Anyway, I'm tempted to fix this by just skipping the test on ARM given
ARM's overall broken state.

This isn't a driver issue, it's a "the behaviour of 'perf list' changed inconsistently" issue. I also had a brief dig into this using a different arm64 ACPI system, and I think I can broadly characterise the cause. This is prior to 3241d46f5f54:

root@crazy-taxi:~# ./perf-mainline list armv8_pmuv3

List of pre-defined events (to be used in -e or -M):


armv8_pmuv3_0:
L1-dcache-loads OR armv8_pmuv3_0/L1-dcache-loads/
L1-dcache-load-misses OR armv8_pmuv3_0/L1-dcache-load-misses/
L1-icache-loads OR armv8_pmuv3_0/L1-icache-loads/
L1-icache-load-misses OR armv8_pmuv3_0/L1-icache-load-misses/
dTLB-loads OR armv8_pmuv3_0/dTLB-loads/
dTLB-load-misses OR armv8_pmuv3_0/dTLB-load-misses/
iTLB-loads OR armv8_pmuv3_0/iTLB-loads/
iTLB-load-misses OR armv8_pmuv3_0/iTLB-load-misses/
branch-loads OR armv8_pmuv3_0/branch-loads/
branch-load-misses OR armv8_pmuv3_0/branch-load-misses/
l3d_cache_wb OR armv8_pmuv3_0/l3d_cache_wb/ [Kernel PMU event]


And this is after:

root@crazy-taxi:~# ./perf-next list armv8_pmuv3

List of pre-defined events (to be used in -e or -M):


armv8_pmuv3_0:
L1-dcache-loads OR armv8_pmuv3_0/L1-dcache-loads/
L1-dcache-load-misses OR armv8_pmuv3_0/L1-dcache-load-misses/
L1-icache-loads OR armv8_pmuv3_0/L1-icache-loads/
L1-icache-load-misses OR armv8_pmuv3_0/L1-icache-load-misses/
dTLB-loads OR armv8_pmuv3_0/dTLB-loads/
dTLB-load-misses OR armv8_pmuv3_0/dTLB-load-misses/
iTLB-loads OR armv8_pmuv3_0/iTLB-loads/
iTLB-load-misses OR armv8_pmuv3_0/iTLB-load-misses/
branch-loads OR armv8_pmuv3_0/branch-loads/
branch-load-misses OR armv8_pmuv3_0/branch-load-misses/
l3d_cache_wb OR armv8_pmuv3/l3d_cache_wb/ [Kernel PMU event]

See the difference in the last line - it appears that CPU PMU events which map to common hardware/cache events *do* still report the full PMU name, but any PMU-type-specific events show a truncated name in list and thus fail the test's strict match against the full name reported by stat.

Thanks,
Robin.