Re: [REGRESSION] Perf (userspace) broken on big.LITTLE systems since v6.5

From: James Clark
Date: Tue Aug 20 2024 - 04:58:23 EST




On 17/08/2024 2:38 am, Atish Kumar Patra wrote:
On Fri, Aug 16, 2024 at 8:30 AM Ian Rogers <irogers@xxxxxxxxxx> wrote:

On Fri, Aug 16, 2024 at 2:23 AM James Clark <james.clark@xxxxxxxxxx> wrote:



On 15/08/2024 6:29 pm, Ian Rogers wrote:
On Wed, Aug 14, 2024 at 9:28 AM James Clark <james.clark@xxxxxxxxxx> wrote:
On 07/08/2024 9:54 am, Thorsten Leemhuis wrote:
On 01.08.24 21:05, Ian Rogers wrote:
On Wed, Dec 6, 2023 at 4:09 AM Linux regression tracking #update
(Thorsten Leemhuis) <regressions@xxxxxxxxxxxxx> wrote:

[TLDR: This mail in primarily relevant for Linux kernel regression
tracking. See link in footer if these mails annoy you.]

On 22.11.23 00:43, Bagas Sanjaya wrote:
On Tue, Nov 21, 2023 at 09:08:48PM +0900, Hector Martin wrote:
Perf broke on all Apple ARM64 systems (tested almost everything), and
according to maz also on Juno (so, probably all big.LITTLE) since v6.5.

#regzbot fix: perf parse-events: Make legacy events lower priority than
sysfs/JSON
#regzbot ignore-activity

Note, this is still broken.

Hmmm, so all that became somewhat messy. Arnaldo, what's the way out of
this? Or is this a "we are screwed one way or another and someone has to
bite the bullet" situation?

Ciao, Thorsten

The patch changed the priority in the case
that you do something like:

$ perf stat -e 'armv8_pmuv3_0/cycles/' benchmark

but if you do:

$ perf stat -e 'cycles' benchmark

then the broken behavior will happen as legacy events have priority
over sysfs/json events in that case. To fix this you need to revert:
4f1b067359ac Revert "perf parse-events: Prefer sysfs/JSON hardware
events over legacy"

This causes some testing issues resolved in this unmerged patch series:
https://lore.kernel.org/lkml/20240510053705.2462258-1-irogers@xxxxxxxxxx/

There is a bug as the arm_dsu PMU advertises an event called "cycles"
and this PMU is present on Ampere systems. Reverting the commit above
will cause an issue as the commit 7b100989b4f6 ("perf evlist: Remove
__evlist__add_default") to fix ARM's BIG.little systems (opening a
cycles event on all PMUs not just 1) will cause the arm_dsu event to
be opened by perf record and fail as the event won't support sampling.

The patch https://lore.kernel.org/lkml/20240525152927.665498-1-irogers@xxxxxxxxxx/
fixes this by only opening the cycles event on core PMUs when choosing
default events.

Rather than take this patch the revert happened as Linus runs the
command "perf record -e cycles:pp" (ie using a specified event and not
defaults) and considers it a regression in the perf tool that on an
Ampere system to need to do "perf record -e
'armv8_pmuv3_0/cycles/pp'". It was pointed out that not specifying -e
will choose the cycles event correctly and with better precision the
pp for systems that support it, but it was still considered a
regression in the perf tool so the revert was made to happen. There is
a lack of perf testing coverage for ARM, in particular as they choose
to do everything in a different way to x86. The patch in question was
in the linux-next tree for weeks without issues.

ARM/Ampere could fix this by renaming the event from cycles to
cpu_cycles, or by following Intel's convention that anything uncore
uses the name clockticks rather than cycles. This could break people
who rely on an event called arm_dsu/cycles/ but I imagine such people
are rare. There has been no progress I'm aware of on renaming the
event.

Making perf not terminate on opening an event for perf record seems
like the most likely workaround as that is at least something under
the tool maintainers control. ARM have discussed doing this on the
lists:
https://lore.kernel.org/lkml/f30f676e-a1d7-4d6b-94c1-3bdbd1448887@xxxxxxx/
but since the revert in v6.10 no patches have appeared for the v6.11
merge window. Feature work like coresight improvements and ARMv9 are
being actively pursued by ARM, but feature work won't resolve this
regression.


I got some hardware with the DSU PMU so I'm going to have a go at trying
to send some fixes for this. My initial idea was to try incorporate the
"not terminate on opening" change as discussed in the link directly
above. And then do the revert of the "revert of prefer sysfs/json".

Thanks, I think this would be good. The biggest issue is that none of
the record logic expects a file descriptor to be not opened, deleting
unopened evsels from the evlist breaks all the indexing into the
mmaps, etc. Tbh, you probably wouldn't do the code this way if was
written afresh. Perhaps a hashmap would map from an evsel to ring
buffer mmaps, etc. Trying to avoid having global state and benefitting
from encapsulation. I'd focus on just doing the expedient thing in the
changes, which probably just means making the record code tolerant of
evsels that fail to open and not modifying the evlist due to the risk
it breaks the indices.


Thanks for the tips.

(To point out the obvious, this work wouldn't be necessary if arm_dsu
event were renamed from "cycles" to "cpu_cycles" which would also make
it more intention revealing alongside the arm_dsu's "bus_cycles" event
name).


I understand but I can imagine the following conversation if we rename that:

User: "I updated my kernel and now my (non Perf) tool fails to open
the DSU cycles event because it doesn't exist anymore"

Linus/maintainers: "Oh ok yes that was a userspace breaking change,
lets revert it"

Just because Perf can handle 3 different names for cycles doesn't mean
other tools can.

cycles was a bad event name, dsu is a terrible name for what is mainly
the l3 cache, the risk that the two are combined get broken I'm fine
with as neoverse users with uncore permissions are say much rarer than
Apple M users. Having a cycles and a bus_cycles event is already
ambiguous, they sound the same. Renaming cycles to cpu_cycles would be
best.

FWIW I don't think Juno currently is broken if the kernel supports
extended type ID? I could have missed some output in this thread but it
seems like it's mostly related to Apple M hardware. I'm also a bit
confused why the "supports extended type" check fails there, but maybe
the v6.9 commit 25412c036 from Mark is missing?

So I think your later emails clarify Arnaldo is probably missing:
https://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git/commit/drivers/perf/arm_pmu.c?h=perf-tools-next&id=5c816728651ae425954542fed64d21d40cb75a9f

Fwiw, the Apple M hardware issue came to me by way of Mark Rutland
(iirc), this regression report, etc. My understanding is that Apple M
has something like a v2 ARM PMU and the legacy events are encoded
incorrectly in the driver for this. The regression in v6.5 happened

I'm not sure about that. The M PMU events may be incomplete, but the two
that are there have a mapping that looks sane:

static const unsigned m1_pmu_perf_map[PERF_COUNT_HW_MAX] = {
PERF_MAP_ALL_UNSUPPORTED,
[PERF_COUNT_HW_CPU_CYCLES] = M1_PMU_PERFCTR_CPU_CYCLES,
[PERF_COUNT_HW_INSTRUCTIONS] = M1_PMU_PERFCTR_INSTRUCTIONS,
/* No idea about the rest yet */
};

And they map to the same named events:

static struct attribute *m1_pmu_event_attrs[] = {
M1_PMU_EVENT_ATTR(cycles, M1_PMU_PERFCTR_CPU_CYCLES),
M1_PMU_EVENT_ATTR(instructions, M1_PMU_PERFCTR_INSTRUCTIONS),
NULL,
};

So in this case I can't see using legacy vs sysfs events making a
difference. Maybe there is some other case that was mentioned in a
previous thread that I missed though.

No idea, iirc Mark Rutland requested not to use legacy events for Apple M.

because ARM's core PMUs had previously been treated as uncore PMUs,
meaning we wouldn't try to program legacy events on them. Fixing the
handling of ARM's core PMUs broke Apple M due to the broken legacy
event mappings. Why not fix the Apple M PMU driver? Well there was
anyway a similar RISC-V issue reported by Atish Patra (iirc) where the
RISC-V PMU driver wants to delegate the mapping of legacy events to
the perf tool so the driver needn't be aware of all and future RISC-V
configurations. The fix discussed with Mark, Atish, etc. has been to
swap the priority of legacy and sysfs/json events so that the latter
has priority. We need the revert of the revert as currently we only do
this if a PMU is specified with an event, not for the general wildcard
PMUs case that most people use. There was huge fallout from flipping

Yep makes sense to do the revert if RISC-V isn't going to support any
legacy events. Although from what I understand that would technically
only require JSON to be the highest priority? Because putting named
events in sysfs still requires kernel involvement so doesn't get you any
further than supporting the legacy events?

The sysfs and json event handling is interwoven, for example you can
add to a sysfs event with json information. There are basically two
approaches in the event parser, hardcoded legacy things and event
names (optionally with PMU names). I'm trying to get rid of the
hardcoded legacy things as they were fine when you had a single core
type, but I want to have events everywhere - say instructions and
cycles on a GPU so we can IPC on a GPU. For RISC-V as long as the
legacy events are covered as names in json and json/sysfs has priority
over legacy then things will be fine.


RISC-V does want to support legacy events as that's how users on other
architectures are used to
run perf. It would be weird if we don't support it.

Our initial reasoning behind relying on json for legacy events to
avoid vendor specific encodings for these
events in the driver. Unlike other ISAs, RISC-V ISA doesn't define an
event encoding for these legacy
events. As a result every platform vendor will have custom encoding.
Managing them in the driver is
cumbersome. Many thanks to Ian for posting the patches to reverse the
priority which works fine for RISC-V.

However, I understand that it is easier said than done and some use
cases are broken. We also discovered
there are few other use cases which still have the same problem even
if we solve the bigger problem via json parsing
for legacy events.

1. Any other user profiling application that invokes perf system calls
directly may also try to just legacy event attributes in
perf_event_attr.
Android simpleperf application also falls in this category. We need to
describe the platform specific encoding somewhere for these
applications.


I think this use case is important. Not just for profiling applications but even something that wants to monitor itself. I imagine opening PERF_COUNT_HW_CPU_CYCLES or INSRUCTIONS is actually somewhat common, and I don't think every application that wants to do perf system calls should have to maintain JSON mappings for all platforms. That doesn't sound feasible to me, unless there is a smart way to do it? Maybe the mappings could be in libperf or something? But then that still requires everyone to add that as a dependency and keep it up to date. By that point you might as well just add them in the kernel and keep the existing interface.

James