Re: [PATCH v19 11/11] perf: arm_pmuv3: Add support for the Branch Record Buffer Extension (BRBE)

From: Rob Herring
Date: Tue Feb 04 2025 - 10:08:53 EST


On Tue, Feb 4, 2025 at 6:03 AM James Clark <james.clark@xxxxxxxxxx> wrote:
>
>
>
> On 03/02/2025 5:58 pm, Rob Herring wrote:
> > On Mon, Feb 3, 2025 at 10:53 AM James Clark <james.clark@xxxxxxxxxx> wrote:
> >>
> >>
> >>
> >> On 03/02/2025 12:43 am, Rob Herring (Arm) wrote:
> >>> From: Anshuman Khandual <anshuman.khandual@xxxxxxx>
> >>>
> >>> The ARMv9.2 architecture introduces the optional Branch Record Buffer
> >>> Extension (BRBE), which records information about branches as they are
> >>> executed into set of branch record registers. BRBE is similar to x86's
> >>> Last Branch Record (LBR) and PowerPC's Branch History Rolling Buffer
> >>> (BHRB).

[...]

> >>> + /*
> >>> + * Require that the event filter and branch filter permissions match.
> >>> + *
> >>> + * The event and branch permissions can only mismatch if the user set
> >>> + * at least one of the privilege branch filters in PERF_SAMPLE_BRANCH_PLM_ALL.
> >>> + * Otherwise, the core will set the branch sample permissions in
> >>> + * perf_copy_attr().
> >>> + */
> >>> + if ((event->attr.exclude_user != !(branch_type & PERF_SAMPLE_BRANCH_USER)) ||
> >>> + (event->attr.exclude_kernel != !(branch_type & PERF_SAMPLE_BRANCH_KERNEL)) ||
> >>
> >> I don't think this one is right. By default perf_copy_attr() copies the
> >> exclude_ settings into the branch settings so this works, but if the
> >> user sets any _less_ permissive branch setting this fails. For example:
> >>
> >> # perf record -j any,u -- true
> >> Error:
> >> cycles:PH: PMU Hardware or event type doesn't support branch stack
> >> sampling.
> >>
> >> Here we want the default sampling permissions (exclude_kernel == 0,
> >> exclude_user == 0), but only user branch records, which doesn't match.
> >> It should be allowed because it doesn't include anything that we're not
> >> allowed to see.
> >
> > I know it is allowed(on x86), but why would we want that? If you do
> > something even more restricted:
> >
> > perf record -e cycles:k -j any,u -- true
> >
> > That's allowed on x86 and gives you samples with user addresses. But
> > all the events happened in the kernel. How does that make any sense?
> >
> > I suppose in your example, we could avoid attaching branch stack on
> > samples from the kernel. However, given how my example works, I'm
> > pretty sure that's not what x86 does.
> >
> > There's also combinations that are allowed, but record no samples.
> > Though I think that was with guest events. I've gone with reject
> > non-sense combinations as much as possible. We can easily remove those
> > restrictions later if needed. Changing the behavior later (for the
> > same configuration) wouldn't be good.
> >
> >
>
> Rejecting ones that produce no samples is fair enough, but my example
> does produce samples. To answer the question "why would we want that?",
> nothing major, but there are a few small reasons:
>
> * Perf includes both user and kernel by default, so the shortest
> command to only gather user branches doesn't work (-j any,u)
> * The test already checks for branch stack support like this, so old
> Perf test versions don't work

I would be more concerned about this one except that *we* wrote that
test. (I'm not sure why we wrote a new test rather than adapting
record_lbr.sh...)

> * You might only be optimising userspace, but still interested in the
> proportion of time spent or particular place in the kernel

How do you see that? It looks completely misleading to me. 'perf
report' seems to only list branch stack addresses in this case. There
doesn't seem to be any matching of the event address to branch stack
addresses.

> * Consistency with existing implementations and for people porting
> existing tools to Arm
> * It doesn't cost anything to support it (I think we just
> only check if exclude_* is set rather than !=)
> * Permissions checks should be handled by the core code so that
> they're consistent
> * What's the point of separate branch filters anyway if they always
> have to match the event filter?

IDK, I wish someone could tell me. I don't see the usecase for them
being mismatched.

In any case, I don't care too much one way or the other what we do
here. If everyone thinks we should relax this, then that's fine with
me.

> Some of these things could be fixed in Perf, but not in older versions.
> Even if we can't think of a real use case now, it doesn't sound like the
> driver should be so restrictive of an option that doesn't do any harm.
>
> >> This also makes the Perf branch test skip because it uses
> >> any,save_type,u to see if BRBE exists.
> >
> > Yes, I plan to update that if we keep this behavior.
> >
> >>> + (!is_kernel_in_hyp_mode() &&
> >>> + (event->attr.exclude_hv != !(branch_type & PERF_SAMPLE_BRANCH_HV))))
> >>> + return false;
> >>> +
> >>> + event->hw.branch_reg.config = branch_type_to_brbfcr(event->attr.branch_sample_type);
> >>> + event->hw.extra_reg.config = branch_type_to_brbcr(event->attr.branch_sample_type);
> >>> +
> >>> + return true;
> >>> +}
> >>> +
> >> [...]
> >>> +static const int brbe_type_to_perf_type_map[BRBINFx_EL1_TYPE_DEBUG_EXIT + 1][2] = {
> >>> + [BRBINFx_EL1_TYPE_DIRECT_UNCOND] = { PERF_BR_UNCOND, 0 },
> >>
> >> Does the second field go into 'new_type'? They all seem to be zero so
> >> I'm not sure why new_type isn't ignored instead of having it mapped.
> >
> > Well, left over from when all the Arm specific types were supported.
> > So yeah, that can be simplified.
> >
> >>> + [BRBINFx_EL1_TYPE_INDIRECT] = { PERF_BR_IND, 0 },
> >>> + [BRBINFx_EL1_TYPE_DIRECT_LINK] = { PERF_BR_CALL, 0 },
> >>> + [BRBINFx_EL1_TYPE_INDIRECT_LINK] = { PERF_BR_IND_CALL, 0 },
> >>> + [BRBINFx_EL1_TYPE_RET] = { PERF_BR_RET, 0 },
> >>> + [BRBINFx_EL1_TYPE_DIRECT_COND] = { PERF_BR_COND, 0 },
> >>> + [BRBINFx_EL1_TYPE_CALL] = { PERF_BR_CALL, 0 },
> >>> + [BRBINFx_EL1_TYPE_ERET] = { PERF_BR_ERET, 0 },
> >>> + [BRBINFx_EL1_TYPE_IRQ] = { PERF_BR_IRQ, 0 },
> >>
> >> How do ones that don't map to anything appear in Perf? For example
> >> BRBINFx_EL1_TYPE_TRAP is missing, and the test that was attached to the
> >> previous versions fails because it doesn't see the trap that jumps to
> >> the kernel, but it does still see the ERET back to userspace:
> >>
> >> [unknown]/trap_bench+0x20/-/-/-/0/ERET/-
> >>
> >> In older versions we'd also have BRBINFx_EL1_TYPE_TRAP mapping to
> >> PERF_BR_SYSCALL so you could see it go into the kernel before the return:
> >>
> >> trap_bench+0x1C/[unknown]/-/-/-/0/SYSCALL/-
> >> [unknown]/trap_bench+0x20/-/-/-/0/ERET/-
> >
> > My read of that was we should see a CALL in this case. Whether SVC
> > generates a TRAP or CALL depends on HFGITR_EL2.SVC_EL0 (table D18-2).
> > I assumed "SVC due to HFGITR_EL2.SVC_EL0" means when SVC_EL0 is set
> > (and set has additional conditions). We have SVC_EL0 cleared, so that
> > should be a CALL. Maybe the FVP has this wrong?
> >
>
> The test is doing this rather than a syscall:
>
> asm("mrs %0, ID_AA64ISAR0_EL1" : "=r" (val)); /* TRAP + ERET */
>
> So I think trap is right. Whether that should be mapped to SYSCALL or
> some other branch type I don't know, but the point is that it's missing now.

We aren't supporting any of the Arm specific traps/exceptions. One
reason is for consistency with x86 like you just argued for. The only
exception types supported are syscall and IRQ. Part of the issue is
there is no userspace control over enabling all the extra Arm ones.
There's no way to say enable all branches except debug, fault, etc.
exceptions. If we want to support these, I think there should be user
control over enabling them. But that can come later if there's any
demand for them.

Rob