Re: [PATCH V17 0/9] arm64/perf: Enable branch stack sampling

From: Mark Rutland
Date: Fri May 31 2024 - 09:01:36 EST


On Thu, May 30, 2024 at 06:41:14PM +0100, Mark Rutland wrote:
> On Thu, May 30, 2024 at 10:47:34AM +0100, James Clark wrote:
> > On 05/04/2024 03:46, Anshuman Khandual wrote:
> > > ------------------ Possible 'branch_sample_type' Mismatch -----------------
> > >
> > > Branch stack sampling attributes 'event->attr.branch_sample_type' generally
> > > remain the same for all the events during a perf record session.
> > >
> > > $ perf record -e <event_1> -e <event_2> -j <branch_filters> [workload]
> > >
> > > event_1->attr.branch_sample_type == event_2->attr.branch_sample_type
> > >
> > > This 'branch_sample_type' is used to configure the BRBE hardware, when both
> > > events i.e <event_1> and <event_2> get scheduled on a given PMU. But during
> > > PMU HW event's privilege filter inheritance, 'branch_sample_type' does not
> > > remain the same for all events. Let's consider the following example:
> > >
> > > $ perf record -e cycles:u -e instructions:k -j any,save_type ls
> > >
> > > cycles->attr.branch_sample_type != instructions->attr.branch_sample_type
> > >
> > > Because the cycles event inherits PERF_SAMPLE_BRANCH_USER and the
> > > instructions event inherits PERF_SAMPLE_BRANCH_KERNEL. The proposed
> > > solution here configures the BRBE hardware with the 'branch_sample_type'
> > > of the last event added to the PMU, and hence captured branch records
> > > only get passed on to matching events during a PMU interrupt.
> > >
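(For reference, the privilege-filter inheritance described above boils down
to something like the following sketch; the type and flag names here are
invented for illustration, not the actual kernel definitions.)

```c
/*
 * Illustrative sketch only: how a per-event privilege modifier
 * (cycles:u vs instructions:k) feeds into branch_sample_type when
 * the user gives no explicit u/k branch filter. Names are made up.
 */
#define BRANCH_SAMPLE_USER   (1u << 0)
#define BRANCH_SAMPLE_KERNEL (1u << 1)

struct sketch_event {
	unsigned int exclude_user   : 1;
	unsigned int exclude_kernel : 1;
	unsigned int branch_sample_type;
};

static void inherit_priv_filter(struct sketch_event *e)
{
	/* Inherit the event's own privilege level into the branch filter. */
	if (!e->exclude_user)
		e->branch_sample_type |= BRANCH_SAMPLE_USER;
	if (!e->exclude_kernel)
		e->branch_sample_type |= BRANCH_SAMPLE_KERNEL;
}
```

With that, a 'cycles:u' event ends up with only BRANCH_SAMPLE_USER set and
an 'instructions:k' event with only BRANCH_SAMPLE_KERNEL, hence the mismatch.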
> >
> > Hi Anshuman,
> >
> > Surely because of this example we should merge? At least we have to try
> > to make the most common basic command lines work. Unless we expect all
> > tools to know whether the branch buffer is shared between PMUs on each
> > architecture or not. The driver knows though, so can merge the settings
> > because it all has to go into one BRBE.
>
> The difficulty here is that these are opened as independent events (not
> in the same event group), and so from the driver's PoV, this is no
> different to two users independently doing:
>
> perf record -e event:u -j any,save_type -p ${SOME_PID}
>
> perf record -e event:k -j any,save_type -p ${SOME_PID}
>
> .. where either would be surprised to get the merged result.

I took a look at how x86 handles this, and it looks like they may have the
problem we'd like to avoid. AFAICT, intel_pmu_lbr_add() blats cpuc->br_sel with
the branch selection of the last event added.
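That "last add wins" pattern is easy to see in a stripped-down sketch (the
names below are invented for illustration, not the actual x86 code):

```c
/*
 * Sketch of a shared per-CPU branch filter selection where each
 * event add simply overwrites the previous value. Purely
 * illustrative; not the real intel_pmu_lbr_add().
 */
struct sketch_cpuc {
	unsigned int br_sel;	/* shared HW branch filter selection */
};

static void sketch_lbr_add(struct sketch_cpuc *cpuc, unsigned int event_br_sel)
{
	/* No merging: whichever event was added last wins. */
	cpuc->br_sel = event_br_sel;
}
```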

So I took a look at what happens on my x86-64 desktop running v5.10.0-9-amd64
from Debian 11.

Running the following program:

| int main(int argc, char *argv[])
| {
| 	for (;;) {
| 		asm volatile("" ::: "memory");
| 	}
|
| 	return 0;
| }

I set /proc/sys/kernel/perf_event_paranoid to 2 and started two independent
perf sessions:

perf record -e cycles:u -j any -o perf-user.data -p 1320224

sudo perf record -e cycles:k -j any -o perf-kernel.data -p 1320224

.. after ~10 seconds, I killed both sessions with ^C.

When I subsequently do 'perf report -i perf-kernel.data', I see:

| Samples: 295 of event 'cycles:k', Event count (approx.): 295
| Overhead Command Source Shared Object Source Symbol Target Symbol Basic Block Cycles
| 99.66% loop loop [k] main [k] main -
| 0.34% loop [kernel.kallsyms] [k] native_irq_return_iret [k] main -

.. where the user symbols are surprising.

Similarly for 'perf report -i perf-user.data', I see:

| Samples: 198K of event 'cycles:u', Event count (approx.): 198739
| Overhead Command Source Shared Object Source Symbol Target Symbol Basic Block Cycles
| 99.99% loop loop [.] main [.] main -
| 0.00% loop [unknown] [.] 0xffffffff87801007 [.] main -
| 0.00% loop [unknown] [.] 0xffffffff86e05626 [.] 0xffffffff86e05629 -
| 0.00% loop [unknown] [.] 0xffffffff86e0563d [.] 0xffffffff86e0c850 -
| 0.00% loop [unknown] [.] 0xffffffff86e0c86f [.] 0xffffffff86e6b3f0 -
| 0.00% loop [unknown] [.] 0xffffffff86e0c884 [.] 0xffffffff86e11ed0 -
| 0.00% loop [unknown] [.] 0xffffffff86e0c88a [.] 0xffffffff86e13850 -
| 0.00% loop [unknown] [.] 0xffffffff86e11eee [.] 0xffffffff86e0c889 -
| 0.00% loop [unknown] [.] 0xffffffff86e13885 [.] 0xffffffff86e13888 -
| 0.00% loop [unknown] [.] 0xffffffff86e13889 [.] 0xffffffff86e138a1 -
| 0.00% loop [unknown] [.] 0xffffffff86e138a9 [.] 0xffffffff86e6b320 -
| 0.00% loop [unknown] [.] 0xffffffff86e138c3 [.] 0xffffffff86e6b3f0 -
| 0.00% loop [unknown] [.] 0xffffffff86e6b33a [.] 0xffffffff86e138ae -
| 0.00% loop [unknown] [.] 0xffffffff86e6b3fb [.] 0xffffffff86e0c874 -
| 0.00% loop [unknown] [.] 0xffffffff86ff6c91 [.] 0xffffffff87a01ca0 -
| 0.00% loop [unknown] [.] 0xffffffff87a01ca0 [.] 0xffffffff87a01ca5 -
| 0.00% loop [unknown] [.] 0xffffffff87a01ca5 [.] 0xffffffff87a01cb1 -
| 0.00% loop [unknown] [.] 0xffffffff87a01cb5 [.] 0xffffffff86e05600 -

.. where the unknown (kernel!) samples are surprising.

Peter, do you have any opinion on this?

My thinking is that the "last scheduled event branch selection wins"
isn't the behaviour we actually want, and either:

(a) Conflicting events shouldn't be scheduled concurrently (e.g. treat
that like running out of counters).

(b) The HW filters should be configured to allow anything permitted by
any of the events, and SW filtering should remove the unexpected
records on a per-event basis.

.. but I imagine (b) may be hard? I don't know whether LBR tells you which
CPU mode the src/dst were in.
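Assuming the records do carry enough information to tell which mode each
branch came from, option (b) could look roughly like the sketch below (all
names invented; this is not a proposed implementation):

```c
/*
 * Sketch of option (b): program the HW with the union of all events'
 * branch filters, then drop non-matching records per event in
 * software at interrupt time. Illustrative only.
 */
#include <stdbool.h>

#define BR_SEL_USER   (1u << 0)
#define BR_SEL_KERNEL (1u << 1)

struct sketch_rec {
	bool from_kernel;	/* mode the branch source executed in */
};

/* HW filter: allow anything that any of the events asked for. */
static unsigned int union_br_sel(const unsigned int *sels, int n)
{
	unsigned int u = 0;

	for (int i = 0; i < n; i++)
		u |= sels[i];
	return u;
}

/* SW filter: per event, keep only the records it asked for. */
static bool event_wants_rec(unsigned int ev_sel, const struct sketch_rec *rec)
{
	return rec->from_kernel ? (ev_sel & BR_SEL_KERNEL)
				: (ev_sel & BR_SEL_USER);
}
```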

Mark.