Re: [PATCHv3 00/12] perf tools: Display eBPF code in intel_pt trace

From: Arnaldo Carvalho de Melo
Date: Thu May 30 2019 - 09:40:28 EST


Em Thu, May 30, 2019 at 08:57:09PM +0800, Leo Yan escreveu:
> Hi Jiri,
>
> On Thu, May 30, 2019 at 02:07:10PM +0200, Jiri Olsa wrote:
> > On Thu, May 30, 2019 at 06:54:39PM +0800, Leo Yan wrote:
> > > Hi Jiri,
> > >
> > > On Wed, May 08, 2019 at 03:19:58PM +0200, Jiri Olsa wrote:
> > > > hi,
> > > > this patchset adds dso support to read and display
> > > > bpf code in intel_pt trace output. I had to change
> > > > some of the kernel maps processing code, so hopefully
> > > > I did not break too many things ;-)
> > > >
> > > > It's now possible to see bpf code flow via:
> > > >
> > > > # perf-with-kcore record pt -e intel_pt//ku -- sleep 1
> > > > # perf-with-kcore script pt --insn-trace --xed
> > >
> > > This is very interesting work for me!
> > >
> > > I want to verify this feature with Arm CoreSight trace, I have one
> > > question so that I have more direction for the tesing:
> > >
> > > What's the bpf program you are suing for the testing? e.g. some
> > > testing program under the kernel's folder $kernel/samples/bpf?
> > > Or you uses perf command to launch bpf program?
> >
> > for this I was using tools/testing/selftests/bpf/test_verifier
> >
> > I isolated some tests and ran the perf on top of them, like:
> >
> > # perf-with-kcore record pt -e intel_pt//ku -- ./test_verifier ...
>
> Thanks a lot for sharing the info and quick responsing.
>
> I tried to use the program $kernel/samples/bpf/sampleip to verify this
> patch set, but seems eBPF dso is not contained properly; below is my
> detailed steps:
>
> # In the first tty
> # cd $kernel/samples/bpf/
> # ./sampleip -F 200 20 => sample ip with 200Hz for 20s
>
> # In the second tty
> # perf-with-kcore record arm_test -e cs_etm/@20070000.etr/uk -- sleep 1
>
> If I output DSO info with report command it give below info, which
> doesn't contain any info for eBPF DSO?
>
> # perf-with-kcore report arm_test -F,dso
>
> # Samples: 6M of event 'branches:ku'
> # Event count (approx.): 6340896
> #
> # Shared Object
> # ..................
> #
> [kernel.kallsyms]
> ld-2.28.so
> libc-2.28.so
> libpthread-2.28.so
> perf
> sleep
> [unknown]

One other way of testing this:

I used perf trace's use of BPF, using:

[root@quaco ~]# cat ~/.perfconfig
[llvm]
dump-obj = true
clang-opt = -g
[trace]
add_events = /home/acme/git/perf/tools/perf/examples/bpf/augmented_raw_syscalls.c
show_zeros = yes
show_duration = no
no_inherit = yes
show_timestamp = no
show_arg_names = no
args_alignment = 40
show_prefix = yes

For arm64 this needs fixing, tools/perf/examples/bpf/augmented_raw_syscalls.c
(its in the kernel sources) is still hard coded for x86_64 syscall numbers :-\

Use some other BPF workload, one that has its programs being hit so that
we get samples in it, in my case running:

# perf trace

For system wide strace like tracing, which hooks in the
raw_syscalls:sys_enter and raw_syscalls:sys_exit tracepoints, I got:

[root@quaco ~]# perf record -a sleep 2s
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 3.136 MB perf.data (32020 samples) ]
[root@quaco ~]# perf report --stdio -s dso | head -40
# To display the perf.data header info, please use --header/--header-only options.
#
#
# Total Lost Samples: 0
#
# Samples: 32K of event 'cycles'
# Event count (approx.): 6201422424
#
# Overhead Shared Object
# ........ ...................................
#
80.92% [kernel.vmlinux]
5.14% libxul.so
3.60% libc-2.28.so
1.88% sshd
1.18% trace
1.02% [nf_conntrack]
0.73% firefox
0.57% libcrypto.so.1.1.1b
0.50% libpthread-2.28.so
0.45% [r8152]
0.44% bpf_prog_2cf7fd321e9b084c_sys_enter
0.43% [JIT] tid 10475
0.39% bpf_prog_c1bd85c092d6e4aa_sys_exit
0.33% [unknown]
0.28% [JIT] tid 10033
0.21% [JIT] tid 7243
0.21% libglib-2.0.so.0.5800.3
0.14% [nf_nat]
0.13% [JIT] tid 9488
0.13% [vdso]
0.12% ld-2.28.so
0.09% [iptable_raw]
0.08% [kvm]
0.08% libpulsecore-12.2.so
0.07% [iptable_mangle]
0.06% libjansson.so.4.11.0
0.06% libpulsecommon-12.2.so
0.05% [nf_defrag_ipv4]
0.04% libasound.so.2.0.0
[root@quaco ~]#

So you see those bpf_prog_2cf7fd321e9b084c_sys_enter and
bpf_prog_c1bd85c092d6e4aa_sys_exit DSOs, those are the BPF proggies that
'perf trace' used clang to build and then hooked into the raw_syscalls
tracepoints.

If I furthermore ask for annotating the only symbol in that DSO, I get
source code intermixed with the BPF jitted code, make sure you use a
recent enough clang version, here I'm using, looking at the dumped .o
file (asked via ~/.perfconfig's llvm.dump-obj=y that has BTF info
generated by clang because we have also llvm.clang-opt=-g there):

[root@quaco ~]# file /home/acme/git/perf/tools/perf/examples/bpf/augmented_raw_syscalls.o
/home/acme/git/perf/tools/perf/examples/bpf/augmented_raw_syscalls.o: ELF 64-bit LSB relocatable, eBPF, version 1 (SYSV), with debug_info, not stripped
[root@quaco ~]# eu-readelf -winfo /home/acme/git/perf/tools/perf/examples/bpf/augmented_raw_syscalls.o | grep -w producer
producer (strp) "clang version 9.0.0 (https://git.llvm.org/git/clang.git/ 7906282d3afec5dfdc2b27943fd6c0309086c507) (https://git.llvm.org/git/llvm.git/ a1b5de1ff8ae8bc79dc8e86e1f82565229bd0500)"
[root@quaco ~]# readelf -SW /home/acme/git/perf/tools/perf/examples/bpf/augmented_raw_syscalls.o | grep -i BTF
[23] .BTF PROGBITS 0000000000000000 00229c 000c03 00 0 0 1
[24] .BTF.ext PROGBITS 0000000000000000 002e9f 000300 00 0 0 1
[25] .rel.BTF.ext REL 0000000000000000 004700 0002d0 10 31 24 8
[root@quaco ~]#

# perf annotate --stdio2 bpf_prog_2cf7fd321e9b084c_sys_enter
Samples: 107 of event 'cycles', 4000 Hz, Event count (approx.): 27440959, [percent: local period]
bpf_prog_2cf7fd321e9b084c_sys_enter() bpf_prog_2cf7fd321e9b084c_sys_enter
Percent int sys_enter(struct syscall_enter_args *args)
65.78 push %rbp

0.48 mov %rsp,%rbp
sub $0x30,%rsp
1.41 sub $0x28,%rbp
10.66 mov %rbx,0x0(%rbp)
2.63 mov %r13,0x8(%rbp)
mov %r14,0x10(%rbp)
0.67 mov %r15,0x18(%rbp)
0.96 xor %eax,%eax
mov %rax,0x20(%rbp)
mov %rdi,%rbx
xor %edi,%edi
int key = 0;
0.70 mov %edi,-0x8(%rbp)
mov %rbp,%rsi
int sys_enter(struct syscall_enter_args *args)
0.68 add $0xfffffffffffffff8,%rsi
augmented_args = bpf_map_lookup_elem(&augmented_filename_map, &key);
0.70 movabs $0xffff9c4bdf9d3800,%rdi

0.70 â callq *fffffffff98e80d5
mov %rax,%r13
mov $0x1,%eax
if (augmented_args == NULL)
4.20 cmp $0x0,%r13
â je 2b1
return bpf_get_current_pid_tgid();
â callq *fffffffff98e4ac5
0.70 mov %eax,-0x4(%rbp)
mov %rbp,%rsi
add $0xfffffffffffffffc,%rsi
return bpf_map_lookup_elem(pids, &pid) != NULL;
0.70 movabs $0xffff9c4f01d6a000,%rdi

0.67 â callq *fffffffff98e6085
cmp $0x0,%rax
â je 7d
add $0x38,%rax
1.80 7d: mov %rax,%rdi

xor %eax,%eax
if (pid_filter__has(&pids_filtered, getpid()))
cmp $0x0,%rdi
â jne 2b1
probe_read(&augmented_args->args, sizeof(augmented_args->args), args);
0.29 mov %r13,%rdi
mov $0x40,%esi
mov %rbx,%rdx
â callq *fffffffff98c4025
syscall = bpf_map_lookup_elem(&syscalls, &augmented_args->args.syscall_nr);
mov %r13,%r14
add $0x8,%r14
syscall = bpf_map_lookup_elem(&syscalls, &augmented_args->args.syscall_nr);
movabs $0xffff9c4efe8aa000,%rdi

mov %r14,%rsi
add $0xd0,%rdi
mov 0x0(%rsi),%eax
cmp $0x200,%rax
â jae cc
shl $0x3,%rax

add %rdi,%rax
â jmp ce
cc: xor %eax,%eax
ce: mov %rax,%rdi

xor %eax,%eax
if (syscall == NULL || !syscall->enabled)
cmp $0x0,%rdi
â je 2b1
if (syscall == NULL || !syscall->enabled)
0.91 movzbq 0x0(%rdi),%rdi


xor %eax,%eax
if (syscall == NULL || !syscall->enabled)
cmp $0x0,%rdi
â je 2b1
mov $0x40,%r8d
switch (augmented_args->args.syscall_nr) {
mov 0x0(%r14),%rdi
switch (augmented_args->args.syscall_nr) {
cmp $0xef,%rdi
â jg 188
mov %rdi,%rdx
add $0xffffffffffffff65,%rdx
cmp $0x2c,%rdx
â ja 11a
â jmpq 20a
11a: mov %rdi,%rsi
add $0xffffffffffffffb4,%rsi
cmp $0x3d,%rsi
â ja 14b
mov $0x1,%edx
push %rcx

mov %rsi,%rcx
shl %cl,%rdx

pop %rcx

movabs $0x2200000000057fd1,%rsi

and %rsi,%rdx
cmp $0x0,%rdx
â jne 24b
14b: cmp $0x3b,%rdi
â ja 297
mov $0x1,%esi
push %rcx

mov %rdi,%rcx
shl %cl,%rsi

pop %rcx

movabs $0x800000000200054,%rdx

and %rdx,%rsi
cmp $0x0,%rsi
â jne 24b
cmp $0x12,%rdi
â je 245
â jmpq 297
188: mov %rdi,%rdx
add $0xffffffffffffff10,%rdx
cmp $0x3f,%rdx
â ja 1cc
mov $0x1,%esi
push %rcx

mov %rdx,%rcx
shl %cl,%rsi

pop %rcx

movabs $0x800001003bfe4004,%rdx

mov %rsi,%rcx
and %rdx,%rcx
cmp $0x0,%rcx
â jne 245
and $0x4000303,%rsi
cmp $0x0,%rsi
â jne 24b
1cc: mov %rdi,%rsi
add $0xfffffffffffffec7,%rsi
cmp $0x13,%rsi
â ja 1fc
mov $0x1,%edx
push %rcx

mov %rsi,%rcx
shl %cl,%rdx

pop %rcx

and $0x80209,%rdx
cmp $0x0,%rdx
â jne 245
cmp $0x6,%rsi
â je 24b
1fc: cmp $0x1b1,%rdi
â je 245
â jmpq 297
20a: mov $0x1,%esi
push %rcx

mov %rdx,%rcx
shl %cl,%rsi

pop %rcx

movabs $0xdb600203141,%rdx

mov %rsi,%rcx
and %rdx,%rcx
cmp $0x0,%rcx
â jne 24b
movabs $0x104801000000,%rdx

and %rdx,%rsi
cmp $0x0,%rsi
â jne 245
â jmpq 11a
case SYS_OPENAT: filename_arg = (const void *)args->args[1];
245: mov 0x18(%rbx),%rdx
â jmp 24f
case SYS_OPEN: filename_arg = (const void *)args->args[0];
24b: mov 0x10(%rbx),%rdx
if (filename_arg != NULL) {
24f: cmp $0x0,%rdx
â je 297
xor %edi,%edi
augmented_args->filename.reserved = 0;
mov %edi,0x44(%r13)
augmented_args->filename.size = probe_read_str(&augmented_args->filename.value,
mov %r13,%rdi
add $0x48,%rdi
augmented_args->filename.size = probe_read_str(&augmented_args->filename.value,
mov $0x1000,%esi
â callq *fffffffff98c4135
augmented_args->filename.size = probe_read_str(&augmented_args->filename.value,
mov %eax,0x40(%r13)
mov $0x1048,%r8d
augmented_args->filename.size = probe_read_str(&augmented_args->filename.value,
mov %rax,%rdi
shl $0x20,%rdi

shr $0x20,%rdi

if (augmented_args->filename.size < sizeof(augmented_args->filename.value)) {
cmp $0xfff,%rdi
â ja 297
len -= sizeof(augmented_args->filename.value) - augmented_args->filename.size;
add $0x48,%rax
len &= sizeof(augmented_args->filename.value) - 1;
and $0xfff,%rax
mov %rax,%r8
return perf_event_output(args, &__augmented_syscalls__, BPF_F_CURRENT_CPU, augmented_args, len);
297: mov %rbx,%rdi
movabs $0xffff9c4bdf9d3a00,%rsi

mov $0xffffffff,%edx
mov %r13,%rcx
â callq *fffffffff98c4775
}
3.52 2b1: mov 0x0(%rbp),%rbx
mov 0x8(%rbp),%r13
mov 0x10(%rbp),%r14
mov 0x18(%rbp),%r15
add $0x28,%rbp
1.84 leaveq
â retq

Look at the perf.data file for the required records, included in Linux
v5.1:

[root@quaco ~]# perf report -D | egrep 'PERF_RECORD_(BPF|KSYMBOL)'
0 0 0x54f0 [0x50]: PERF_RECORD_KSYMBOL addr ffffffffc0240484 len 229 type 1 flags 0x0 name bpf_prog_7be49e3934a125ba
0 0 0x5540 [0x30]: PERF_RECORD_BPF_EVENT type 1, flags 0, id 13
0 0 0x5570 [0x50]: PERF_RECORD_KSYMBOL addr ffffffffc0283679 len 229 type 1 flags 0x0 name bpf_prog_2a142ef67aaad174
0 0 0x55c0 [0x30]: PERF_RECORD_BPF_EVENT type 1, flags 0, id 14
0 0 0x55f0 [0x50]: PERF_RECORD_KSYMBOL addr ffffffffc023cef7 len 229 type 1 flags 0x0 name bpf_prog_7be49e3934a125ba
0 0 0x5640 [0x30]: PERF_RECORD_BPF_EVENT type 1, flags 0, id 15
0 0 0x5670 [0x50]: PERF_RECORD_KSYMBOL addr ffffffffc0242837 len 229 type 1 flags 0x0 name bpf_prog_2a142ef67aaad174
0 0 0x56c0 [0x30]: PERF_RECORD_BPF_EVENT type 1, flags 0, id 16
0 0 0x56f0 [0x50]: PERF_RECORD_KSYMBOL addr ffffffffc046d593 len 229 type 1 flags 0x0 name bpf_prog_7be49e3934a125ba
0 0 0x5740 [0x30]: PERF_RECORD_BPF_EVENT type 1, flags 0, id 17
0 0 0x5770 [0x50]: PERF_RECORD_KSYMBOL addr ffffffffc046f397 len 229 type 1 flags 0x0 name bpf_prog_2a142ef67aaad174
0 0 0x57c0 [0x30]: PERF_RECORD_BPF_EVENT type 1, flags 0, id 18
0 0 0x57f0 [0x50]: PERF_RECORD_KSYMBOL addr ffffffffc0d98039 len 229 type 1 flags 0x0 name bpf_prog_7be49e3934a125ba
0 0 0x5840 [0x30]: PERF_RECORD_BPF_EVENT type 1, flags 0, id 21
0 0 0x5870 [0x50]: PERF_RECORD_KSYMBOL addr ffffffffc0d9ad98 len 229 type 1 flags 0x0 name bpf_prog_2a142ef67aaad174
0 0 0x58c0 [0x30]: PERF_RECORD_BPF_EVENT type 1, flags 0, id 22
0 0 0x58f0 [0x58]: PERF_RECORD_KSYMBOL addr ffffffffc092241b len 711 type 1 flags 0x0 name bpf_prog_2cf7fd321e9b084c_sys_enter
0 0 0x5948 [0x30]: PERF_RECORD_BPF_EVENT type 1, flags 0, id 286
0 0 0x5978 [0x58]: PERF_RECORD_KSYMBOL addr ffffffffc0b0d2d4 len 191 type 1 flags 0x0 name bpf_prog_c1bd85c092d6e4aa_sys_exit
0 0 0x59d0 [0x30]: PERF_RECORD_BPF_EVENT type 1, flags 0, id 287
[root@quaco ~]# perf evlist -v
cycles: size: 112, { sample_period, sample_freq }: 4000, sample_type: IP|TID|TIME|CPU|PERIOD, read_format: ID, disabled: 1, inherit: 1, mmap: 1, comm: 1, freq: 1, task: 1, precise_ip: 3, sample_id_all: 1, exclude_guest: 1, mmap2: 1, comm_exec: 1, ksymbol: 1, bpf_event: 1
[root@quaco ~]#

> > I had to add some small sleep before the test_verifier exit,
> > so the perf bpf thread could catch up and download the program
> > details before test_verifier exited.
>
> This seems to me for a 'real' eBPF program, do we also need extra sleep
> so that perf bpf can save dso properly?

See abovwe.

> BTW, I have another question: to display eBPF code, except this feature
> can be used by hardware tracing (e.g. intel_pt), it also can be used
> by other PMU events and timer events, right?

Yeah, again, see above.

> Thanks,
> Leo Yan

--

- Arnaldo