Re: [RFC] perf tool improvement requests

From: Arnaldo Carvalho de Melo
Date: Tue Sep 04 2018 - 09:42:30 EST


On Tue, Sep 04, 2018 at 09:10:49AM +0200, Peter Zijlstra wrote:
> On Mon, Sep 03, 2018 at 07:45:48PM -0700, Stephane Eranian wrote:
> > A few weeks ago, you had asked if I had more requests for the perf tool.

> I have one long standing one; that is IP based data structure
> annotation.

> When we get an exact IP (using PEBS) and were sampling a data related
> event (say L1 misses), we can get the data type from the instruction
> itself; that is, through DWARF. We _know_ what type (structure::member)
> is read/written to.

> I would love to get that in a pahole style output.

> Better yet, when you measure both hits and misses, you can get a
> structure usage overview, and see what lines are used lots and what
> members inside that line are rarely used. Ideal information for data
> structure layout optimization.

> 1000x more useful than that c2c crap.

> Can we please get that?

So, use 'c2c record' to get the samples:

[root@jouet ~]# perf c2c record
^C[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 5.152 MB perf.data (4555 samples) ]

Events collected:

[root@jouet ~]# perf evlist -v
cpu/mem-loads,ldlat=30/P: type: 4, size: 112, config: 0x1cd, { sample_period, sample_freq }: 4000, sample_type: IP|TID|TIME|ADDR|ID|CPU|PERIOD|DATA_SRC|WEIGHT|PHYS_ADDR, read_format: ID, disabled: 1, inherit: 1, mmap: 1, comm: 1, freq: 1, task: 1, precise_ip: 3, mmap_data: 1, sample_id_all: 1, mmap2: 1, comm_exec: 1, { bp_addr, config1 }: 0x1f
cpu/mem-stores/P: type: 4, size: 112, config: 0x82d0, { sample_period, sample_freq }: 4000, sample_type: IP|TID|TIME|ADDR|ID|CPU|PERIOD|DATA_SRC|WEIGHT|PHYS_ADDR, read_format: ID, disabled: 1, inherit: 1, freq: 1, precise_ip: 3, sample_id_all: 1

Then we'll get an 'annotate --hits' option (just cooked up, will
polish) that shows the name of the function plus the global info about
it that annotate already produces, followed by only the lines that have
hits. We may also emit this as CSV to make post-processing easier:

[root@jouet ~]# perf annotate --hits kmem_cache_alloc
Samples: 20 of event 'cpu/mem-loads,ldlat=30/P', 4000 Hz, Event count (approx.): 875, [percent: local period]
kmem_cache_alloc() /usr/lib/debug/lib/modules/4.17.17-100.fc27.x86_64/vmlinux
4.91 15: mov gfp_allowed_mask,%ebx
2.51 51: mov (%r15),%r8
17.14 54: mov %gs:0x8(%r8),%rdx
6.51 61: cmpq $0x0,0x10(%r8)
17.14 66: mov (%r8),%r14
6.29 78: mov 0x20(%r15),%ebx
5.71 7c: mov (%r15),%rdi
29.49 85: xor 0x138(%r15),%rbx
2.86 9d: lea (%rdi),%rsi
3.43 d7: pop %rbx
2.29 dc: pop %r12
1.71 ed: testb $0x4,0xb(%rbp)
[root@jouet ~]#
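
For the post-processing side, here is a rough sketch (Python, not part
of perf) of turning the data lines above into records and then CSV. The
regex only mirrors the output as pasted here, which may well change once
--hits gets polished; Hit, parse_hit and to_csv are names made up for
this example.

```python
import csv
import io
import re
from typing import List, NamedTuple, Optional


class Hit(NamedTuple):
    percent: float  # first column: share of the local period
    offset: int     # second column: hex offset from the function start
    insn: str       # rest of the line: the disassembled instruction


# A data line looks like: "29.49 85: xor 0x138(%r15),%rbx"
HIT_LINE = re.compile(r"^\s*(\d+(?:\.\d+)?)\s+([0-9a-fA-F]+):\s+(.*\S)\s*$")


def parse_hit(line: str) -> Optional[Hit]:
    """Return a Hit for a data line, or None for headers and prompts."""
    m = HIT_LINE.match(line)
    if m is None:
        return None
    return Hit(float(m.group(1)), int(m.group(2), 16), m.group(3))


def to_csv(hits: List[Hit]) -> str:
    """Render hits as CSV; csv.writer quotes instructions containing commas."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["percent", "offset", "insn"])
    for h in hits:
        writer.writerow([h.percent, hex(h.offset), h.insn])
    return buf.getvalue()


sample = """\
kmem_cache_alloc() /usr/lib/debug/lib/modules/4.17.17-100.fc27.x86_64/vmlinux
17.14 54: mov %gs:0x8(%r8),%rdx
29.49 85: xor 0x138(%r15),%rbx
"""
hits = [h for h in map(parse_hit, sample.splitlines()) if h is not None]
print(to_csv(hits), end="")
```

The instruction text has commas in it, so the CSV writer has to quote
that field; a naive join on ',' would break the columns.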

Then I need to get the DW_AT_location stuff parsed in pahole. With
those offsets (second column, ending with ':') and their hits (first
column; there it's local period, but we can ask for some other metric
[1]), I'll be able to figure out which DW_TAG_variable or
DW_TAG_formal_parameter is living in a register at that point, and take
the member offset from the decoded instruction. Say, for that xor at
offset 0x85 from kmem_cache_alloc: it is offset 0x138 into the type
that %r15 points to there, right?
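
To make the "offset from the decoded instruction" part concrete, a
minimal sketch in Python of pulling the base register and displacement
out of the AT&T-syntax text that annotate prints. This is only an
illustration: MemOperand and decode_mem_operand are made-up names, a
real implementation would more likely use a proper disassembler, and
index/scale operands are not handled.

```python
import re
from typing import NamedTuple, Optional


class MemOperand(NamedTuple):
    segment: Optional[str]  # e.g. "gs" for a segment-relative access, else None
    base: str               # base register name without the '%'
    displacement: int       # byte offset added to the base register


# AT&T syntax: optional "%seg:" prefix, optional displacement, then "(%base)".
# Forms with index and scale, such as 0x8(%rax,%rcx,4), do not match.
MEM_OPERAND = re.compile(r"(?:%(\w+):)?(-?(?:0x[0-9a-fA-F]+|\d+))?\(%(\w+)\)")


def parse_displacement(text: Optional[str]) -> int:
    """Convert "0x138", "-0x10" or "16" to an int; a missing value means 0."""
    if not text:
        return 0
    return int(text, 16) if "0x" in text.lower() else int(text, 10)


def decode_mem_operand(insn: str) -> Optional[MemOperand]:
    """Return the first base+displacement operand in insn, or None."""
    m = MEM_OPERAND.search(insn)
    if m is None:
        return None
    return MemOperand(m.group(1), m.group(3), parse_displacement(m.group(2)))


for insn in ("xor 0x138(%r15),%rbx",
             "mov %gs:0x8(%r8),%rdx",
             "mov (%r15),%r8",
             "pop %rbx"):
    print(insn, "->", decode_mem_operand(insn))
```

With (base register, displacement) in hand, the DWARF location info for
that IP would say which variable or parameter sits in the base
register, and the displacement then picks the member inside its type.
Note that something like lea also matches here although it does not
touch memory, so the mnemonic would need checking too.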

In a first milestone we'd have something like:

perf annotate --hits function | pahole --annotate -C task_struct

perf annotate --hits | pahole --annotate

Would show all structs with hits, for all functions with hits.

Other options would show which struct has more hits, etc.
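
What the pahole side would do with those hits could look roughly like
the sketch below (Python, illustration only). The struct layout, the
member names and the hit values are all invented for the example, and
the 64-byte cacheline is an assumption; the real thing would take the
layout from DWARF, as pahole already does.

```python
from bisect import bisect_right
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple, Optional, Tuple

CACHELINE = 64  # assumed cacheline size in bytes


class Member(NamedTuple):
    offset: int  # byte offset inside the struct
    size: int    # size in bytes
    name: str


# Hypothetical layout, sorted by offset; not a real kernel struct.
LAYOUT: List[Member] = [
    Member(0, 8, "first"),
    Member(8, 4, "flags"),     # followed by a 4-byte hole
    Member(16, 8, "next"),
    Member(64, 8, "counter"),  # starts the second cacheline
]


def find_member(layout: List[Member], offset: int) -> Optional[Member]:
    """Return the member covering offset, or None for holes and padding."""
    starts = [m.offset for m in layout]
    i = bisect_right(starts, offset) - 1
    if i < 0:
        return None
    m = layout[i]
    return m if offset < m.offset + m.size else None


def accumulate(layout: List[Member],
               hits: Iterable[Tuple[int, float]]) -> Dict[str, float]:
    """Sum (struct offset, percent) pairs per member name."""
    totals: Dict[str, float] = defaultdict(float)
    for offset, percent in hits:
        m = find_member(layout, offset)
        if m is not None:
            totals[m.name] += percent
    return dict(totals)


# Made-up (struct offset, percent) pairs standing in for decoded samples.
totals = accumulate(LAYOUT, [(0, 10.0), (4, 5.0), (12, 1.0), (64, 30.0)])
for m in LAYOUT:
    line = m.offset // CACHELINE
    print(f"{totals.get(m.name, 0.0):6.2f}  {m.name:<8} "
          f"/* offset {m.offset:3d} size {m.size:2d} cacheline {line} */")
```

Grouping the same totals by cacheline instead of by member would give
the "which lines are used lots" view Peter asked for.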

- Arnaldo

[1]

[root@jouet ~]# perf annotate -h local

Usage: perf annotate [<options>]

--percent-type <local-period>
Set percent type local/global-period/hits

[root@jouet ~]#
