Re: [RFC] perf tool improvement requests

From: Stephane Eranian
Date: Tue Sep 04 2018 - 11:50:24 EST


Arnaldo,

On Tue, Sep 4, 2018 at 6:42 AM Arnaldo Carvalho de Melo <acme@xxxxxxxxxx> wrote:
>
> Em Tue, Sep 04, 2018 at 09:10:49AM +0200, Peter Zijlstra escreveu:
> > On Mon, Sep 03, 2018 at 07:45:48PM -0700, Stephane Eranian wrote:
> > > A few weeks ago, you had asked if I had more requests for the perf tool.
>
> > I have one long standing one; that is IP based data structure
> > annotation.
>
> > When we get an exact IP (using PEBS) and were sampling a data related
> > event (say L1 misses), we can get the data type from the instruction
> > itself; that is, through DWARF. We _know_ what type (structure::member)
> > is read/written to.
>
I have been asking the compiler people for this for a long time!
I don't think it is there yet. I'd like each load/store to be annotated
with a data type + offset within the type. It would allow data type
profiling. This would not be bulletproof, though, because of the
accessor function problem:

void incr(int *v) { (*v)++; }
struct foo { int a; int b; } bar;
incr(&bar.a);

Here the load/store in incr() would see an int pointer, not an int
inside struct foo at offset 0, which is what we want. There are
concerns about the volume of data this would generate. But my argument
is that it only goes into the debug info; it does not make the
stripped binary any bigger.
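
For illustration, the kind of per-instruction annotation I have in mind
would look roughly like this (the annotation format is made up, no
compiler emits it today; offsets assume a 4-byte int):

void incr(int *v);      /* the accessor from above */

void touch(struct foo *p)
{
        p->b++;         /* desired annotation: struct foo::b, offset 4, size 4 */
}

void indirect(void)
{
        incr(&bar.a);   /* inside incr() the compiler only sees 'int *',
                           so struct foo::a, offset 0 is lost */
}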


> > I would love to get that in a pahole style output.
>
Yes, me too!

> > Better yet, when you measure both hits and misses, you can get a
> > structure usage overview, and see what lines are used lots and what
> > members inside that line are rarely used. Ideal information for data
> > structure layout optimization.
>
> > 1000x more useful than that c2c crap.
>

c2c is about something else: NUMA issues and false sharing.
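
For contrast, here is a minimal sketch of the false-sharing pattern c2c
is meant to find (the struct, field names and iteration count are made
up; two threads hammer adjacent fields that share a cache line):

#include <pthread.h>

struct shared {
        long a;         /* written only by thread 0 */
        long b;         /* written only by thread 1 */
} s;

static void *bump_a(void *arg)
{
        (void)arg;
        for (long i = 0; i < 100000000; i++)
                __atomic_fetch_add(&s.a, 1, __ATOMIC_RELAXED);
        return NULL;
}

static void *bump_b(void *arg)
{
        (void)arg;
        for (long i = 0; i < 100000000; i++)
                __atomic_fetch_add(&s.b, 1, __ATOMIC_RELAXED);
        return NULL;
}

int main(void)
{
        pthread_t t0, t1;
        pthread_create(&t0, NULL, bump_a, NULL);
        pthread_create(&t1, NULL, bump_b, NULL);
        pthread_join(t0, NULL);
        pthread_join(t1, NULL);
        return 0;
}

The cache line bounces between cores even though the threads never
touch the same field; padding or splitting the struct fixes it. That is
a different problem from the per-member access profile Peter is asking
for.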

> > Can we please get that?
>
> So, use 'c2c record' to get the samples:
>
> [root@jouet ~]# perf c2c record
> ^C[ perf record: Woken up 1 times to write data ]
> [ perf record: Captured and wrote 5.152 MB perf.data (4555 samples) ]
>
> Events collected:
>
> [root@jouet ~]# perf evlist -v
> cpu/mem-loads,ldlat=30/P: type: 4, size: 112, config: 0x1cd, { sample_period, sample_freq }: 4000, sample_type: IP|TID|TIME|ADDR|ID|CPU|PERIOD|DATA_SRC|WEIGHT|PHYS_ADDR, read_format: ID, disabled: 1, inherit: 1, mmap: 1, comm: 1, freq: 1, task: 1, precise_ip: 3, mmap_data: 1, sample_id_all: 1, mmap2: 1, comm_exec: 1, { bp_addr, config1 }: 0x1f
> cpu/mem-stores/P: type: 4, size: 112, config: 0x82d0, { sample_period, sample_freq }: 4000, sample_type: IP|TID|TIME|ADDR|ID|CPU|PERIOD|DATA_SRC|WEIGHT|PHYS_ADDR, read_format: ID, disabled: 1, inherit: 1, freq: 1, precise_ip: 3, sample_id_all: 1
>
> Then we'll get a 'annotate --hits' option (just cooked up, will
> polish) that will show the name of the function, info about it globally,
> i.e. what annotate already produced, we may get this in CSV for better
> post processing consumption:
>
> [root@jouet ~]# perf annotate --hits kmem_cache_alloc
> Samples: 20 of event 'cpu/mem-loads,ldlat=30/P', 4000 Hz, Event count (approx.): 875, [percent: local period]
> kmem_cache_alloc() /usr/lib/debug/lib/modules/4.17.17-100.fc27.x86_64/vmlinux
> 4.91 15: mov gfp_allowed_mask,%ebx
> 2.51 51: mov (%r15),%r8
> 17.14 54: mov %gs:0x8(%r8),%rdx
> 6.51 61: cmpq $0x0,0x10(%r8)
> 17.14 66: mov (%r8),%r14
> 6.29 78: mov 0x20(%r15),%ebx
> 5.71 7c: mov (%r15),%rdi
> 29.49 85: xor 0x138(%r15),%rbx
> 2.86 9d: lea (%rdi),%rsi
> 3.43 d7: pop %rbx
> 2.29 dc: pop %r12
> 1.71 ed: testb $0x4,0xb(%rbp)
> [root@jouet ~]#
>
How does this relate to what Peter was asking? It says nothing about data types.

What I'd like is a true data type profiler showing you the most
accessed data types, and then an annotate mode showing which fields
inside those types are mostly read or written, with their sizes and
alignment. The goal is to improve layout based on the access patterns,
to minimize the number of cachelines moved.
You need DLA (data linear address) sampling on all loads and stores,
and then type annotation. As I said, I have prototyped this for
self-sampling programs but not in the perf tool. It is harder there
because you need type information and heap information. I think DWARF
is one way to go, assuming it is extended to support the right kind of
load/store annotations. Another way is to track allocations and
correlate them to data types.
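
To make the goal concrete, the kind of layout change such a profile
would drive looks like this (struct and field names are invented for
illustration):

/* before: hot counters interleaved with cold metadata, so the
   fast path touches two cache lines */
struct conn_before {
        char            name[48];       /* cold: written once at setup */
        unsigned long   rx_packets;     /* hot:  updated per packet    */
        char            owner[48];      /* cold: written once at setup */
        unsigned long   tx_packets;     /* hot:  updated per packet    */
};

/* after: hot members grouped so the fast path stays on one line */
struct conn_after {
        unsigned long   rx_packets;     /* hot  */
        unsigned long   tx_packets;     /* hot  */
        char            name[48];       /* cold */
        char            owner[48];      /* cold */
};

The per-member read/write counts, sizes and alignment from the profiler
are exactly what you need to decide on such a reordering.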



> Then I need to get the DW_AT_location stuff parsed in pahole, so
> that with those offsets (second column, ending with :) with hits (first
> column, there its local period, but we can ask for some specific metric
> [1]), I'll be able to figure out what DW_TAG_variable or
> DW_TAG_formal_parameter is living there at that time, get the offset
> from the decoded instruction, say that xor, 0x138 offset from the type
> for %r15 at that offset (85) from kmem_cache_alloc, right?
>
> In a first milestone we'd have something like:
>
> perf annotate --hits function | pahole --annotate -C task_struct
>
> perf annotate --hits | pahole --annotate
>
I don't want to combine tools. I'd like this to be built into perf.

> Would show all structs with hits, for all functions with hits.
>
> Other options would show which struct has more hits, etc.
>
> - Arnaldo
>
> [1]
>
> [root@jouet ~]# perf annotate -h local
>
> Usage: perf annotate [<options>]
>
> --percent-type <local-period>
> Set percent type local/global-period/hits
>
> [root@jouet ~]#
>
> - Arnaldo