Re: [RFC 00/48] perf tools: Introduce data type profiling (v1)

From: Peter Zijlstra
Date: Thu Oct 12 2023 - 05:15:19 EST



W00t!! Finally! :-)

On Wed, Oct 11, 2023 at 08:50:23PM -0700, Namhyung Kim wrote:

> * How to use it
>
> To get precise memory access samples, users can use `perf mem record`
> command to utilize those events supported by their architecture. Intel
> machines would work best as they have dedicated memory access events but
> they would have a filter to ignore low latency loads like less than 30
> cycles (use --ldlat option to change the default value).
>
> # To get memory access samples in kernel for 1 second (on Intel)
> $ sudo perf mem record -a -K --ldlat=4 -- sleep 1

Fundamentally this should work with any precise (PEBS) event from the
MEM_* family as well, no? There's no real reason to rely on perf mem
for this.

> In perf report, it's just a matter of selecting new sort keys: 'type'
> and 'typeoff'. The 'type' shows name of the data type as a whole while
> 'typeoff' shows name of the field in the data type. I found it useful
> to use it with --hierarchy option to group relevant entries in the same
> level.
>
> $ sudo perf report -s type,typeoff --hierarchy --stdio
> ...
> #
> #    Overhead  Data Type / Data Type Offset
> # ...........  ............................
> #
>     23.95%     (stack operation)
>        23.95%     (stack operation) +0 (no field)
>     23.43%     (unknown)
>        23.43%     (unknown) +0 (no field)
>     10.30%     struct pcpu_hot
>         4.80%     struct pcpu_hot +0 (current_task)
>         3.53%     struct pcpu_hot +8 (preempt_count)
>         1.88%     struct pcpu_hot +12 (cpu_number)
>         0.07%     struct pcpu_hot +24 (top_of_stack)
>         0.01%     struct pcpu_hot +40 (softirq_pending)
>      4.25%     struct task_struct
>         1.48%     struct task_struct +2036 (rcu_read_lock_nesting)
>         0.53%     struct task_struct +2040 (rcu_read_unlock_special.b.blocked)
>         0.49%     struct task_struct +2936 (cred)
>         0.35%     struct task_struct +3144 (audit_context)
>         0.19%     struct task_struct +46 (flags)
>         0.17%     struct task_struct +972 (policy)
>         0.15%     struct task_struct +32 (stack)
>         0.15%     struct task_struct +8 (thread_info.syscall_work)
>         0.10%     struct task_struct +976 (nr_cpus_allowed)
>         0.09%     struct task_struct +2272 (mm)
> ...
>
> The (stack operation) and (unknown) have no type and field info. FYI,
> the stack operations are samples in PUSH, POP or RET instructions which
> save or restore registers from/to the stack. They are usually parts of
> function prologue and epilogue and have no type info. The next is the
> struct pcpu_hot and you can see the first field (current_task) at offset
> 0 was accessed mostly. It's listed in order of access frequency (not in
> offset) as you can see it in the task_struct.
>
> In perf annotate, new --data-type option was added to enable data
> field level annotation. Now it only shows number of samples for each
> field but we can improve it.
>
> $ sudo perf annotate --data-type
> Annotate type: 'struct pcpu_hot' in [kernel.kallsyms] (223 samples):
> ============================================================================
>     samples     offset       size  field
>         223          0         64  struct pcpu_hot {
>         223          0         64      union {
>         223          0         48          struct {
>          78          0          8              struct task_struct* current_task;
>          98          8          4              int preempt_count;
>          45         12          4              int cpu_number;
>           0         16          8              u64 call_depth;
>           1         24          8              long unsigned int top_of_stack;
>           0         32          8              void* hardirq_stack_ptr;
>           1         40          2              u16 softirq_pending;
>           0         42          1              bool hardirq_stack_inuse;
>                                            };
>         223          0         64          u8* pad;
>                                        };
>                                    };
> ...
>
> This shows each struct one by one and field-level access info in C-like
> style. The number of samples for the outer struct is a sum of number of
> samples in every field in the struct. In unions, each field is placed
> in the same offset so they will have the same number of samples.

This is excellent -- and pretty much what I've been asking for forever.
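
To make the accounting described above concrete, here is a minimal Python
sketch (not how perf implements it) that maps sample offsets to fields and
sums them into the struct total. The field layout is copied from the quoted
output; the helper names field_at() and count_samples() are made up for
illustration, and the offset list is synthesized to match the quoted counts.

```python
# Layout copied from the quoted perf annotate output: (offset, size, name).
PCPU_HOT_FIELDS = [
    (0, 8, "current_task"),
    (8, 4, "preempt_count"),
    (12, 4, "cpu_number"),
    (16, 8, "call_depth"),
    (24, 8, "top_of_stack"),
    (32, 8, "hardirq_stack_ptr"),
    (40, 2, "softirq_pending"),
    (42, 1, "hardirq_stack_inuse"),
]


def field_at(offset, fields=PCPU_HOT_FIELDS):
    """Return the name of the field covering `offset`, or None for padding."""
    for start, size, name in fields:
        if start <= offset < start + size:
            return name
    return None


def count_samples(offsets, fields=PCPU_HOT_FIELDS):
    """Count samples per field; the struct total is the sum over all fields."""
    counts = {name: 0 for _, _, name in fields}
    for off in offsets:
        name = field_at(off, fields)
        if name is not None:
            counts[name] += 1
    return counts, sum(counts.values())


# Offsets synthesized to match the per-field counts in the quoted output.
offsets = [0] * 78 + [8] * 98 + [12] * 45 + [24] + [40]
counts, total = count_samples(offsets)
print(total)  # 223
```

Since every member of a union starts at the same offset, each member would
simply inherit the count of the enclosing union, which is why 'pad' shows
all 223 samples.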

Would it be possible to have multiple sample columns, e.g.
MEM_LOAD_UOPS_RETIRED.L1_HIT and MEM_LOAD_UOPS_RETIRED.L1_MISS,
or even more (adding LLC hit and miss as well, etc.)?
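
To illustrate the multi-column idea, a rough Python sketch that pivots
(event, field) samples into one count column per event. Everything here is
hypothetical: pivot() is not a perf function, the short event labels stand
in for the full event names, and the samples are made up purely to
exercise the code.

```python
from collections import defaultdict


def pivot(samples, events):
    """Build {field: [count per event]} from (event, field) sample pairs."""
    column = {ev: i for i, ev in enumerate(events)}
    table = defaultdict(lambda: [0] * len(events))
    for event, field in samples:
        table[field][column[event]] += 1
    return dict(table)


# Short labels standing in for the L1 hit and L1 miss events.
EVENTS = ["L1_HIT", "L1_MISS"]

# Made-up samples, not measured data.
samples = (
    [("L1_HIT", "current_task")] * 3
    + [("L1_MISS", "current_task")]
    + [("L1_HIT", "preempt_count")] * 2
)

table = pivot(samples, EVENTS)
print("  ".join(f"{ev:>8}" for ev in EVENTS) + "  field")
for field, counts in table.items():
    print("  ".join(f"{c:>8}" for c in counts) + f"  {field}")
```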

(for bonus points: --data-type=typename, would be awesome)

Additionally, annotating the regular perf-annotate output with data-type
information (where we have it) might also be very useful. That way, even
when profiling with PEBS-cycles, an expensive memop immediately gives a
clue as to what data-type to look at.

> No TUI support yet.

Yeah, nobody needs that anyway :-)

> This can generate instructions like below.
>
> ...
> 0x123456: mov 0x18(%rdi), %rcx
> 0x12345a: mov 0x10(%rcx), %rax <=== sample
> 0x12345e: test %rax, %rax
> 0x123461: je <...>
> ...
>
> And imagine we have a sample at 0x12345a. Then it cannot find a
> variable for %rcx since DWARF didn't generate one (it only knows about
> 'bar'). Without compiler support, all it can do is to track the code
> execution in each instruction and propagate the type info in each
> register and stack location by following the memory access.

Right, this has more or less been the 'excuse' for why doing this has
been 'difficult' for the past 10+ years :/
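
For reference, the register tracking described above can be sketched in a
few lines of Python. This is a toy under loud assumptions: the type names
(struct foo, struct baz), their field layouts, and the premise that 'bar'
lives in %rdi are all invented for illustration, and only simple
"mov off(%base), %dest" loads are modeled. A real implementation would
also have to handle every instruction that clobbers a register, stack
slots, and control flow.

```python
import re

# Hypothetical type table: type name -> {offset: (field name, field type)}.
TYPES = {
    "struct foo": {0x18: ("baz", "struct baz *")},
    "struct baz": {0x10: ("qux", "long")},
}

# Matches AT&T-syntax loads such as "mov 0x18(%rdi), %rcx".
LOAD = re.compile(r"mov\s+(0x[0-9a-f]+)\((%\w+)\),\s*(%\w+)")


def propagate(insns, reg_types, sample_addr):
    """Walk instructions in order, tracking pointer types held in registers.

    Returns (struct name, field name) for the load at sample_addr, or None
    if the type of the base register is not known at that point.
    """
    regs = dict(reg_types)
    for addr, text in insns:
        m = LOAD.match(text)
        if not m:
            continue  # anything other than a simple load is ignored here
        offset, base, dest = int(m.group(1), 16), m.group(2), m.group(3)
        base_type = regs.get(base, "")
        struct = base_type[:-2] if base_type.endswith(" *") else None
        field = TYPES.get(struct, {}).get(offset)
        if addr == sample_addr:
            return (struct, field[0]) if field else None
        if field:
            regs[dest] = field[1]  # dest now holds the loaded field's value
        else:
            regs.pop(dest, None)   # unknown load: forget the old type
    return None


insns = [
    (0x123456, "mov 0x18(%rdi), %rcx"),
    (0x12345A, "mov 0x10(%rcx), %rax"),  # <=== sample
    (0x12345E, "test %rax, %rax"),
]

# Only 'bar' (assumed to be in %rdi) is known from DWARF.
result = propagate(insns, {"%rdi": "struct foo *"}, 0x12345A)
print(result)
```

Without the first instruction's type being carried into %rcx, the lookup
at the sampled address has nothing to go on, which is exactly the gap in
the DWARF info described above.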

> Actually I found a discussion in the DWARF mailing list to support
> "inverted location lists" and it seems a perfect fit for this project.
> It'd be great if new DWARF would provide a way to lookup variable and
> type info using a concrete location info (like a register number).
>
> https://lists.dwarfstd.org/pipermail/dwarf-discuss/2023-June/002278.html

Stephane was going to talk to tools people about this over 10 years ago
:-)

Thanks for *finally* getting this started!!