Re: [PATCH 00/18] [ANNOUNCE] Dynamically created function based events

From: Steven Rostedt
Date: Sat Feb 03 2018 - 13:53:30 EST



I need to add a "Quick guide" and "Tips and tricks" section to the
document. To explain the arguments better...

Simple args are:

"int val", "unsigned int val", "char x", "unsigned long addr",

Also the types:

"s32 val", "u32 val", "s8 x", "u64 addr"

The above are all printed in decimal "%d" or "%u", if you want hex...

"x32 val", "x8 x", "x64 addr"

If you want to have it use "%pS" to print (symbols)

"symbol addr" is like: "%pS", (void *)addr

Arrays can be expressed after the type:

"x8[6] mac" is like: "%x,%x,%x,%x,%x,%x", mac[0], mac[1], mac[2],
mac[3], mac[4], mac[5]

Where mac would be: unsigned char mac[6] type.

Note, arrays of type "char" and "unsigned char" turn into a static
string.

"char[10] str" is like: "%s", str

Where str is defined as char str[10];

If the argument is a pointer to a structure, you can index into the
structure:

"x64 ip[16]" is like: "%llx", ((u64 *)ip)[16]

Finally, if an argument is a pointer to a structure, and you want to
get to another structure that it points to, for example

struct sk_buff *skb;

and you want to get to:

skb->dev->perm_addr

when the parameter is a pointer to skb.

(gdb) printf "%d\n", &((struct sk_buff *)0)->dev
16
(gdb) printf "%d\n", &((struct net_device *)0)->perm_addr
558

The net_device *dev is 16 bytes into sk_buff, and the char array
perm_addr, is 558 bytes into the net_device structure.

Where perm_addr is an array of the mac address.

"x8[6] perm_addr+16[0]+558"

Which would turn into:

char *dev = (char **)(((void *)skb)+16)[0];
char *perm_addr = (char *)(dev+558);

"%x,%x,%x,%x,%x,%x", perm_addr[0], perm_addr[1], perm_addr[2],
perm_addr[3], perm_addr[4], perm_addr[5]

OK, the above is a bit complex ;-) But works nicely.

-- Steve

On Fri, 02 Feb 2018 18:04:58 -0500
Steven Rostedt <rostedt@xxxxxxxxxxx> wrote:

> At Kernel Summit back in October, we tried to bring up trace markers, which
> would be nops within the kernel proper, that would allow modules to hook
> arbitrary trace events to them. The reaction to this proposal was less than
> favorable. We were told that we were trying to make a work around for a
> problem, and not solving it. The problem in our minds is the notion of a
> "stable trace event".
>
> There are maintainers that do not want trace events, or more trace events in
> their subsystems. This is due to the fact that trace events post an
> interface to user space, and this interface could become required by some
> tool. This may cause the trace event to become stable where it must not
> break the tool, and thus prevent the code from changing.
>
> Or, the trace event may just have to add padding for fields that tools
> may require. The "success" field of the sched_wakeup trace event is one such
> instance. There is no more "success" variable, but tools may fail if it were
> to go away, so a "1" is simply added to the trace event wasting ring buffer
> real estate.
>
> I talked with Linus about this, and he told me that we already have these
> markers in the kernel. They are from the mcount/__fentry__ used by function
> tracing. Have the trace events be created by these, and see if this will
> satisfy most areas that want trace events.
>
> I decided to implement this idea, and here's the patch set.
>
> Introducing "function based events". These are created dynamically by a
> tracefs file called "function_events". By writing a pseudo prototype into
> this file, you create an event.
>
> # mount -t tracefs nodev /sys/kernel/tracing
> # cd /sys/kernel/tracing
> # echo 'do_IRQ(symbol ip[16] | x64[6] irq_stack[16])' > function_events
> # cat events/functions/do_IRQ/format
> name: do_IRQ
> ID: 1399
> format:
> field:unsigned short common_type; offset:0; size:2; signed:0;
> field:unsigned char common_flags; offset:2; size:1; signed:0;
> field:unsigned char common_preempt_count; offset:3; size:1; signed:0;
> field:int common_pid; offset:4; size:4; signed:1;
>
> field:unsigned long __parent_ip; offset:8; size:8; signed:0;
> field:unsigned long __ip; offset:16; size:8; signed:0;
> field:symbol ip; offset:24; size:8; signed:0;
> field:x64 irq_stack[6]; offset:32; size:48; signed:0;
>
> print fmt: "%pS->%pS(ip=%pS, irq_stack=%llx:%llx:%llx:%llx:%llx:%llx)", REC->__ip, REC->__parent_ip,
> REC->ip, REC->irq_stack[0], REC->irq_stack[1], REC->irq_stack[2], REC->irq_stack[3], REC->irq_stack[4],
> REC->irq_stack[5]
>
> # echo 1 > events/functions/do_IRQ/enable
> # cat trace
> <idle>-0 [003] d..3 3647.049344: ret_from_intr->do_IRQ(ip=cpuidle_enter_state+0xb1/0x330, irq_stack=ffffffff81665db1,10,246,ffffc900006c3e80,18,ffff88011eae9b40)
> <idle>-0 [003] d..3 3647.049433: ret_from_intr->do_IRQ(ip=cpuidle_enter_state+0xb1/0x330, irq_stack=ffffffff81665db1,10,246,ffffc900006c3e80,18,ffff88011eae9b40)
> <idle>-0 [003] d..3 3647.049672: ret_from_intr->do_IRQ(ip=cpuidle_enter_state+0xb1/0x330, irq_stack=ffffffff81665db1,10,246,ffffc900006c3e80,18,ffff88011eae9b40)
> <idle>-0 [003] d..3 3647.325709: ret_from_intr->do_IRQ(ip=cpuidle_enter_state+0xb1/0x330, irq_stack=ffffffff81665db1,10,246,ffffc900006c3e80,18,ffff88011eae9b40)
> <idle>-0 [003] d..3 3647.325929: ret_from_intr->do_IRQ(ip=cpuidle_enter_state+0xb1/0x330, irq_stack=ffffffff81665db1,10,246,ffffc900006c3e80,18,ffff88011eae9b40)
> <idle>-0 [003] d..3 3647.325993: ret_from_intr->do_IRQ(ip=cpuidle_enter_state+0xb1/0x330, irq_stack=ffffffff81665db1,10,246,ffffc900006c3e80,18,ffff88011eae9b40)
> <idle>-0 [003] d..3 3647.387571: ret_from_intr->do_IRQ(ip=cpuidle_enter_state+0xb1/0x330, irq_stack=ffffffff81665db1,10,246,ffffc900006c3e80,18,ffff88011eae9b40)
> <idle>-0 [003] d..3 3647.387791: ret_from_intr->do_IRQ(ip=cpuidle_enter_state+0xb1/0x330, irq_stack=ffffffff81665db1,10,246,ffffc900006c3e80,18,ffff88011eae9b40)
> <idle>-0 [003] d..3 3647.387874: ret_from_intr->do_IRQ(ip=cpuidle_enter_state+0xb1/0x330, irq_stack=ffffffff81665db1,10,246,ffffc900006c3e80,18,ffff88011eae9b40)
>
> And this is much more powerful than just this. We can show strings, and
> index off of structures into other structures.
>
> # echo '__vfs_read(symbol read+40[0]+16)' > function_events
>
> # echo 1 > events/functions/__vfs_read/enable
> # cat trace
> sshd-1343 [005] ...2 199.734752: vfs_read->__vfs_read(read=tty_read+0x0/0xf0)
> bash-1344 [003] ...2 199.734822: vfs_read->__vfs_read(read=tty_read+0x0/0xf0)
> sshd-1343 [005] ...2 199.734835: vfs_read->__vfs_read(read=tty_read+0x0/0xf0)
> avahi-daemon-910 [003] ...2 200.136740: vfs_read->__vfs_read(read= (null))
> avahi-daemon-910 [003] ...2 200.136750: vfs_read->__vfs_read(read= (null))
>
> And even read user space:
>
> # echo 'SyS_openat(int dfd, string path, x32 flags, x16 mode)' > function_events
> # echo 1 > events/functions/enable
> # grep task_fork /proc/kallsyms
> ffffffff810d5a60 t task_fork_fair
> ffffffff810dfc30 t task_fork_dl
> # cat trace
> grep-1820 [000] ...2 3926.107603: entry_SYSCALL_64_fastpath->SyS_openat(dfd=-100, path=/proc/kallsyms, flags=100, mode=0)
>
> These are fully functional events. That is, they work with ftrace,
> trace-cmd, perf, histograms, triggers, and eBPF.
>
> What's next? I need to rewrite the function graph tracer, and be able to add
> dynamic events on function return.
>
> I made this work with x86 with a simple function that only returns
> 6 function parameters for x86_64 and 3 for x86_32. But this could easily
> be extended.
>
> Cheers!
>
> git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace.git
> ftrace/dynamic-ftrace-events
>
> Head SHA1: 30fbdffd5d38bd27b04fb911f7158f10a99be8c4
>
>
> Steven Rostedt (VMware) (18):
> tracing: Add function based events
> tracing: Add documentation for function based events
> tracing: Add simple arguments to function based events
> tracing/x86: Add arch_get_func_args() function
> tracing: Add hex print for dynamic ftrace based events
> tracing: Add indirect offset to args of ftrace based events
> tracing: Add dereferencing multiple fields per arg
> tracing: Add "unsigned" to function based events
> tracing: Add indexing of arguments for function based events
> tracing: Make func_type enums for easier comparing of arg types
> tracing: Add symbol type to function based events
> tracing: Add accessing direct address from function based events
> tracing: Add array type to function based events
> tracing: Have char arrays be strings for function based events
> tracing: Add string type for dynamic strings in function based events
> tracing: Add NULL to skip args for function based events
> tracing: Add indirect to indirect access for function based events
> tracing/perf: Allow perf to use function based events
>
> ----
> Documentation/trace/function-based-events.rst | 426 ++++++++
> arch/x86/kernel/ftrace.c | 28 +
> include/linux/trace_events.h | 2 +
> kernel/trace/Kconfig | 12 +
> kernel/trace/Makefile | 1 +
> kernel/trace/trace.h | 11 +
> kernel/trace/trace_event_ftrace.c | 1440 +++++++++++++++++++++++++
> kernel/trace/trace_probe.h | 11 -
> 8 files changed, 1920 insertions(+), 11 deletions(-)
> create mode 100644 Documentation/trace/function-based-events.rst
> create mode 100644 kernel/trace/trace_event_ftrace.c
> --
> To unsubscribe from this list: send the line "unsubscribe linux-trace-users" in
> the body of a message to majordomo@xxxxxxxxxxxxxxx
> More majordomo info at http://vger.kernel.org/majordomo-info.html