Re: [PATCH 0/5] [RFC] binary reading of ftrace ring buffers
From: Mathieu Desnoyers
Date: Wed Mar 04 2009 - 10:39:52 EST
* Ingo Molnar (mingo@xxxxxxx) wrote:
>
> * Steven Rostedt <rostedt@xxxxxxxxxxx> wrote:
>
> > RFC only, not for pulling, unless everyone is fine with these
> > :-)
> >
> > After telling the folks from Google that I had splice code
> > that I needed to get working, I finally got around to doing
> > it.
> >
> > Not only did I get the splice code working, but I also made a
> > perl script (might want to cover you eyes if you look at that
> > code) that can automagically create the output from the binary
> > files reading the format arguments in the
> > /debugfs/tracing/events/<subsys>/<event>/format file.
>
> Very cool stuff!
>
> The ftrace splice path allows the following tracing path: after
> we create a trace ringbuffer page in the kernel, we dont touch
> it _ever again_. We can shuffle it to disk via DMA or over the
> network via DMA without bringing it again into the CPU's caches.
>
> This feature allows low-overhead high-throughput tracing on
> unprecedented levels. There's no format string overhead, no
> instruction patching/trapping overhead - straight raw binary
> tracing with C syntax tracepoints and a zero-copy path to
> storage.
>
> Have you had any chance to measure tracing overhead and the
> maximum throghput we can reach with the ftrace splice code? I'd
> expect to see some really nice numbers.
>
"unprecented levels" -> LTTng has been using splice for about 5 months.
The only reason why I did not post the patchset which performs this is
because I am currently adapting a LTTng module to optionally format the
binary buffers to a text output. And yes, it provides very, very good
performance results.
Mathieu
> > >From previous patches, we have in include/trace/sched_event_types.h:
> >
> > #undef TRACE_SYSTEM
> > #define TRACE_SYSTEM sched
> >
> > TRACE_EVENT_FORMAT(sched_switch,
> > TPPROTO(struct rq *rq, struct task_struct *prev,
> > struct task_struct *next),
> > TPARGS(rq, prev, next),
> > TPFMT("task %s:%d ==> %s:%d",
> > prev->comm, prev->pid, next->comm, next->pid),
> > TRACE_STRUCT(
> > TRACE_FIELD(pid_t, prev_pid, prev->pid)
> > TRACE_FIELD(int, prev_prio, prev->prio)
> > TRACE_FIELD_SPECIAL(char next_comm[TASK_COMM_LEN],
> > next_comm,
> > TPCMD(memcpy(TRACE_ENTRY->next_comm,
> > next->comm,
> > TASK_COMM_LEN)))
> > TRACE_FIELD(pid_t, next_pid, next->pid)
> > TRACE_FIELD(int, next_prio, next->prio)
> > ),
> > TPRAWFMT("prev %d:%d ==> next %s:%d:%d")
> > );
> >
> > By making that, we automagically get this file:
> >
> > # cat /debug/tracing/events/sched/sched_switch/format
> > name: sched_switch
> > ID: 29
> > format:
> > field:unsigned char type; offset:0; size:1;
> > field:unsigned char flags; offset:1; size:1;
> > field:unsigned char preempt_count; offset:2; size:1;
> > field:int pid; offset:4; size:4;
> > field:int tgid; offset:8; size:4;
> >
> > field:pid_t prev_pid; offset:12; size:4;
> > field:int prev_prio; offset:16; size:4;
> > field special:char next_comm[TASK_COMM_LEN]; offset:20; size:16;
> > field:pid_t next_pid; offset:36; size:4;
> > field:int next_prio; offset:40; size:4;
> >
> > print fmt: "prev %d:%d ==> next %s:%d:%d"
> >
> >
> > Now with this patch set, we create a way to read the ftrace
> > ring buffers directly, as a binary page. Splice has been used
> > such that the user could splice the ring buffers without need
> > to copy the pages. The pages are taken from the ring buffers
> > and can be placed directly into files, without extra copies.
> >
> > # ls /debug/tracing/binary_buffers/
> > 0 1 2 3 4 5 6 7
> >
> > One can either just use the read/write to grab live data from
> > these buffers, or they could use splice. I have a simple file
> > that reads this buffers using splice. Note, it only runs on
> > one file, you can make multiple copies to run more. The ring
> > buffers in ftrace are per cpu and they are not dependent on
> > each other.
> >
> > Also, if there is no data in the buffer, it returns -EAGAIN.
> >
> > # find /debug/tracing/events -name 'type' | while read f; do
> > > echo raw > $f; done
> >
> > # find /debug/tracing/events -name 'enable' | while read f; do
> > > echo 1 > $f; done
> >
> > # ./splice /debug/tracing/binary_buffers/0 /tmp/buf-0
> >
> > Yes you can run multiple instances of this on different buffers.
> >
> > # ./rb-read.pl /tmp/buf-0
> >
> > produces:
> >
> > 0 [000] 7071.936459: (irq_handler_entry) irq 48
> > 0 [000] 7071.936462: (irq_handler_exit) irq 48 ret 1
> > 0 [000] 7071.988801: (sched_signal_send) sig: 14 task 0
> > 0 [000] 7071.988813: (sched_wakeup) task 0 success=1
> > 0 [000] 7071.988823: (sched_switch) prev 0:140 ==> next ntpd:6582272:0
> > 3303 [000] 7071.988916: (sched_switch) prev 0:120 ==> next swap:7497072:0
> > 0 [000] 7072.020370: (sched_wakeup) task 0 success=1
> > 0 [000] 7072.020407: (sched_switch) prev 0:140 ==> next even:808416116:0
> > 16 [000] 7071.936157: (sched_switch) prev 0:115 ==> next swap:7497072:0
> >
> > Notice the "(sched_switch)" lines.
> >
> > The splice C code:
> > http://www.kernel.org/pub/linux/kernel/people/rostedt/splice.c
> >
> > The perl script to parse:
> > http://www.kernel.org/pub/linux/kernel/people/rostedt/rb-read.pl
> >
> > And yes I know, they are both ugly :-p
>
> Any chance of merging them into a single perl script? I.e. the
> perl script would embedd the .c code and build it on every
> invocation (into /tmp), so that one can do single-tool tracing
> with no immediate binary components. Such self-sufficiency is
> very handy when probing systems in a minimally invasive way.
>
> And we could also have twice the ugliness for the same price.
>
> > The following patches are in:
> >
> > git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-2.6-trace.git
> >
> > branch: rfc/splice/tip/tracing/ftrace
>
> Looks like the only problem with the code is the -EFAULT
> handling bug it exposed in simple_read_from_buffer() - but that
> fix should go on a separate track IMO.
>
> So i've pulled it into tip:tracing - thanks Steve! We definitely
> want this in v2.6.30.
>
> Ingo
>
--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/