Re: [RFC] perf: Dwarf cfi based user callchains

From: Frederic Weisbecker
Date: Wed Oct 20 2010 - 11:35:17 EST


On Wed, Oct 13, 2010 at 11:13:27AM -0400, Frank Ch. Eigler wrote:
>
> Frederic Weisbecker <fweisbec@xxxxxxxxx> writes:
>
> > [...]
> > This brings dwarf cfi based callchain for userspace apps that don't have
> > frame pointers.
>
> Interesting approach!
>
> > [...]
> > - it's slow. A first improvement to make it faster is to support binary
> > search from .eh_frame_hdr.
>
> In systemtap land, we did find a dramatic improvement from this too.


Yeah, linear walking on .eh_frame is not a good thing other than for
testing and debugging.



> Have you measured the cost of transcribing of potentially large chunks
> of the user stacks? We did not seriously evaluate this path, since we
> encounter megabyte+ stacks in larger userspace apps, and copying THAT
> out seemed absurd.


Actually that's quite affordable. And you don't need to dump all the
stack of a process on every samples, that would indeed be absurd.
A small chunk, starting from the stack pointer, is enough.

What is interesting is that you can play with several different
sizes of dump, the higher it is and the deeper you'll be able to
unwind, that also means more profiling overhead.

I can't measure significant throughput issues with hackbench
for example:


Normal run:

$ time ./hackbench 10
Time: 3.415

real 0m3.506s
user 0m0.257s
sys 0m6.519s


With perf record (default is cycles counter with 1000 HZ samples frequency):

$ time ./perf record ./hackbench 10
Time: 3.584
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.381 MB perf.data (~16654 samples) ]

real 0m3.748s
user 0m0.028s
sys 0m0.022s


With perf record + frame pointer based callchain capture

$ time ./perf record -g fp ./hackbench 10
Time: 3.666
[ perf record: Woken up 3 times to write data ]
[ perf record: Captured and wrote 1.426 MB perf.data (~62281 samples) ]

real 0m3.834s
user 0m0.033s
sys 0m0.046s

With perf record + 4096 bytes of stack dump in every sample


$ time ./perf record -g dwarf,4096 ./hackbench 10
Time: 3.697
[ perf record: Woken up 15 times to write data ]
[ perf record: Captured and wrote 5.156 MB perf.data (~225251 samples) ]

real 0m3.931s
user 0m0.026s
sys 0m0.135s

With perf record + 20000 bytes of stack dump in every sample

$ time ./perf record -g dwarf,20000 ./hackbench 10
Time: 3.847
[ perf record: Woken up 9 times to write data ]
[ perf record: Captured and wrote 13.349 MB perf.data (~583219 samples) ]

real 0m4.559s
user 0m0.028s
sys 0m0.329s


So there is no big differences. May be hackbench is not a good example
to highlight the possible impact though. And some tuning with higher
frequencies would make the difference more visible.

Perhaps the real impact is more on the amount to record, the data file
tends to grow quickly, obviously.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/