Re: [PATCH 34/37] perf hists browser: Support flat callchains

From: Arnaldo Carvalho de Melo
Date: Tue Nov 24 2015 - 21:10:39 EST

Next message: Finn Thain: "Re: [PATCH 00/71] More fixes, cleanup and modernization for NCR5380 drivers"
Previous message: yamada.masahiro: "RE: [PATCH] ARM: use "depends on" for SoC configs instead of "if" after prompt"
In reply to: Arnaldo Carvalho de Melo: "Re: [PATCH 34/37] perf hists browser: Support flat callchains"
Next in thread: Namhyung Kim: "Re: [PATCH 34/37] perf hists browser: Support flat callchains"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

Em Tue, Nov 24, 2015 at 10:34:18PM -0300, Arnaldo Carvalho de Melo escreveu:
> Em Wed, Nov 25, 2015 at 10:26:08AM +0900, Namhyung Kim escreveu:
> > On Tue, Nov 24, 2015 at 12:45:51PM -0200, Arnaldo Carvalho de Melo wrote:
> > > Em Tue, Nov 24, 2015 at 02:27:08PM +0900, Namhyung Kim escreveu:
> > > > On Mon, Nov 23, 2015 at 04:16:48PM +0100, Frederic Weisbecker wrote:
> > > > > On Thu, Nov 19, 2015 at 02:53:20PM -0300, Arnaldo Carvalho de Melo wrote:
> > > > > > From: Namhyung Kim <namhyung@xxxxxxxxxx>
> > > > > [...]
> > > > Thus I simply copied callchain lists in parents to leaf nodes. Yes,
> > > > it will consume some memory but can simplify the code.
> > >
> > > I haven't done any measuring, but I'm noticing that 'perf top -g' is
> > > showing more warnings about not being able to process events fast enough
> > > and so ends up losing events, I tried with --max-stack 16 and it helped,
> > > this is just a heads up.
> >
> > OK, but it seems that it's not related to this patch since this patch
> > only affects flat or folded callchain mode.
>
> Well, doesn't this patch makes some of the involved data structures
> larger, thus putting more pressure on the L1 cache, etc? It may well be
> related, but we need to measure.
>
> > > Perhaps my workstation workloads are gettning deeper callchains over
> > > time, but perhaps this is the cost of processing callchains that is
> > > increasing, I need to stop and try to quantify this.
> > >
> > > We really need to look at reducing the overhead of processing
> > > callchains.
> >
> > Right, but with my multi-thread work, I realized that perf is getting
> > heavier recently. I guess it's mostly due to the atomic refcount
> > work. I need to get back to the multi-thread work..
>
> We really need to measure this ;-)

So, something strange, if I use:

[acme@zoo linux]$ cat ~/bin/allmod
rm -rf ../build/allmodconfig/ ; mkdir ../build/allmodconfig/ ; make O=../build/allmodconfig/ allmodconfig ; make -j32 O=../build/allmodconfig
[acme@zoo linux]$

To generate background load, I don't see that much this:

+ 8.55% 8.49% libc-2ââWarning!âââââââââââââââââââââââââââââââââââââââââââââââ â
+ 7.08% 6.98% perf âEvents are being lost, check IO/CPU overload! â â
+ 6.84% 0.04% perf â â â
+ 6.01% 0.09% perf âYou may want to run 'perf' using a RT scheduler policy:â â
+ 5.26% 0.13% [kerneâ âtâ
+ 4.96% 1.50% perf â perf top -r 80 â â
+ 4.76% 3.58% perf â â â
+ 4.69% 0.05% perf âOr reduce the sampling frequency. â â

Its with a low loadavg:

[acme@zoo linux]$ cat /proc/loadavg
0.75 0.79 0.64 3/549 21259

That it pops up :-\

If I take a snapshot with 'P'

# head -40 perf.hist.0
+ 21.43% 21.09% libc-2.20.so [.] _int_malloc
+ 19.49% 0.00% perf [.] cmd_top
+ 19.46% 0.02% perf [.] perf_top__mmap_read_idx
+ 19.03% 0.06% perf [.] hist_entry_iter__add
+ 16.46% 1.85% perf [.] iter_add_next_cumulative_entry
+ 12.09% 11.98% libc-2.20.so [.] free
+ 10.68% 10.61% libc-2.20.so [.] __libc_calloc
+ 9.61% 0.09% perf [.] hists__decay_entries
+ 8.92% 8.85% libc-2.20.so [.] malloc_consolidate
+ 8.17% 6.33% perf [.] free_callchain_node
+ 7.94% 0.09% perf [.] hist_entry__delete
+ 6.22% 0.03% perf [.] callchain_append
+ 6.20% 6.11% perf [.] append_chain_children
+ 5.44% 1.50% perf [.] __hists__add_entry
+ 4.34% 0.14% [kernel] [k] entry_SYSCALL_64_fastpath
+ 3.69% 3.67% perf [.] sort__dso_cmp
+ 3.12% 0.20% perf [.] hists__output_resort
+ 2.88% 0.00% [unknown] [.] 0x6d86258d4c544155
+ 2.88% 0.00% libc-2.20.so [.] __libc_start_main
+ 2.88% 0.00% perf [.] main
+ 2.88% 0.00% perf [.] run_builtin
+ 2.66% 0.00% libpthread-2.20.so [.] start_thread
+ 2.66% 0.00% perf [.] display_thread_tui
+ 2.66% 0.00% perf [.] perf_evlist__tui_browse_hists
+ 2.66% 0.00% perf [.] perf_evsel__hists_browse
+ 2.49% 0.07% [kernel] [k] sys_futex
+ 2.42% 0.06% [kernel] [k] do_futex
2.24% 0.00% perf [.] perf_top__sort_new_samples
+ 1.92% 0.51% perf [.] hists__collapse_resort
+ 1.87% 1.86% perf [.] hpp__sort_overhead_acc
+ 1.69% 0.09% libc-2.20.so [.] __lll_unlock_wake_private
+ 1.45% 1.44% perf [.] hpp__nop_cmp
1.45% 1.43% perf [.] rb_erase
+ 1.44% 1.42% perf [.] __sort__hpp_cmp
1.31% 0.16% libc-2.20.so [.] __lll_lock_wait_private
1.18% 0.19% [kernel] [k] futex_wake
+ 1.13% 1.13% perf [.] sort__sym_cmp
1.11% 0.02% [kernel] [k] schedule
1.09% 0.22% [kernel] [k] __schedule
0.99% 0.08% [kernel] [k] futex_wait

So its quite a lot of mallocs, i.e. just do a 'perf top -g' and wait a
bit, malloc goes on bubbling up to the top, about the same time it
starts showing that popup screen telling that we're losing events.

If I use --no-children, to see if there is a difference, using either
--call-graph caller or callee, it doesn't get more than about 1%.

Ok, now I tried with "perf top --call-graph caller" i.e. with
--children, and looked at the _int_malloc callchains I get really long,
bogus callchains, see below. That explains why I don't lose events when
I use --max-stack.

I'll have to stop now, and I put the full perf.hist.1 at
http://vger.kernel.org/~acme/perf/perf.hist.1.xz

- Arnaldo

[root@zoo ~]# head -60 perf.hist.1
- 17.92% 17.10% libc-2.20.so [.] _int_malloc
+ 112.80% _int_malloc
11.14% 0x41bf5118
0x41bf5068
0x41bf5118
0x41bf5068
0x41bf5118
0x41bf5068
0x41bf5118
0x41bf5068
0x41bf5118
0x41bf5068
0x41bf5118
0x41bf5068
0x41bf5118
0x41bf5068
0x41bf5118
0x41bf5068
0x41bf5118
0x41bf5068
0x41bf5118
0x41bf5068
0x41bf5118
0x41bf5068
0x41bf5118
0x41bf5068
0x41bf5118
0x41bf5068
0x41bf5118
0x41bf5068
0x41bf5118
0x41bf5068
0x41bf5118
0x41bf5068
0x41bf5118
0x41bf5068
0x41bf5118
0x41bf5068
0x41bf5118
0x41bf5068
0x41bf5118
0x41bf5068
0x41bf5118
0x41bf5068
0x41bf5118
0x41bf5068
0x41bf5118
0x41bf5068
0x41bf5118
0x41bf5068
0x41bf5118
0x41bf5068
0x41bf5118
0x41bf5068
0x41bf5118
0x41bf5068
0x41bf5118
0x41bf5068
0x41bf5118
0x41bf5068
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Next message: Finn Thain: "Re: [PATCH 00/71] More fixes, cleanup and modernization for NCR5380 drivers"
Previous message: yamada.masahiro: "RE: [PATCH] ARM: use "depends on" for SoC configs instead of "if" after prompt"
In reply to: Arnaldo Carvalho de Melo: "Re: [PATCH 34/37] perf hists browser: Support flat callchains"
Next in thread: Namhyung Kim: "Re: [PATCH 34/37] perf hists browser: Support flat callchains"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]