Re: [PATCH v5 00/50] Improvements to memory use

From: Ian Rogers
Date: Wed Dec 06 2023 - 19:11:29 EST


On Wed, Nov 29, 2023 at 5:16 PM Namhyung Kim <namhyung@xxxxxxxxxx> wrote:
>
> On Mon, Nov 27, 2023 at 2:09 PM Ian Rogers <irogers@xxxxxxxxxx> wrote:
> >
> > Fix memory leaks detected by address/leak sanitizer affecting LBR
> > call-graphs, perf mem and BPF offcpu.
> >
> > Make branch_type_stat in callchain_list optional as it is large and
> > not always necessary - in particular it isn't used by perf top.
> >
> > Make the allocations of zstd streams, kernel symbols and event copies
> > lazier in order to save memory in cases like perf record.
> >
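
For readers less familiar with the pattern, the lazier allocation above
amounts to roughly the following. This is a minimal sketch, not code from
the series: `struct lazy_buf` and both helpers are hypothetical names made
up for illustration.

```c
#include <stdlib.h>

/* Hypothetical stand-in for an object owning a large, rarely used buffer. */
struct lazy_buf {
	void *data;	/* NULL until the buffer is first needed */
	size_t size;	/* bytes currently allocated */
};

/*
 * Return a buffer of at least 'needed' bytes, allocating or growing it on
 * demand instead of up front. Returns NULL if the allocation fails, in
 * which case any previously allocated buffer is left untouched.
 */
static void *lazy_buf__get(struct lazy_buf *lb, size_t needed)
{
	if (lb->size < needed) {
		void *tmp = realloc(lb->data, needed);

		if (!tmp)
			return NULL;
		lb->data = tmp;
		lb->size = needed;
	}
	return lb->data;
}

/* Release the buffer, if it was ever allocated. */
static void lazy_buf__exit(struct lazy_buf *lb)
{
	free(lb->data);
	lb->data = NULL;
	lb->size = 0;
}
```

A command that never touches the buffer never pays for it, which is where
the saving comes from in cases like perf record.
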
> > Handle the thread exit event and have it remove the thread from the
> > threads set in machine. Don't do this for perf report as it causes a
> > regression for task lists, which assume threads are never removed from
> > the machine's set, and offcpu events, that may synthesize samples for
> > threads that have exited.
> >
> > Avoid using 8kb buffers for filename__read_str which is excessive for
> > reading CPU maps. Add io_dir as an allocation free readdir
> > replacement, opendir allocating 32kb by default and the code uses it
> > recursively.
> >
> > Shrink perf map using a two value byte to replace two function
> > pointers. Modify the implementation of maps to not use an rbtree as
> > the container for maps, instead use a sorted array. Improve locking
> > and reference counting issues.
> >
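
As an illustration of the sorted array idea, here is a minimal sketch of a
lazily sorted array with binary search lookup. It is not the maps code:
`struct range`, `struct range_set` and the helpers are hypothetical, there
is no locking or reference counting, and ranges are assumed not to overlap.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical, simplified stand-ins; not the perf structures. */
struct range {
	uint64_t start;	/* inclusive */
	uint64_t end;	/* exclusive */
};

struct range_set {
	struct range *entries;
	size_t nr;
	size_t allocated;
	bool sorted;	/* cleared when an append breaks the ordering */
};

/* qsort comparator: order ranges by start address. */
static int range__cmp_start(const void *a, const void *b)
{
	const struct range *ra = a, *rb = b;

	return (ra->start > rb->start) - (ra->start < rb->start);
}

/* bsearch comparator: the key is an address, the element is a range. */
static int range__cmp_addr(const void *key, const void *elem)
{
	uint64_t addr = *(const uint64_t *)key;
	const struct range *r = elem;

	if (addr < r->start)
		return -1;
	return addr >= r->end ? 1 : 0;
}

/* Append without sorting; the order is restored lazily on lookup. */
static int range_set__insert(struct range_set *rs, struct range r)
{
	if (rs->nr == rs->allocated) {
		size_t n = rs->allocated ? rs->allocated * 2 : 8;
		struct range *tmp = realloc(rs->entries, n * sizeof(*tmp));

		if (!tmp)
			return -1;
		rs->entries = tmp;
		rs->allocated = n;
	}
	if (rs->nr && rs->entries[rs->nr - 1].start > r.start)
		rs->sorted = false;
	rs->entries[rs->nr++] = r;
	return 0;
}

/* Find the range containing addr, sorting first if needed. */
static struct range *range_set__find(struct range_set *rs, uint64_t addr)
{
	if (!rs->nr)
		return NULL;
	if (!rs->sorted) {
		qsort(rs->entries, rs->nr, sizeof(*rs->entries),
		      range__cmp_start);
		rs->sorted = true;
	}
	return bsearch(&addr, rs->entries, rs->nr, sizeof(*rs->entries),
		       range__cmp_addr);
}
```

Compared with a tree, each entry carries no node pointers, and a burst of
inserts costs a single sort at the next lookup rather than a rebalance per
insert.
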
> > Similar to maps separate out and reimplement threads to use a hashmap
> > for lower memory consumption and faster look up. This fixes a
> > regression in memory usage where reference count checking switched to
> > using non-invasive tree nodes. Reduce its default size by 32 times
> > and improve locking discipline. Also, fix regressions where tids had
> > become unordered to make `perf report --tasks` and
> > `perf trace --summary` output easier to read.
> >
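
To make the table size and tid ordering points concrete, here is a rough
sketch of a small fixed-size hash table keyed by tid, plus a helper that
sorts tids before output. Everything in it is hypothetical: it uses plain
chaining rather than the hashmap the series uses, and it has no locking or
reference counting.

```c
#include <stdlib.h>

#define TABLE_BITS 3
#define TABLE_SIZE (1 << TABLE_BITS)	/* 8 buckets rather than 256 */

/* Hypothetical simplified entry; not perf's struct thread. */
struct entry {
	int tid;
	struct entry *next;	/* chain within one bucket */
};

struct table {
	struct entry *buckets[TABLE_SIZE];
	size_t nr;
};

static unsigned int table__hash(int tid)
{
	return (unsigned int)tid & (TABLE_SIZE - 1);
}

static struct entry *table__find(struct table *t, int tid)
{
	struct entry *e;

	for (e = t->buckets[table__hash(tid)]; e; e = e->next) {
		if (e->tid == tid)
			return e;
	}
	return NULL;
}

/* Return the entry for tid, creating it if it does not exist yet. */
static struct entry *table__findnew(struct table *t, int tid)
{
	struct entry *e = table__find(t, tid);
	unsigned int h = table__hash(tid);

	if (e)
		return e;
	e = malloc(sizeof(*e));
	if (!e)
		return NULL;
	e->tid = tid;
	e->next = t->buckets[h];
	t->buckets[h] = e;
	t->nr++;
	return e;
}

static int cmp_int(const void *a, const void *b)
{
	int ia = *(const int *)a, ib = *(const int *)b;

	return (ia > ib) - (ia < ib);
}

/*
 * Hash order is arbitrary, so gather the tids and sort them before
 * printing. Returns NULL when empty or on failure; the caller frees.
 */
static int *table__sorted_tids(struct table *t)
{
	struct entry *e;
	size_t i, n = 0;
	int *tids;

	if (!t->nr)
		return NULL;
	tids = malloc(t->nr * sizeof(*tids));
	if (!tids)
		return NULL;
	for (i = 0; i < TABLE_SIZE; i++) {
		for (e = t->buckets[i]; e; e = e->next)
			tids[n++] = e->tid;
	}
	qsort(tids, n, sizeof(*tids), cmp_int);
	return tids;
}

static void table__exit(struct table *t)
{
	size_t i;

	for (i = 0; i < TABLE_SIZE; i++) {
		while (t->buckets[i]) {
			struct entry *e = t->buckets[i];

			t->buckets[i] = e->next;
			free(e);
		}
	}
	t->nr = 0;
}
```

The empty table here is 8 pointers plus a count, so short-lived commands
with a handful of threads pay very little for it.
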
> > Better encapsulate the dsos abstraction. Replace the linked list and
> > rbtree, used for faster iteration and log(n) lookup, with a sorted array
> > for similar performance but half the memory usage per dso. Improve
> > reference counting and locking discipline, adding reference count
> > checking to dso. Experimented with, but abandoned, a hashmap
> > implementation due to the need for extra storage and the keys not
> > being stable.
> >
> > The overall effect is to reduce memory consumption significantly for
> > perf top - with call graphs enabled it runs longer before 1GB of
> > memory is consumed. For a perf record of 'true', the memory
> > consumption goes from 39912kb max resident to 20096kb max resident -
> > nearly halved. perf inject with -b of a system wide perf record of
> > 'true' reduces the max resident by roughly 4.5% (3.4% in v4 due to
> > branch_type_stat changes being merged). This is while improving
> > correctness with locking discipline and reference count checking.
> >
> > Patch organization (v5):
> > - 50 patches is a lot. The patches aren't divided as they merge conflict, and
> > later patches, for example in dsos, rely on the changes and fixes to maps.
>
> You don't need to do it all at once. AFAIK the io_dir changes are independent
> and you can separate map/maps changes from others. Maybe you can wait
> for the map changes to be merged before working on the dso changes. I know it'd take
> more time but it'd be easier to deal with smaller patches focusing on a single
> factor both for you and the reviewers.

Agreed on the io_dir changes, they were intentionally first so they
were easy to take, but I can make them their own series.
The dsos changes are only asan clean with the maps changes, so I
prefer to keep these two longer series together.

Thanks,
Ian

> p.s. I know I also have a set of ~50 patches and feel sorry for saying
> this. ;-p Maybe I need to split the data type profiling series too.
>
> Thanks,
> Namhyung
>
>
> > - the dso reference count checking patch is larger due to switching uses of
> > dso to go through accessors, to encapsulate the reference count checker macros. The
> > reference count checking changes within this largely mechanical change amount
> > to a few lines and so weren't separated.
> > - the first patch contains a build fix, missed from v3, for when rwsem
> > error checking is enabled.
> > - the next patches are an assortment of memory size fixes.
> > - the next patches are the refactoring of maps.
> > - the next patches are the refactoring of threads.
> > - the next patches are the refactoring of dsos.
> > - finally reference count checking is added to dso and some lock/reference
> > count issues are resolved. This is done after changing the data structures,
> > for example, as the single pointer on an array is easier to add reference
> > count checking to compared to the 5 previous pointers.
> >
> > v5: 3 patches were merged. 2nd patch addressed feedback from
> > namhyung@xxxxxxxxxx and Guilherme Amadio <amadio@xxxxxxxxxx>. 4th
> > patch renames the function to getdelim as suggested by
> > namhyung@xxxxxxxxxx. 5th patch adds the missing sysfs mountpoint as
> > suggested by namhyung@xxxxxxxxxx. 49th patch fixes a missed put in
> > the dso_data tests.
> > v4: Rebased as 11 changes moved to perf-tools-next. Address comments
> > from v3 such as error checking on zstd streams. Improve the
> > dsos/dso in ways similar to threads and maps, with the addition of
> > reference count checking on dso.
> > v3: Additional memory/speed improvements, in particular for maps and
> > threads. Address review comments from namhyung@xxxxxxxxxx and
> > adrian.hunter@xxxxxxxxx.
> > v2: Add additional memory fixes on top of initial LBR and rc check
> > fixes.
> >
> > Ian Rogers (50):
> > perf comm: Use regular mutex
> > libperf: Lazily allocate/size mmap event copy
> > perf mmap: Lazily initialize zstd streams
> > tools api fs: Switch filename__read_str to use io.h
> > tools api fs: Avoid reading whole file for a 1 byte bool
> > tools lib api: Add io_dir an allocation free readdir alternative
> > perf maps: Switch modules tree walk to io_dir__readdir
> > perf record: Be lazier in allocating lost samples buffer
> > perf pmu: Switch to io_dir__readdir
> > perf header: Switch mem topology to io_dir__readdir
> > perf events: Remove scandir in thread synthesis
> > perf map: Simplify map_ip/unmap_ip and make map size smaller
> > perf maps: Move symbol maps functions to maps.c
> > perf thread: Add missing RC_CHK_EQUAL
> > perf maps: Add maps__for_each_map to call a function on each entry
> > perf maps: Add remove maps function to remove a map based on callback
> > perf debug: Expose debug file
> > perf maps: Refactor maps__fixup_overlappings
> > perf maps: Do simple merge if given map doesn't overlap
> > perf maps: Rename clone to copy from
> > perf maps: Add maps__load_first
> > perf maps: Add find next entry to give entry after the given map
> > perf maps: Reduce scope of map_rb_node and maps internals
> > perf maps: Fix up overlaps during fixup_end
> > perf maps: Switch from rbtree to lazily sorted array for addresses
> > perf maps: Get map before returning in maps__find
> > perf maps: Get map before returning in maps__find_by_name
> > perf maps: Get map before returning in maps__find_next_entry
> > perf maps: Hide maps internals
> > perf maps: Locking tidy up of nr_maps
> > perf dso: Reorder variables to save space in struct dso
> > perf report: Sort child tasks by tid
> > perf trace: Ignore thread hashing in summary
> > perf machine: Move fprintf to for_each loop and a callback
> > perf threads: Move threads to its own files
> > perf threads: Switch from rbtree to hashmap
> > perf threads: Reduce table size from 256 to 8
> > perf dsos: Attempt to better abstract dsos internals
> > perf dsos: Tidy reference counting and locking
> > perf dsos: Add dsos__for_each_dso
> > perf dso: Move dso functions out of dsos
> > perf dsos: Switch more loops to dsos__for_each_dso
> > perf dsos: Switch backing storage to array from rbtree/list
> > perf dsos: Remove __dsos__addnew
> > perf dsos: Remove __dsos__findnew_link_by_longname_id
> > perf dsos: Switch hand code to bsearch
> > perf dso: Add reference count checking and accessor functions
> > perf dso: Reference counting related fixes
> > perf dso: Use container_of to avoid a pointer in dso_data
> > perf env: Avoid recursively taking env->bpf_progs.lock
> >
> > tools/lib/api/Makefile | 2 +-
> > tools/lib/api/fs/fs.c | 80 +-
> > tools/lib/api/io.h | 11 +-
> > tools/lib/api/io_dir.h | 75 +
> > tools/lib/perf/include/internal/mmap.h | 3 +-
> > tools/lib/perf/mmap.c | 21 +-
> > tools/perf/arch/x86/tests/dwarf-unwind.c | 1 +
> > tools/perf/arch/x86/util/event.c | 103 +-
> > tools/perf/builtin-annotate.c | 6 +-
> > tools/perf/builtin-buildid-cache.c | 2 +-
> > tools/perf/builtin-buildid-list.c | 18 +-
> > tools/perf/builtin-inject.c | 96 +-
> > tools/perf/builtin-kallsyms.c | 2 +-
> > tools/perf/builtin-mem.c | 4 +-
> > tools/perf/builtin-record.c | 57 +-
> > tools/perf/builtin-report.c | 243 ++--
> > tools/perf/builtin-script.c | 8 +-
> > tools/perf/builtin-top.c | 4 +-
> > tools/perf/builtin-trace.c | 41 +-
> > tools/perf/tests/code-reading.c | 8 +-
> > tools/perf/tests/dso-data.c | 67 +-
> > tools/perf/tests/hists_common.c | 6 +-
> > tools/perf/tests/hists_cumulate.c | 4 +-
> > tools/perf/tests/hists_output.c | 2 +-
> > tools/perf/tests/maps.c | 64 +-
> > tools/perf/tests/symbols.c | 2 +-
> > tools/perf/tests/thread-maps-share.c | 8 +-
> > tools/perf/tests/vmlinux-kallsyms.c | 181 +--
> > tools/perf/ui/browsers/annotate.c | 6 +-
> > tools/perf/ui/browsers/hists.c | 8 +-
> > tools/perf/ui/browsers/map.c | 4 +-
> > tools/perf/util/Build | 1 +
> > tools/perf/util/annotate.c | 44 +-
> > tools/perf/util/auxtrace.c | 2 +-
> > tools/perf/util/block-info.c | 2 +-
> > tools/perf/util/bpf-event.c | 17 +-
> > tools/perf/util/bpf-event.h | 12 +-
> > tools/perf/util/bpf_lock_contention.c | 10 +-
> > tools/perf/util/build-id.c | 136 +-
> > tools/perf/util/build-id.h | 2 -
> > tools/perf/util/callchain.c | 4 +-
> > tools/perf/util/comm.c | 10 +-
> > tools/perf/util/compress.h | 6 +-
> > tools/perf/util/data-convert-json.c | 2 +-
> > tools/perf/util/db-export.c | 6 +-
> > tools/perf/util/debug.c | 22 +-
> > tools/perf/util/debug.h | 1 +
> > tools/perf/util/dlfilter.c | 12 +-
> > tools/perf/util/dso.c | 468 ++++---
> > tools/perf/util/dso.h | 544 ++++++--
> > tools/perf/util/dsos.c | 529 ++++---
> > tools/perf/util/dsos.h | 40 +-
> > tools/perf/util/env.c | 53 +-
> > tools/perf/util/env.h | 4 +
> > tools/perf/util/event.c | 12 +-
> > tools/perf/util/header.c | 47 +-
> > tools/perf/util/hist.c | 4 +-
> > tools/perf/util/intel-pt.c | 22 +-
> > tools/perf/util/machine.c | 652 +++------
> > tools/perf/util/machine.h | 32 +-
> > tools/perf/util/map.c | 93 +-
> > tools/perf/util/map.h | 83 +-
> > tools/perf/util/maps.c | 1239 +++++++++++++----
> > tools/perf/util/maps.h | 95 +-
> > tools/perf/util/mmap.c | 5 +-
> > tools/perf/util/mmap.h | 1 -
> > tools/perf/util/pmu.c | 48 +-
> > tools/perf/util/pmus.c | 30 +-
> > tools/perf/util/probe-event.c | 62 +-
> > tools/perf/util/rb_resort.h | 5 -
> > .../scripting-engines/trace-event-python.c | 21 +-
> > tools/perf/util/session.c | 21 +
> > tools/perf/util/session.h | 2 +
> > tools/perf/util/sort.c | 19 +-
> > tools/perf/util/srcline.c | 65 +-
> > tools/perf/util/symbol-elf.c | 138 +-
> > tools/perf/util/symbol.c | 521 ++-----
> > tools/perf/util/symbol.h | 1 -
> > tools/perf/util/symbol_fprintf.c | 4 +-
> > tools/perf/util/synthetic-events.c | 156 ++-
> > tools/perf/util/thread.c | 48 +-
> > tools/perf/util/thread.h | 6 -
> > tools/perf/util/threads.c | 186 +++
> > tools/perf/util/threads.h | 35 +
> > tools/perf/util/unwind-libunwind-local.c | 50 +-
> > tools/perf/util/unwind-libunwind.c | 9 +-
> > tools/perf/util/vdso.c | 89 +-
> > tools/perf/util/zstd.c | 63 +-
> > 88 files changed, 4101 insertions(+), 2827 deletions(-)
> > create mode 100644 tools/lib/api/io_dir.h
> > create mode 100644 tools/perf/util/threads.c
> > create mode 100644 tools/perf/util/threads.h
> >
> > --
> > 2.43.0.rc1.413.gea7ed67945-goog
> >