Re: [PATCH] perf ordered_events: Optimise event object reuse

From: Jiri Olsa
Date: Wed May 20 2020 - 17:52:44 EST


On Wed, May 20, 2020 at 02:00:49PM +0100, Matt Fleming wrote:
> On Mon, 18 May, at 02:04:08PM, Jiri Olsa wrote:
> > On Fri, May 15, 2020 at 10:01:51PM +0100, Matt Fleming wrote:
> > > ordered_event objects can be placed on the free object cache list in any
> > > order which means future allocations may not return objects at
> > > sequential locations in memory. Getting non-contiguous objects from the
> > > free cache has bad consequences when later iterating over those objects
> > > in ordered_events__queue().
> > >
> > > For example, large perf.data files can contain trillions of events and
> > > since objects that are next to each other in the free cache linked list
> > > can point to pretty much anywhere in the object address space, lots of
> > > cycles in ordered_events__queue() are spent servicing DTLB misses.
> > >
> > > Implement the free object cache using the in-kernel implementation of
> > > interval trees so that objects can always be allocated from the free
> > > object cache in sequential order, improving spatial locality and
> > > reducing DTLB misses.
> > >
> > > Here are some numbers showing the speed up (reducing in execution time)
> > > when running perf sched latency on sched events data and perf report on
> > > HW_CPU_CYCLES.
> >
> > really nice, few questions below
> >
> > >
> > > $ perf stat --null -r 10 -- bash -c \
> > > "export PAGER=cat ; perf sched latency -i $file --stdio &>/dev/null"
> > >
> > > Nr events File Size Before After Speed up
> > > -------------- --------- -------- ------- ----------
> > > 123318457470 29MB 0.2149 0.2440 -13.5%
> >
> > should we be concerned about small data and the extra processing?
>
> I didn't look into this slowdown originally because it's ~2.9 ms, but
> FYI it looks like this is caused by:
>
> - Longer code paths (more instructions)
> - More branches
> - More branch mispredicts
>
> > maybe we could add some option that disables this, at leat to be
> > able to compare times in the future
>
> Sure. Do you mean a command-line option or build-time config?

command line option would be great

SNIP

> > > diff --git a/tools/perf/tests/free-object-cache.c b/tools/perf/tests/free-object-cache.c
> > > new file mode 100644
> > > index 000000000000..e4395ece7d2b
> > > --- /dev/null
> > > +++ b/tools/perf/tests/free-object-cache.c
> > > @@ -0,0 +1,200 @@
> > > +// SPDX-License-Identifier: GPL-2.0
> > > +#include "tests.h"
> > > +#include <linux/kernel.h>
> > > +
> > > +#define ordered_events__flush_time __test_ordered_events__flush_time
> > > +#define ordered_events__first_time __test_ordered_events__first_time
> > > +#define ordered_events__delete __test_ordered_events__delete
> > > +#define ordered_events__init __test_ordered_events__init
> > > +#define ordered_events__free __test_ordered_events__free
> > > +#define ordered_events__queue __test_ordered_events__queue
> > > +#define ordered_events__reinit __test_ordered_events__reinit
> > > +#define ordered_events__flush __test_ordered_events__flush
> >
> > I'm excited to see these tests, but why is above needed?
> >
> > can't you use ordered-events interface as it is? you used only
> > exported functions right?
>
> Nope, the tests in this file are unit tests so I'm testing
> free_cache_{get,put} which are file-local functions by #include'ing
> ordered-events.c.
>
> The above define are required to avoid duplicate symbol errors at
> link-time, e.g.
>
> util/perf-in.o: In function `ordered_events__flush_time':
> /home/matt/src/kernels/linux/tools/perf/util/ordered-events.c:461: multiple definition of `ordered_events__flush_time'
> tests/perf-in.o:/home/matt/src/kernels/linux/tools/perf/tests/../util/ordered-events.c:461: first defined here
>
> There are other ways to resolve this (linker flags to change the
> symbols) but I couldn't find any precedent with that, so this seemed
> like the easiest and most obvious solution. I'm happy to fix this up any
> other way if you have suggestions though.

hum, could we just make free_cache_{get,put} public?

thanks,
jirka