Re: [PATCH v1] tools api io: Move filling the io buffer to its own function

From: Ian Rogers
Date: Thu May 30 2024 - 13:02:12 EST


On Thu, May 30, 2024 at 9:44 AM Namhyung Kim <namhyung@xxxxxxxxxx> wrote:
>
> On Thu, May 23, 2024 at 9:47 PM Ian Rogers <irogers@xxxxxxxxxx> wrote:
> >
> > On Thu, May 23, 2024 at 4:25 PM Namhyung Kim <namhyung@xxxxxxxxxx> wrote:
> > >
> > > On Sun, May 19, 2024 at 11:17 AM Ian Rogers <irogers@xxxxxxxxxx> wrote:
> > > >
> > > > In general a read fills 4kb so filling the buffer is a 1 in 4096
> > > > operation, move it out of the io__get_char function to avoid some
> > > > checking overhead and to better hint the function is good to inline.
> > > >
> > > > For perf's IO intensive internal (non-rigorous) benchmarks there's a
> > > > near 8% improvement to kallsyms-parsing with a default build.
> > >
> > > Oh, is it just from removing the io->eof check? Otherwise I don't
> > > see any difference.
> >
> > I was hoping that by moving the code out-of-line then the hot part of
> > the function could be inlined into things like reading the hex
> > character. I didn't see that, presumably there are too many callers
> > and so that made the inliner think sharing would be best even though
> > the hot code is a compare, pointer dereference and an increment. I
> > tried forcing inlining but it didn't seem to win over just having the
> > code out-of-line. The eof check should be very well predicted. The
> > out-of-line code was branched over forward, which should be 1
> > mispredict but again not a huge deal. I didn't do a more thorough
> > analysis as I still prefer to have the cold code out-of-line.
>
> Ok, I don't see much difference with this change. But the change itself
> looks fine.
>
> Thanks,
> Namhyung
>
>
> Before:
>
> # Running internals/synthesize benchmark...
> Computing performance of single threaded perf event synthesis by
> synthesizing events on the perf process itself:
> Average synthesis took: 237.274 usec (+- 0.066 usec)
> Average num. events: 24.000 (+- 0.000)
> Average time per event 9.886 usec
> Average data synthesis took: 241.126 usec (+- 0.087 usec)
> Average num. events: 128.000 (+- 0.000)
> Average time per event 1.884 usec
>
> # Running internals/kallsyms-parse benchmark...
> Average kallsyms__parse took: 184.374 ms (+- 0.022 ms)
>
> # Running internals/inject-build-id benchmark...
> Average build-id injection took: 20.096 msec (+- 0.115 msec)
> Average time per event: 1.970 usec (+- 0.011 usec)
> Average memory usage: 11574 KB (+- 29 KB)
> Average build-id-all injection took: 13.477 msec (+- 0.100 msec)
> Average time per event: 1.321 usec (+- 0.010 usec)
> Average memory usage: 11160 KB (+- 0 KB)
>
> # Running internals/evlist-open-close benchmark...
> Number of cpus: 64
> Number of threads: 1
> Number of events: 1 (64 fds)
> Number of iterations: 100
> evlist__open: Permission denied
>
> # Running internals/pmu-scan benchmark...
> Computing performance of sysfs PMU event scan for 100 times
> Average core PMU scanning took: 135.880 usec (+- 0.249 usec)
> Average PMU scanning took: 816.745 usec (+- 48.293 usec)
>
>
> After:
>
> # Running internals/synthesize benchmark...
> Computing performance of single threaded perf event synthesis by
> synthesizing events on the perf process itself:
> Average synthesis took: 235.711 usec (+- 0.067 usec)
> Average num. events: 24.000 (+- 0.000)
> Average time per event 9.821 usec
> Average data synthesis took: 240.992 usec (+- 0.058 usec)
> Average num. events: 128.000 (+- 0.000)
> Average time per event 1.883 usec
>
> # Running internals/kallsyms-parse benchmark...
> Average kallsyms__parse took: 179.664 ms (+- 0.043 ms)

So this is still 2%. I was building without options like DEBUG=1
enabled, so perhaps that'd explain the difference. Anyway, if you're
more comfortable with a commit message saying a 2% performance win I
don't mind it being updated or I can upload a v2. It's likely this is
being over-thought given the change :-)

Thanks,
Ian

> # Running internals/inject-build-id benchmark...
> Average build-id injection took: 19.901 msec (+- 0.117 msec)
> Average time per event: 1.951 usec (+- 0.011 usec)
> Average memory usage: 12163 KB (+- 10 KB)
> Average build-id-all injection took: 13.627 msec (+- 0.086 msec)
> Average time per event: 1.336 usec (+- 0.008 usec)
> Average memory usage: 11160 KB (+- 0 KB)
>
> # Running internals/evlist-open-close benchmark...
> Number of cpus: 64
> Number of threads: 1
> Number of events: 1 (64 fds)
> Number of iterations: 100
> evlist__open: Permission denied
>
> # Running internals/pmu-scan benchmark...
> Computing performance of sysfs PMU event scan for 100 times
> Average core PMU scanning took: 136.540 usec (+- 0.294 usec)
> Average PMU scanning took: 819.415 usec (+- 48.437 usec)