Re: [GIT PULL] Clang feature updates for v5.14-rc1
From: Bill Wendling
Date: Mon Oct 31 2022 - 19:56:16 EST
On Fri, Jul 2, 2021 at 5:57 AM Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
>
> On Fri, Jul 02, 2021 at 05:46:46AM -0700, Bill Wendling wrote:
> > On Tue, Jun 29, 2021 at 2:04 PM Linus Torvalds
> > <torvalds@xxxxxxxxxxxxxxxxxxxx> wrote:
> > >
> > > On Tue, Jun 29, 2021 at 1:44 PM Kees Cook <keescook@xxxxxxxxxxxx> wrote:
> > > > >
> > > > > And it causes the kernel to be bigger and run slower.
> > > >
> > > > Right -- that's expected. It's not designed to be the final kernel
> > > > someone uses. :)
> > >
> > > Well, from what I've seen, you actually want to run real loads in
> > > production environments for PGO to actually be anything but a bogus
> > > "performance benchmarks only" kind of thing.
> > >
> > The reason we use PGO in this way is because we _cannot_ release a
> > kernel into production that hasn't had PGO applied to it. The
> > performance of a non-PGO'ed kernel is a non-starter for rollout. We
> > try our best to replicate this environment for the benchmarks, which
> > is the only sane way to do this. I can't imagine that we're the only
> > ones who run up against this chicken-and-egg problem.
> >
> > For why we don't use sampling, PGO gives a better performance boost
> > from an instrumented kernel rather than a sampled profile. I'll work
> > on getting statistics to show this.
>
> I've asked this before; *what* is missing from LBR samples that's
> reponsible for the performance gap?
For one, it lacks information on function call frequencies, which help
inlining. It's also much more coarse-grained than a perf trace. And
while a section of code that doesn't show up in a trace sample may not
be executed much, changes to it may have cascading effects.
Ingo mentioned had some ideas on minimal software LBR PMU
functionality. Do you have a link to this discussion?
"The right technical solution to integrate the clang-pgo software
instrumetnation would be to implement a minimal software-LBR PMU
functionality on top of the clang-pgo engine, and use unified perf tooling
to process the branch tracing/profiling information.
"In the main PGO thread PeterZ made a couple of technical suggestions about
how this can be done using the existing hardware LBR interfaces of perf,
but we are flexible if the design is sane and are open to improvements."
-bw