Re: [GIT PULL] Clang feature updates for v5.14-rc1

From: Nick Desaulniers
Date: Tue Jun 29 2021 - 17:27:55 EST


On Tue, Jun 29, 2021 at 2:04 PM Linus Torvalds
<torvalds@xxxxxxxxxxxxxxxxxxxx> wrote:
>
> On Tue, Jun 29, 2021 at 1:44 PM Kees Cook <keescook@xxxxxxxxxxxx> wrote:
> > >
> > > And it causes the kernel to be bigger and run slower.
> >
> > Right -- that's expected. It's not designed to be the final kernel
> > someone uses. :)
>
> Well, from what I've seen, you actually want to run real loads in
> production environments for PGO to actually be anything but a bogus
> "performance benchmarks only" kind of thing.
>
> Of course, "performance benchmarks only" is very traditional, and
> we've seen that used over and over in the past in this industry. That
> doesn't make it _right_, though.

The current major use case is ensuring that production kernels have
been "trained" with specific workloads in mind.

> And if you actually want to have it usable in production environments,
> you really should strive to run code as closely as possible to a
> production kernel too.

You could do both. There is a line of research internally using
multiple training rounds ("CSPGO").

> You'd want to run something that you can sample over time, and in
> production, not something that you have to build a special kernel for
> that then gets used for a benchmark run, but can't be kept in
> production because it performs so much worse.
>
> Real proper profiles will tell you what *really* matters - and if you
> don't have enough samples to give you good information, then that
> particular code clearly is not important enough to waste PGO on.
>
> This is not all that dissimilar to using gprof information for
> traditional - manual - optimizations.
>
> Sure, instrumented gprof output is better than nothing, but it is
> *hugely* worse than actual proper sampled profiles that actually show
> what matters for performance (as opposed to what runs a lot - the two
> are not necessarily all that closely correlated, with cache misses
> being a thing).
>
> And I really hate how pretty much all of the PGO support seems to be
> just about this inferior method of getting the data.

Right now we're having trouble with hardware performance counters on
non-Intel chips; I don't think we have working LBR equivalents on AMD
until Zen 3, and our ETM-based samples on ARM are hung up on a few
last-minute issues requiring new hardware (from multiple different
chipset vendors).

It would be good to have some form of profile-based optimization that
isn't architecture or microarchitecture dependent.
--
Thanks,
~Nick Desaulniers