Re: [GIT PULL] Clang feature updates for v5.14-rc1

From: Ingo Molnar
Date: Wed Jul 07 2021 - 04:10:49 EST



* Nick Desaulniers <ndesaulniers@xxxxxxxxxx> wrote:

> > And I really hate how pretty much all of the PGO support seems to be
> > just about this inferior method of getting the data.
>
> Right now we're having trouble with hardware performance counters on
> non-intel chips; I don't think we have working LBR equivalents on AMD
> until zen3, and our ETM based samples on ARM are hung up on a few last
> minute issues requiring new hardware (from multiple different chipset
> vendors).
>
> It would be good to have some form of profile based optimizations that
> aren't architecture or microarchitecture dependent.

That doesn't excuse using an inferior tooling ABI design though. By your
own description proper hardware LBR support on the platforms you care most
about is either there or close - yet the whole Clang PGO feature is
designed around software based compiler instrumentation? That's backwards.

The right technical solution to integrate the clang-pgo software
instrumentation would be to implement a minimal software-LBR PMU
functionality on top of the clang-pgo engine, and use unified perf tooling
to process the branch tracing/profiling information.

In the main PGO thread PeterZ made a couple of technical suggestions about
how this can be done using the existing hardware LBR interfaces of perf,
but we are flexible if the design is sane and are open to improvements.

I.e. try to commonalize the tooling data as soon as possible - not very
late as in the current proposal, exposing a whole stack of APIs and ABIs to
clang-pgo specific interfaces.

The "LBR data unification" approach has numerous short term and long term
advantages:

 - Hardware assisted LBR tracing support out of the box on two major
   hardware platforms (Power and x86), and on some ARM platforms "soon",
   maybe sooner than this feature trickles down to distributions to begin
   with.

 - GCC won't have to reinvent the wheel - they only need to make sure they
   can generate the minimal LBR data. In that sense perf is an
   'independent' tooling facility they might be more comfortable working
   with as well, than a 'competing' compiler project.

 - There's even a chance that existing instrumentation can be reused - or a
   relatively self-contained compiler plugin can generate it.

 - Lower maintenance overhead, lower risk of subsystem obsolescence.

Binding this feature to clang-pgo on the ABI level is not a good move for
the Linux kernel IMO.

So until this is implemented properly, or an adequate explanation is given
for why I'm wrong:

NACKED-by: Ingo Molnar <mingo@xxxxxxxxxx>

Both for the core kernel and x86 bits.

Please preserve this NAK and mention it prominently in future iterations of
this feature. Please Cc: me on future postings.

Thanks,

Ingo