Re: Additional debug info to aid cacheline analysis
From: Peter Zijlstra
Date: Thu Oct 08 2020 - 03:02:45 EST
My apologies for the typo in the linux-kernel address, corrected now.
On Wed, Oct 07, 2020 at 10:58:00PM -0700, Stephane Eranian wrote:
> Hi Peter,
>
> On Tue, Oct 6, 2020 at 6:17 AM Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
> >
> > Hi all,
> >
> > I've been trying to float this idea for a fair number of years, and I
> > think at least Stephane has been talking to tools people about it, but
> > I'm not sure what, if anything, ever happened with it, so let me post it
> > here :-)
> >
> >
> Thanks for bringing this back. This is a pet project of mine and I
> have been looking at it intermittently for the last 4 years now. I
> simply never got the chance to complete it because it was preempted
> by other, higher-priority projects. I have developed an internal
> proof-of-concept prototype using one of the 3 approaches I know of. My
> goal was to demonstrate that PMU statistical sampling of loads/stores
> with data addresses would work as well as instrumentation. This is
> slightly different from hit/miss in the analysis but the process is
> the same.
>
> As you point out, the difficulty is not so much in collecting the
> sample but rather in symbolizing data addresses from the heap.
Right, that's non-trivial, although for static and per-cpu objects it
should be rather straightforward; heap objects are going to be a pain.
You'd basically have to also log the alloc/free of every object along
with the data type used for it, which is not something we have readily
available at the allocator.
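Something like the below is what I imagine it would take; a purely
hypothetical sketch, nothing like this record exists today:

	#include <linux/types.h>

	/*
	 * Hypothetical: one such record per allocation, with a matching
	 * record on free, so a sampled data address can later be mapped
	 * back to an object and its type.
	 */
	struct alloc_record {
		__u64	addr;		/* object start */
		__u64	size;		/* object extent */
		__u64	type_id;	/* reference into the type info */
		__u64	ts;		/* to order against the free() */
	};

And every allocation site would have to know the type it allocates for,
which kmalloc() and friends simply don't today.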
> Intel PEBS and IBM Marked Events work well for collecting the data.
> AMD IBS works, though you get a lot of irrelevant samples due to the
> lack of hardware filtering. ARM SPE would work too. Overall, all the
> major architectures provide the sampling support needed.
That's for the data address, or also the eventing IP?
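(For reference, getting both out of perf is just a matter of asking for
them; a minimal sketch of the attr setup, the event selection itself
being per-arch:)

	#include <linux/perf_event.h>

	struct perf_event_attr attr = {
		.size		= sizeof(attr),
		.sample_type	= PERF_SAMPLE_IP |	/* eventing IP */
				  PERF_SAMPLE_ADDR |	/* data address */
				  PERF_SAMPLE_DATA_SRC,	/* op, level, hit/miss */
		.precise_ip	= 2,			/* request 0 skid */
	};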
> Some time ago, I had my intern pursue the other 2 approaches to
> symbolization. The one I see as most promising is using the DWARF
> information (no BPF needed). The good news is that I believe we do not
> need more information than what is already there. We just need the
> compiler to generate valid DWARF at most optimization levels, which I
> believe is not the case for LLVM-based compilers but may be okay for
> GCC.
Right, I think GCC improved a lot on this front over the past few years.
Also added Andi and Masami, who have worked on this or related topics.
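FWIW, once the type layout survives, the symbolization step itself is
simple enough; a sketch below, with a hard-coded table standing in for
what DW_TAG_member / DW_AT_data_member_location already encode:

	#include <stdint.h>
	#include <stddef.h>

	struct member { const char *name; uint64_t off, size; };

	/* as DWARF describes: struct foo { long a; long b; }; */
	static const struct member foo_members[] = {
		{ "a", 0,            sizeof(long) },
		{ "b", sizeof(long), sizeof(long) },
	};

	/* map a sampled data address inside an object to a member name */
	static const char *resolve(uint64_t obj_base, uint64_t data_addr)
	{
		uint64_t off = data_addr - obj_base;
		size_t i;

		for (i = 0; i < sizeof(foo_members) / sizeof(foo_members[0]); i++)
			if (off >= foo_members[i].off &&
			    off < foo_members[i].off + foo_members[i].size)
				return foo_members[i].name;
		return "?";
	}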
> Once we have the DWARF logic in place, it becomes easier to extend
> perf report/annotate to do hit/miss, hot/cold, and read/write analysis
> on each data type and the fields within.
>
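Right; and since PERF_SAMPLE_DATA_SRC already tells you the direction
(mem_op) and the level the access was served from (mem_lvl), the
per-field aggregation falls out naturally. Something like (sketch,
names made up):

	struct field_stats {
		const char	*type_name;	/* "struct foo" */
		const char	*member;	/* "b" */
		__u64		loads, stores;	/* from mem_op */
		__u64		hits, misses;	/* from mem_lvl */
	};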
> Once we have the code for perf, we are planning to contribute it upstream.
>
> In the meantime, we need to lean on the compiler teams to ensure no
> data type information is lost at high optimization levels. My
> understanding from talking with some compiler folks is that this is
> not a trivial fix.
As you might have noticed, I sent this to the linux-toolchains list.
While you lean on your compiler folks, try and get them subscribed to
this list. It is meant to discuss toolchain issues as related to Linux.
Both GCC/binutils and LLVM should be represented here.