Re: [PATCH v2 4/8] objtool: add undwarf debuginfo generation

From: Josh Poimboeuf
Date: Mon Jul 10 2017 - 22:58:15 EST


On Fri, Jul 07, 2017 at 11:44:37AM +0200, Ingo Molnar wrote:
>
> * Josh Poimboeuf <jpoimboe@xxxxxxxxxx> wrote:
>
> > On Thu, Jun 29, 2017 at 10:06:52AM -0500, Josh Poimboeuf wrote:
> > > On Thu, Jun 29, 2017 at 04:46:18PM +0200, Ingo Molnar wrote:
> > > >
> > > > * Josh Poimboeuf <jpoimboe@xxxxxxxxxx> wrote:
> > > >
> > > > > > Plus, shouldn't we use __packed for 'struct undwarf' to minimize the
> > > > > > structure's size (to 6 bytes AFAICS?) - or is optimal packing of the main
> > > > > > undwarf array already guaranteed on every platform with this layout?
> > > > >
> > > > > Ah yes, it should definitely be packed (assuming that doesn't affect performance
> > > > > negatively).
> > > >
> > > > So if I count that correctly that should shave another ~1MB off a typical ~4MB
> > > > table size?
> > >
> > > Here's what my Fedora kernel looks like *before* the packed change:
> > >
> > > $ eu-readelf -S vmlinux |grep undwarf
> > > [15] .undwarf_ip PROGBITS ffffffff81f776d0 011776d0 0012d9d0 0 A 0 0 1
> > > [16] .undwarf PROGBITS ffffffff820a50a0 012a50a0 0025b3a0 0 A 0 0 1
> > >
> > > The total undwarf data size is ~3.5MB.
> > >
> > > There are 308852 entries of two parallel arrays:
> > >
> > > * .undwarf (8 bytes/entry) = 2470816 bytes
> > > * .undwarf_ip (4 bytes/entry) = 1235408 bytes
> > >
> > > If we pack undwarf, reducing the size of the .undwarf entries by two
> > > bytes, it will save 308852 * 2 = 617704.
> > >
> > > So the savings will be ~600k, and the typical size will be reduced to ~3MB.
> >
> > Just for the record, while packing the struct from 8 to 6 bytes did save 600k,
> > it also made the unwinder ~7% slower. I think that's probably an ok tradeoff,
> > so I'll leave it packed in v3.
>
> So, out of curiosity, I'm wondering where that slowdown comes from: on modern x86
> CPUs indexing by units of 6 bytes ought to be just as fast as indexing by 8 bytes,
> unless I'm missing something? Is it maybe the not naturally aligned 32-bit words?
>
> Or maybe there's some bad case of a 32-bit word crossing a 64-byte cache line
> boundary that hits some pathological aspect of the CPU? We could probably get
> around any such problems by padding by 2 bytes on 64-byte boundaries - that's only
> a ~3% data size increase. The flip side would be a complication of the data
> structure and its accessors - which might cost more in terms of code generation
> efficiency than it buys us to begin with ...
>
> Also, there's another aspect besides RAM footprint: a large data structure that is
> ~20% smaller means 20% less cache footprint: which for cache cold lookups might
> matter more than the direct computational cost.

tl;dr: Packed really seems to be more like ~2% slower, time for an adult
beverage.

So I tested again with the latest version of my code, and this time
packed was 5% *faster* than unpacked, rather than 7% slower. 'perf
stat' showed that, in both cases, most of the difference was caused by
branch misses in the binary search code. But that code doesn't even
touch the packed struct...

After some hair-pulling/hand-wringing I realized that changing the
struct packing caused GCC to change some of the unwinder code a bit,
which shifted the rest of the kernel's function offsets enough that it
changed the behavior of the unwind table binary search in a way that
affected the CPU's branch prediction. And my crude benchmark was just
unwinding the same stack on repeat, so a small change in the loop
behavior had a big impact on the overall branch predictability.

Anyway, I used some linker magic to temporarily move the unwinder code
to the end of .text, so that unwinder changes don't add unexpected side
effects to the microbenchmark behavior. Now I'm getting more consistent
results: the packed struct is measuring ~2% slower. The slight slowdown
might just be explained by the fact that GCC generates some extra
instructions for extracting the fields out of the packed struct.

In the meantime, I found a ~10% speedup by making the "fast lookup
table" block size a power-of-two (256) to get rid of the need for a slow
'div' instruction.

I think I'm done performance tweaking for now. I'll keep the packed
struct, and add the code for the 'div' removal, and hope to submit v3
soon.

--
Josh