Re: [RFC PATCH] x86, Add debug option to force all data sections aligned

From: Denys Vlasenko
Date: Fri Sep 24 2021 - 04:13:50 EST

On 9/23/21 4:57 PM, Feng Tang wrote:
On Wed, Sep 22, 2021 at 11:51:37AM -0700, Josh Poimboeuf wrote:
Hi Feng,

Thanks for the interesting LPC presentation about alignment-related
performance issues (which mentioned this patch).

I wonder if we can look at enabling some kind of data section alignment
unconditionally instead of just making it a debug option. Have you done
any performance and binary size comparisons?
Thanks for reviewing this!

For binary size, I just tested a 5.14 kernel with the default desktop
config from Ubuntu (I didn't use the usual rhel-8.3 config used
by 0Day, which is more server-oriented):

    text     data     bss      dec     hex filename
16010221 14971391 6098944 37080556 235cdec vmlinux

v5.14 + 64B-function-align
    text     data     bss      dec     hex filename
18107373 14971391 6098944 39177708 255cdec vmlinux

v5.14 + data-align(THREAD_SIZE 16KB)
    text     data     bss      dec     hex filename
16010221 57001791 6008832 79020844 4b5c32c vmlinux

So for text alignment we see a 13.1% increase in text size, and for
data alignment a 280.7% increase in data size.
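
The two percentages can be rechecked from the size output above. A minimal Python sketch (the helper name pct_increase is made up for illustration); computed this way the data figure comes out at roughly 280.7%:

```python
# Section sizes in bytes, copied from the `size vmlinux` output above.
BASE_TEXT, BASE_DATA = 16010221, 14971391
ALIGNED_TEXT = 18107373   # v5.14 + 64B-function-align
ALIGNED_DATA = 57001791   # v5.14 + data-align

def pct_increase(old, new):
    """Relative growth from old to new, in percent."""
    return (new - old) / old * 100.0

text_growth = pct_increase(BASE_TEXT, ALIGNED_TEXT)
data_growth = pct_increase(BASE_DATA, ALIGNED_DATA)
print(f"text: +{text_growth:.1f}%  data: +{data_growth:.1f}%")
```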

Page-size alignment of all data is WAY too much. At most, alignment
to cache line size should work to make timings stable.
(In your case with "adjacent cache line prefetcher",
it may need to be 128 bytes. But definitely not 4096 bytes).
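
To put rough numbers on why the alignment value matters, here is a small Python sketch of how the total footprint grows when every object start is rounded up to the alignment. The object sizes are invented for illustration and are not taken from a real vmlinux; align_up and padded_size are hypothetical helpers.

```python
def align_up(offset, alignment):
    """Round offset up to the next multiple of alignment (a power of two)."""
    return (offset + alignment - 1) & ~(alignment - 1)

def padded_size(sizes, alignment):
    """Total bytes used when each object starts on an aligned boundary."""
    end = 0
    for size in sizes:
        end = align_up(end, alignment) + size
    return end

# Hypothetical object sizes in bytes, not measured from any kernel.
sizes = [8, 24, 200, 4, 64, 1000]

for alignment in (64, 128, 4096):
    print(alignment, padded_size(sizes, alignment))
```

With small objects the padding dominates quickly as the alignment grows, which is the effect seen in the data-align numbers.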

Performance-wise, I ran some tests with the force-32bytes-text-align
option before (around v5.8), using the will-it-scale, fsmark, hackbench,
netperf and kbuild benchmarks:
* no obvious change for will-it-scale/fsmark/kbuild
* both regressions and improvements across different hackbench cases
* both regressions and improvements for netperf, from -20% to +98%

What usually happens here is that testcases are crafted to measure
how well some workloads scale, and to measure that efficiently,
they are intentionally written to cause congestion -
this way, benefits of better algorithms are easily seen.

However, this also means that in the congested scenario (e.g.
cache bouncing), small changes in CPU architecture are also
easily visible - including cases where optimizations are going awry.

In your presentation, you stumbled upon one such case:
the "adjacent cache line prefetcher" is counter-productive here,
it pulls an unrelated cache line into the CPU, not knowing that
this is in fact harmful - other CPUs need that cache line,
this CPU does not!

Since this particular case was a change in structure layout,
increasing alignment of .data sections won't help here.

My opinion is that we shouldn't worry about this too much.
Diagnose the observed slowdowns: if they are "real"
(there is a way to improve), fix them; if they are spurious,
just let them be.

Even when some CPU optimizations unintentionally hurt some
benchmarks, on average they are usually a win:
CPU makers have hundreds of people looking at that as their
full-time jobs. With your example of "adjacent cache line prefetcher",
CPU people might be looking at ways to detect when these
speculatively pulled-in cache lines are bouncing.

For data alignment, it has a huge impact on size and occupies more
cache/TLB, plus it hurts some normal functionality like dynamic-debug.
So I'm afraid it can only be used as a debug option.

In a similar vein, I think we should re-explore permanently enabling
cacheline-sized function alignment i.e. making something like
CONFIG_DEBUG_FORCE_FUNCTION_ALIGN_64B the default. Ingo did some
research on that a while back:

Thanks for sharing this, I learned a lot from it, and I wish I had
known about this thread when we first looked into strange regressions
in 2019 :)

At the time, the main reported drawback of -falign-functions=64 was that
even small functions got aligned. But now I think that can be mitigated
with some new options like -flimit-function-alignment and/or
-falign-functions=64:X (for some carefully-chosen value of X).

-falign-functions=64:7 should be about right, I guess.
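
As I understand the GCC documentation, the two-value form of -falign-functions aligns a function to the 64-byte boundary only when that takes at most (second value minus one) padding bytes, i.e. up to 6 bytes for a second value of 7. A small Python model of that rule (place_function and crosses_boundary are hypothetical helpers, not a GCC or binutils interface) suggests that the first 7 bytes then never straddle a 64-byte boundary:

```python
def place_function(offset, n=64, m=7):
    """Model of the n/m alignment rule: pad up to the next n-byte
    boundary only if that needs at most m - 1 padding bytes."""
    padding = -offset % n
    return offset + padding if padding <= m - 1 else offset

def crosses_boundary(start, length, n=64):
    """True if bytes [start, start + length) span two n-byte blocks."""
    return start // n != (start + length - 1) // n

# Check every possible starting offset within one 64-byte line.
starts = [place_function(off) for off in range(64)]
assert not any(crosses_boundary(s, 7) for s in starts)
# An 8-byte first instruction can still straddle a boundary.
assert any(crosses_boundary(s, 8) for s in starts)
```

This is only a model of the placement rule under the stated assumption; what the toolchain actually emits should be checked in the generated assembly.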

defconfig vmlinux (w/o FRAME_POINTER) has 42141 functions.
6923 of them have 1st insn 5 or more bytes long,
5841 of them have 1st insn 6 or more bytes long,
5095 of them have 1st insn 7 or more bytes long,
786 of them have 1st insn 8 or more bytes long,
548 of them have 1st insn 9 or more bytes long,
375 of them have 1st insn 10 or more bytes long,
73 of them have 1st insn 11 or more bytes long,
one of them has 1st insn 12 bytes long:
this "heroic" instruction is in local_touch_nmi():
    65 48 c7 05 44 3c 00 7f 00 00 00 00
    movq $0x0,%gs:0x7f003c44(%rip)

Thus ensuring that at least the first seven bytes do not cross
a 64-byte boundary would cover >98% of all functions.
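
The >98% figure follows from the counts above: 786 of the 42141 functions have a first instruction of 8 or more bytes, so for all the others it fits in 7. A quick Python check over the same counts (the coverage helper is just for illustration):

```python
TOTAL_FUNCTIONS = 42141

# Functions whose first instruction is at least N bytes long,
# keyed by N (counts copied from the list above).
at_least = {5: 6923, 6: 5841, 7: 5095, 8: 786,
            9: 548, 10: 375, 11: 73, 12: 1}

def coverage(protected_bytes):
    """Percent of functions whose first instruction fits entirely
    within the first protected_bytes bytes."""
    too_long = at_least.get(protected_bytes + 1, 0)
    return 100.0 * (TOTAL_FUNCTIONS - too_long) / TOTAL_FUNCTIONS

for nbytes in range(4, 13):
    print(f"{nbytes:2d} bytes: {coverage(nbytes):.2f}%")
```

Going from 6 to 7 protected bytes is where most of the gain is, since the count drops from 5095 to 786 at that step.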