Re: 2.2.13 & gcc-2.95.1

Mikael Pettersson (mikpe@csd.uu.se)
Mon, 20 Sep 1999 04:04:43 +0200 (MET DST)


[Somewhat off-topic, but gcc problems are relevant for kernel folks too.]

I Lee Hetherington wrote:

>Mikael Pettersson wrote:
>
>> This patch for 2.2.x (originally by Artur Skawina for 2.3.x)
>> eliminates at least one problem of using gcc-2.95, namely its
>> utterly stupid and unnecessary 16-byte stack alignment default.
>
>Why is it utterly stupid and unnecessary?

It's unnecessary because the desired effect, stricter alignment of
certain data objects, can be achieved manually when and where it matters.
It's stupid because the default mode "optimizes" for a minority case
(floating-point and Streaming SIMD-intensive applications) at the
expense of integer/pointer/procedure-call-intensive programs, which,
I claim, are in the majority.

I originally discovered this problem when investigating a library
which broke when I upgraded from egcs-1.1.2 to gcc-2.95. At one point,
I diffed the assembly files generated by the two compilers; to my horror,
gcc-2.95 generated about 10% more lines of assembly code and bytes of
object code than egcs-1.1.2. All of the additional code was involved
in maintaining oversize 16-byte aligned stack frames.

If we look in gcc-2.95's info files, we find this explanation:

>`-mpreferred-stack-boundary=NUM'
> Attempt to keep the stack boundary aligned to a 2 raised to NUM
> byte boundary. If `-mpreferred-stack-boundary' is not specified,
> the default is 4 (16 bytes or 128 bits).
>
> The stack is required to be aligned on a 4 byte boundary. On
> Pentium and PentiumPro, `double' and `long double' values should be
> aligned to an 8 byte boundary (see `-malign-double') or suffer
> significant run time performance penalties. On Pentium III, the
> Streaming SIMD Extention (SSE) data type `__m128' suffers similar
> penalties if it is not 16 byte aligned.

So the processor prefers stricter-than-4-byte alignment for some data
types; without it, some operations run slower. Unless your code is f.p.
or SSE intensive, you won't notice.

>
> To ensure proper alignment of this values on the stack, the stack
> boundary must be as aligned as that required by any value stored
> on the stack. Further, every function must be generated such that
> it keeps the stack aligned.

This is simply not true.

1) Proper alignment of these data types can easily be achieved manually,
without imposing any restrictions on the "stack boundary" (i.e. SP/FP).
It is not necessary to keep SP 16-byte aligned at each procedure call.

    /* assumes sizeof(T) is a power of two and is the alignment wanted */
    char buffer[N*sizeof(T) + sizeof(T)-1];
    T *arr = (T*)(((unsigned long)buffer + sizeof(T)-1) & ~(sizeof(T)-1));
    /* now rejoice knowing arr[0..N-1] are all properly aligned */
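
For anyone who wants to compile the idea, here is a minimal sketch of
the same over-allocate-and-round-up trick. The names ALIGNMENT, MAXN,
pad_to_alignment and sum_aligned are made up for illustration, and the
raw buffer is declared as float rather than char only to stay clear of
aliasing questions. It assumes a 16-byte target and a power-of-two
alignment; it is not taken from any real library.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ALIGNMENT 16u  /* assumed target alignment; must be a power of two */
#define MAXN      64u  /* hypothetical capacity of the scratch array */

/* Bytes to add to p to reach the next ALIGNMENT boundary (0..ALIGNMENT-1). */
static size_t pad_to_alignment(const void *p)
{
    return (size_t)(-(uintptr_t)p & (ALIGNMENT - 1));
}

/* Sum n floats through an aligned scratch array carved out of an
   over-allocated local buffer.  SP itself is never realigned.
   *misalign receives the scratch array's address modulo ALIGNMENT. */
static float sum_aligned(const float *src, size_t n, size_t *misalign)
{
    float raw[MAXN + ALIGNMENT / sizeof(float) - 1];  /* slack for rounding */
    size_t pad = pad_to_alignment(raw);
    float *arr;
    float total = 0.0f;
    size_t i;

    assert(n <= MAXN);
    assert(pad % sizeof(float) == 0);  /* raw is already float-aligned */
    arr = raw + pad / sizeof(float);   /* first 16-byte aligned element */

    *misalign = (size_t)((uintptr_t)arr % ALIGNMENT);
    for (i = 0; i < n; i++) {
        arr[i] = src[i];
        total += arr[i];
    }
    return total;
}
```

The cost is at most ALIGNMENT-1 wasted bytes in the one function that
cares, instead of padding in every frame of the program.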

2) Application programmers who _really_ need FP or SSE performance
are rarely newbies. They know that they may have to adapt/tweak
their "codes" and "kernels" (FORTRAN-speak) to the hardware at hand.
At the very least, they know how to add "-mpreferred-stack-boundary=4"
to their Makefiles.

> Thus calling a function compiled with
> a higher preferred stack boundary from a function compiled with a
> lower preferred stack boundary will most likely misalign the
> stack. It is recommended that libraries that use callbacks always
> use the default setting.
>
> This extra alignment does consume extra stack space.

3) The additional code to maintain 16-byte stack alignment also
burdens the I-cache and ITLB, and it does need processor cycles
to execute.
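
To put a rough number on the stack-space side of this, here is a
deliberately simplified model: round each frame up to a multiple of
2^NUM bytes and compare NUM=4 against NUM=2. The functions padded_frame
and extra_stack are hypothetical, and the model ignores return
addresses, pushed arguments and everything else gcc's real frame layout
has to account for, so treat it as a back-of-the-envelope sketch only.

```c
#include <assert.h>
#include <stddef.h>

/* Round a frame size up to a multiple of 2^num bytes. */
static size_t padded_frame(size_t bytes, unsigned num)
{
    size_t boundary;

    assert(num < 8 * sizeof(size_t));
    boundary = (size_t)1 << num;
    return (bytes + boundary - 1) & ~(boundary - 1);
}

/* Extra stack bytes used by a call chain of the given depth when every
   frame is padded for NUM=4 (16 bytes) instead of NUM=2 (4 bytes). */
static size_t extra_stack(size_t frame, size_t depth)
{
    return depth * (padded_frame(frame, 4) - padded_frame(frame, 2));
}
```

In this model a 20-byte frame grows to 32 bytes under the default, so a
recursion 1000 calls deep pays 12000 bytes for alignment nobody asked
for. It says nothing about the extra instructions, which is where the
I-cache and cycle costs come from.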

I Lee Hetherington again:

>... Have you benchmarked it both ways?
>
>I have benchmarked various programs and they do run faster with the 16-byte stack alignment on Pentium IIIs.

Benchmark results will obviously depend on what the application code does.
I did benchmark an application of mine (an implementation of a programming
language which is translated into C before being compiled), using gcc-2.95.1
with and without "-mpreferred-stack-boundary=2" on a P5MMX. (My new PIII box
doesn't work yet.) The results were quite conclusive: the default 16-byte
aligned stacks cost 2% in code size and between 5 and 10% at runtime. YMMV.

In my opinion, gcc ought to implement __attribute__((aligned(BOUNDARY)))
for local stacked variables. (In gcc-2.95.1 the attribute appears to
be ignored for local variables. Pity.) It's not difficult, and it would
have lots of advantages over a global mode option.
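
What I have in mind would look roughly like the sketch below. The
function local_misalignment is made up for illustration; it reports the
address of an attributed local modulo 16. A compiler that honours the
attribute for automatic variables returns 0 here; as noted above,
gcc-2.95.1 appears not to, so there the result may be nonzero.

```c
#include <stdint.h>

/* Address of a local declared with aligned(16), modulo 16.
   Returns 0 only if the compiler honours the attribute for locals. */
static unsigned local_misalignment(void)
{
    float v[4] __attribute__((aligned(16))) = { 1.0f, 2.0f, 3.0f, 4.0f };

    return (unsigned)((uintptr_t)(void *)v % 16u);
}
```

Only the functions that declare such variables would pay for the
alignment; everything else keeps its 4-byte aligned frames.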

/Mikael
