Harald> but no reason to be too happy about the optimisation of gcc
Harald> (since 17% less wouldn't be that bad for a weird
Harald> f2c-converted strange fp based C program only using REAL*4
Harald> aka float):
gcc is very aggressive about aligning branch targets. IMHO, it is way
too aggressive. Instruction bandwidth can quite easily become the
bottleneck, and putting in tons of NOPs doesn't help in such a case.
For example, with the 21064, executing a linear sequence of code out
of the second-level cache can in the extreme case result in the CPU
being busy for 4 cycles, then stalling for 9 cycles (waiting for
the prefetch buffer to be transferred into the i-cache), then being
busy for 4 cycles again, just to stall for another 9 cycles. A CPU
utilization of 4/13, or about 31%, doesn't help in making things go
fast (I believe these numbers are correct for the 21064; I am not sure
about the 21064A, and the 21164 is certainly different).
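As a rough sketch of that arithmetic: the function below just computes
busy / (busy + stall) for a repeating busy/stall pattern. The name
utilization() is made up for illustration, and the 4 and 9 are the
cycle counts quoted above, not measurements.

```c
#include <assert.h>

/* Fraction of cycles in which the CPU is doing useful work, given a
 * repeating pattern of `busy` cycles followed by `stall` cycles.
 * Hypothetical helper, for illustration only. */
double utilization(unsigned busy, unsigned stall)
{
    assert(busy + stall > 0);   /* avoid dividing by zero */
    return (double)busy / (double)(busy + stall);
}
```

utilization(4, 9) is 4/13, i.e. a bit under a third of the cycles.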
In any case, as a simple experiment, one could use a gcc configured
for OSF/1 (using the DEC assembler, which ignores the .align
directive, AFAIK) and compare the speed of the code it generates with
that of the Linux gcc (which uses gas, which honors .align
directives). It would be interesting to see what the correlation
between code size and speed would be.
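To get a feel for how much alignment can add to code size, here is a
small sketch that computes the padding needed to bring an address up
to an alignment boundary. The function names and the 16-byte boundary
in the usage note are assumptions for illustration; I am not claiming
that is the boundary gcc actually uses. Alpha instructions are 4 bytes
each, so the padding translates directly into a NOP count.

```c
#include <assert.h>

/* Bytes of padding needed to bring `addr` up to the next multiple of
 * `align`, which must be a power of two. Returns 0 if `addr` is
 * already aligned. Hypothetical helper, for illustration only. */
unsigned long pad_bytes(unsigned long addr, unsigned long align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return (align - (addr & (align - 1))) & (align - 1);
}

/* Same padding expressed as a number of 4-byte Alpha instructions.
 * Assumes `addr` is itself a multiple of 4. */
unsigned long pad_nops(unsigned long addr, unsigned long align)
{
    return pad_bytes(addr, align) / 4;
}
```

For example, a branch target that would otherwise land at 0x1004 needs
12 bytes (3 NOPs) to reach a 16-byte boundary; one that already sits
at 0x1000 needs none.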
Similarly, note that -O3 generally results in code that is slower
than -O2. I think it's because gcc is too aggressive w.r.t. inlining.
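One way to check this on a given program is a minimal timing harness,
built once with -O2 and once with -O3 and run on the same machine.
This is only a sketch: time_reps() and sample_kernel() are hypothetical
names, and the kernel is a stand-in for whatever code is being
measured.

```c
#include <assert.h>
#include <time.h>

/* Written to at the end of the kernel so the compiler cannot discard
 * the computation as dead code. */
static volatile float sink;

/* Stand-in workload: a short single-precision summation loop. */
void sample_kernel(void)
{
    float acc = 0.0f;
    for (int i = 1; i <= 1000; i++)
        acc += 1.0f / (float)i;
    sink = acc;
}

/* Call fn() `reps` times and return the CPU time used, in seconds,
 * as reported by the standard clock() function. */
double time_reps(void (*fn)(void), int reps)
{
    assert(fn != NULL && reps > 0);
    clock_t start = clock();
    for (int i = 0; i < reps; i++)
        fn();
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

Comparing time_reps(sample_kernel, N) for the two builds, together
with the text-segment size of each binary, gives one data point for
the size versus speed question.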
--david