Okay, since I first mentioned this, I have looked into the asm output
of the two builds (a.out and ELF) a bit. The difference appears to come
from the asm output of gcc, and hence this is the last you will see of
it on linux-kernel.
I have taken the asm output from a.out gcc-2.5.8, and cleaned it up
so that I could as and ld it to create an ELF binary. You only need to
do this with client.s, as that is where the CPU intensive bit is.
Remove the extra underscores, delete the call to ___main and you are
set. I linked that up to create an ELF binary with the asm output from
the a.out gcc-2.5.8, and it screams along just like the pure a.out binary.
So it is clearly not an a.out vs ELF issue, but rather a question of
why gcc is now generating slower asm code.
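For anyone wanting to repeat the cleanup, it amounts to roughly the
following sed sketch. The helper name aout_to_elf, the sample input and
the symbol _rc5_check are made up for illustration; the real client.s
may need more hand editing, and this pattern would also touch
underscores inside string literals.

```shell
#!/bin/sh
# Sketch: convert a.out-style gcc asm into something an ELF as/ld will take.
# aout_to_elf is a made-up helper name; it reads asm on stdin.
#   1st expression: delete the call to ___main
#   2nd expression: strip one leading underscore from symbols inside a line
#   3rd expression: same, for a symbol that starts in column 0 (labels)
aout_to_elf() {
    sed -e '/call[[:space:]]*___main/d' \
        -e 's/\([^A-Za-z0-9_]\)_\([A-Za-z_]\)/\1\2/g' \
        -e 's/^_\([A-Za-z_]\)/\1/'
}

# Made-up sample input, not the real client.s.
aout_to_elf <<'EOF'
.globl _main
_main:
    pushl %ebp
    call ___main
    call _rc5_check
    ret
EOF
```

With the real file this would be something like
"aout_to_elf < client.s > client-elf.s", followed by as and ld.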
Others have also mentioned the difference in alignment (2,4 for a.out
and 4,16 for ELF), so I imposed the same (ELF) alignment on the asm
output from 2.5.8 by sed'ding .align 4 --> .align 16 and .align 2 -->
.align 4. This in turn resulted in a slight increase (12065x -> 12075x)
in performance, making the ELF binary faster than the a.out one since the
alignment now favours the i486 cache design.
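The sed step looks roughly like this. It is a sketch: realign is a
made-up helper name and the sample input is invented. The one thing to
watch is the order of the two substitutions, since sed runs both on
every line.

```shell
#!/bin/sh
# Sketch: bump the a.out alignment directives to the ELF values.
# realign is a made-up helper name; it reads asm on stdin.
# Order matters: rewrite 4 -> 16 first, then 2 -> 4. The other way round,
# the fresh ".align 4" lines would get bumped again to ".align 16".
# The ([^0-9]|$) part keeps ".align 4" from matching e.g. ".align 40".
realign() {
    sed -E -e 's/\.align[[:space:]]+4([^0-9]|$)/.align 16\1/' \
           -e 's/\.align[[:space:]]+2([^0-9]|$)/.align 4\1/'
}

# Made-up sample input.
realign <<'EOF'
    .align 4
loop:
    .align 2,0x90
    .align 2
EOF
```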
I'd have to look at the two asm outputs more closely to see why the
main loop (RC5_KEY_CHECK) is so lame with newer gcc's, but I am not
much of an asm guru and so I doubt I'd see the problem even
if it was staring me in the face. For what it's worth, the asm output
is 1067 lines for the fast one and 1269 lines for the slow one.
That in itself is about 19% more lines. Hrrm,
I wonder if gcc-2.7.2 is slower for other things as well, or if it
is just something this app triggers...
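Raw line counts include labels and directives as well as instructions.
A slightly fairer count can be had with something like the sketch
below, which assumes the usual gcc layout (instructions indented,
labels in column 0, directives starting with a dot). count_insns is a
made-up helper name and the sample input is invented.

```shell
#!/bin/sh
# Sketch: count only the lines that look like instructions, i.e. indented
# lines whose first token starts with a lowercase letter. Labels (column 0)
# and directives (leading dot) are skipped.
# count_insns is a made-up helper name; it reads asm on stdin.
count_insns() {
    grep -Ec '^[[:space:]]+[a-z]'
}

# Made-up sample input: one label, one directive, two instructions.
count_insns <<'EOF'
main:
    .align 4
    movl %eax,%ebx
    ret
EOF
```

Running "count_insns < client.s" on each compiler's output gives two
numbers that can be compared directly.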
The moral of the story is that the binary is definitely a lot
faster if it was built with a pre-2.7 gcc. Both
gcc-2.6.3 and gcc-2.5.8 produce faster asm code for this program,
but gcc-2.7.0 appears to be slow like gcc-2.7.2 is (if not worse).
If you have access to an older gcc, you can home-brew your own ELF
rc5 binary that is much faster than what you are probably using now.
Paul.