__asm__("andl $0xffffffe0, %esp");
This aligns the inner stack to 32 bytes. Generally you also want to 'cut off'
the return address from the 'active' cache lines, so do this too:
__asm__("subl $12, %esp");
Depending on how many parameters there are, the cutoff should align the
first active stack slot to a cache line.
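For concreteness, here is how the two asm statements combine in a benchmark
caller. A minimal sketch, not my actual test setup: bench() is a made-up
name, and it assumes the enclosing function keeps a frame pointer (the GCC
2.7 default), so the %ebp-based epilogue restores %esp on return:

	/* hypothetical harness; bench() stands in for the hot function */
	extern void bench(void);

	int main(void)
	{
		__asm__("andl $0xffffffe0, %esp"); /* 32-byte align the stack */
		__asm__("subl $12, %esp");         /* cut off the return address */
		bench(); /* its first active stack slot now starts a cache line */
		return 0; /* the epilogue restores %esp via %ebp */
	}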
On GCC 2.7.2.1, stacks are already 4-byte aligned, but not 8-byte
aligned. Just change the 12 in steps of 4 to see the effect of this. I saw a
slight (1-2%) performance increase when hitting the right alignment.
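The sweep is easy to script if you take the cutoff from the compiler
command line. A hypothetical variant of the harness above (CUTOFF, STR and
XSTR are made-up names):

	#ifndef CUTOFF
	#define CUTOFF 12       /* override with e.g. gcc -O2 -DCUTOFF=8 */
	#endif
	#define XSTR(x) #x      /* stringify the value ... */
	#define STR(x)  XSTR(x) /* ... after macro expansion */

	extern void bench(void);

	int main(void)
	{
		__asm__("andl $0xffffffe0, %esp");
		__asm__("subl $" STR(CUTOFF) ", %esp"); /* "subl $8, %esp" etc. */
		bench();
		return 0;
	}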
To see how catastrophic bad stack alignment is, do:
__asm__("subl $1, %esp");
and watch the speed get halved ;)
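To put cycle numbers on it, rdtsc (Pentium and later) works fine. A minimal
sketch, with bench() again a placeholder:

	#include <stdio.h>

	/* read the cycle counter; the "=A" constraint returns edx:eax */
	static inline unsigned long long rdtsc(void)
	{
		unsigned long long t;
		__asm__ __volatile__("rdtsc" : "=A" (t));
		return t;
	}

	extern void bench(void);

	int main(void)
	{
		unsigned long long t0;

		__asm__("subl $1, %esp"); /* deliberately misalign the stack */
		t0 = rdtsc();
		bench();
		printf("%llu cycles\n", rdtsc() - t0);
		return 0;
	}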
So I would say stack alignment alone does not explain the performance
difference, at least not on the Pentium systems I've tested it on. Stack
alignment isn't perfect currently, but there are so many spilled registers
in the core function on x86 that it doesn't make much difference.
-- mingo