a.out binaries that are 66% faster than ELF, problem found?

Ingo Molnar (mingo@pc5829.hil.siemens.at)
Thu, 27 Feb 1997 12:29:03 +0100 (MET)


On Thu, 27 Feb 1997, Marc Lehmann wrote:

> This is long known.. I thought this was because
> elf-shared libraries were somewhat slower, but that cannot account
> for 50%...
>
> The mapping for elf is more complicated (isn't it?),
> but this, too, should not account for 50%...
>
> I doubt that gcc -S will help... gcc's output
> is not significant different when using a.out (unless
> you use -fpic also).
>
> Any ideas? I'd like to have an answer to this question..
> back in the i486 days, the difference was about 10%,
> and nobody cared... but 50% (66% in that case)
> is way to much!

i think we have the answer, as the same problem is causing unjustified
stack overflows in the kernel. GCC is allocating lots of stack slots to
pseudoregisters (correct me if i'm saying something stupid), and the
optimization process removes 90% of them. But GCC doesnt 'compress' the
stack after that. Thus we have 1) a way too big stack 2) possibly
'scattered' stack slots.

a few RL examples from the kernel:

stack size -> 0x5c4 <UMSDOS_ioctl_dir> <----- function name
0x524 <aic7xxx_isr>
0x44c <smb_proc_readdir_long>
0x440 <smb_proc_setattr_trans2>
0x428 <smb_proc_getattr_trans2>
0x428 <aic7xxx_reset_device>
0x348 <vfat_find>
0x2a0 <elf_core_dump>
0x278 <umsdos_find>
0x250 <umsdos_rename_f>
0x24c <umsdos_readdir_x>
0x244 <block_write>
0x244 <block_read>
0x244 <UMSDOS_unlink>
0x22c <umsdos_lookup_x>
0x228 <UMSDOS_link>

for example, aic7xxx_isr() has a real stack usage of 53 bytes, but GCC
generates a 1316 bytes stack for it ...

Additionally, as Paul has already noted, for RC5_KEY_CHECK:

------------------------------------------------------------>
gcc-2.5.8 (fast) gcc-2.7.2.1 (slow)

RC5_KEY_CHECK: RC5_KEY_CHECK:
subl $12,%esp subl $304,%esp
<-----------------------------------------------------------

Look at the stack size difference. Now RC5_CHECK is different, there are
no 'lost stack slots', but lots of spilled registers:

08048e76 <RC5_KEY_CHECK+2e6> movl 0x110(%esp,1),%eax
08048e7d <RC5_KEY_CHECK+2ed> addl 0x804c1a8,%eax
08048e83 <RC5_KEY_CHECK+2f3> addl %ebx,%eax
08048e85 <RC5_KEY_CHECK+2f5> roll $0x3,%eax
08048e88 <RC5_KEY_CHECK+2f8> movl %eax,0x10c(%esp,1)
08048e8f <RC5_KEY_CHECK+2ff> movl %eax,0x804c210
08048e94 <RC5_KEY_CHECK+304> addl 0x10c(%esp,1),%edx
08048e9b <RC5_KEY_CHECK+30b> addl %ebx,%edx
08048e9d <RC5_KEY_CHECK+30d> movl 0x10c(%esp,1),%eax
08048ea4 <RC5_KEY_CHECK+314> addl %ebx,%eax
08048ea6 <RC5_KEY_CHECK+316> movl %eax,%ecx
08048ea8 <RC5_KEY_CHECK+318> roll %cl,%edx
08048eaa <RC5_KEY_CHECK+31a> movl 0x10c(%esp,1),%eax
08048eb1 <RC5_KEY_CHECK+321> addl 0x804c1ac,%eax
08048eb7 <RC5_KEY_CHECK+327> addl %edx,%eax
08048eb9 <RC5_KEY_CHECK+329> roll $0x3,%eax
08048ebc <RC5_KEY_CHECK+32c> movl %eax,0x108(%esp,1)
08048ec3 <RC5_KEY_CHECK+333> movl %eax,0x804c214
08048ec8 <RC5_KEY_CHECK+338> addl 0x108(%esp,1),%ebx

Do you see the pattern? Spilled registers are put into stack slots, but in
a reverse order. The pattern of stack space usage (RC5_KEY_CHECK stack
slot loads only, compiled with 2.7.2.1 + gcc-i486 optimizations, in
program order):

148,14c,154,14,150,13c,158,128,10,124,120,11c,118,114,110,10c,108,104,
100,fc,f8,f4,f0,ec,e8,e4,e0,dc,d8,d4,d0,128,d0,cc,10,cc,10,124,c8,120,
c4,11c,c0,118,bc,114,b8,110,b4,10c,b0,108,ac,104,a8,100,a4,fc,a0,f8,9c,
f4,98,f0,94,ec,90,e8,8c,e4,88,e0,84,dc,80,d8,7c,d4,78,d0,74,cc,70,10,
6c,68,64,c8,64,60,c4,60,5c,c0,5c,58,bc,58,54,b8,54,50,b4,50,4c,b0,4c,
48,ac,48,44,a8,44,40,a4,40,3c,a0,3c,38,9c,38,34,98,34,30,94,30,2c,90,
2c,28,8c,28,24,88,24,20,84,20,1c,80,1c,7c,10,78,10,74,138,70,134,6c,
68,64,60,5c,58,54,50,4c,48,44,40,3c,38,34,30,2c,28,24,20,1c,10,130,18,
142,13c,15c,

[ in the above list, '14c' means that in the assembly code a 'load sp+14c
into a register' happened ]

Do you see the pattern? We use almost all stack space (with those holes at
the beginning), but in the 'core of the core' we use stack space in
reverse order [hitting the memory subsystem in a bad way?]. Additionally
i'm not convinced wether we couldnt do this without stack access (in the
corecore part). Maybe GCC 2.6 manages to put all the corecore
functionality into registers.

These two 'stack problems' are only visible for functions which have lots
of 'complexity' in them. (lots of branches, constants, variable
assignments). But i'm sure this happens for almost all functions ... but
i'm not sure.

For the 'sparse stack problem' i dont know how hard it is to fix this in
GCC, my feeling is that all the abstractions are already there to 'remap'
stack slots, if yes then the fix should be easy. I'd sure play around with
a few PGCC snapshots if such optimization would be implemented ;)

and for the corecore problem, maybe some flag trics GCC into avoiding the
stack? Plus it's not even necessary to allocate that much stack space, it
could reuse stack space (those corecore stack slots are only locally
alive).

It would be cool if a GCC/PGCC guru could comment on this ;)

-- mingo