Re: Semaphore assembly-code bug

From: Andi Kleen
Date: Sat Oct 30 2004 - 19:42:49 EST


> I personally am a _huge_ believer in small code.
>
> The sequence
>
> popl %eax
> popl %ecx
> popl %edx
> popl %eax
>
> is four bytes. In contrast, the three moves and an add is 15 bytes. That's
> almost 4 times as big.

Using the long stack setup code was found to be a significant
win when enough registers were saved (several percent in real benchmarks)
with gcc on K8. It speeds up all function calls considerably because it
eliminates several stalls on each function entry/exit. The popls
all depend on each other because of their implicit reference
to esp.
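
To make the comparison concrete, here is a rough sketch of the two forms
side by side. The pop sequence is the one quoted above; the mov offsets are
my assumption about what the "three moves and an add" look like, chosen so
that both forms leave the same values in the registers. Byte counts follow
from the standard IA-32 encodings.

```asm
# Form A: pops. One byte each (opcode 0x58+reg), 4 bytes total.
# Every popl both reads and writes %esp, so each one has to wait
# for the %esp update of the one before it.
        popl    %eax            # throwaway: just skips the top stack slot
        popl    %ecx
        popl    %edx
        popl    %eax

# Form B: three loads plus one stack adjustment, 4+4+4+3 = 15 bytes.
# All three loads address off the same unmodified %esp, so they do
# not depend on each other; %esp is written only once, at the end.
        movl    4(%esp), %ecx   # 8b 4c 24 04
        movl    8(%esp), %edx   # 8b 54 24 08
        movl    12(%esp), %eax  # 8b 44 24 0c
        addl    $16, %esp       # 83 c4 10
```

Form A is the small one, form B is the one without the serial chain
through esp.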

Yes, it bloats the code, but function calls happen so often that having them
faster is really noticeable.

The K8 has quite big caches and is not decoding limited, so it
wasn't too bad a tradeoff there.

Ideally you would want to do it only on hot functions and optimize
rarely called functions for code size, but that would require profile
feedback, which is often not feasible (JITs have an advantage here).

Unfortunately I don't think profile feedback is practically feasible for
the kernel because we rely on being able to recreate the same vmlinux for
debugging. [It's a pity actually because modern compilers do a lot better
with profile feedback]

On P4, on the other hand, it doesn't help at all and only makes
the code bigger. I did it by hand in the x86-64 syscall
code too (that was before there was EM64T, but I still think it was a
good idea). Perhaps AMD will add special hardware in some future CPU that
also makes it unnecessary, but currently it's like this and it helps.

-Andi
