Re: [PATCH] speed up on find_first_bit for i386 (let compiler dothe work)

From: Maciej W. Rozycki
Date: Fri Jul 29 2005 - 09:41:40 EST


On Thu, 28 Jul 2005, Linus Torvalds wrote:

> There may be more upsides on other architectures (*cough*ia64*cough*) that
> have strange scheduling issues and other complexities, but on x86 in
> particular, the __builtin_xxx() functions tend to be a lot more pain than
> they are worth. Not only do they have strange limitations (on selection of

They can be buggy, sure, just like any code. If they indeed are we may
want to avoid them rather than requiring a GCC upgrade, of course.

> opcodes but also for compiler versions), but they aren't well documented,
> and semantics aren't clear.

Hmm, that's what's in the GCC info pages for the relevant functions
(I've omitted the "l" and "ll" variants):

"-- Built-in Function: int __builtin_ffs (unsigned int x)
Returns one plus the index of the least significant 1-bit of X, or
if X is zero, returns zero.

-- Built-in Function: int __builtin_clz (unsigned int x)
Returns the number of leading 0-bits in X, starting at the most
significant bit position. If X is 0, the result is undefined.

-- Built-in Function: int __builtin_ctz (unsigned int x)
Returns the number of trailing 0-bits in X, starting at the least
significant bit position. If X is 0, the result is undefined."

If that's not enough, then what would be? I'm serious -- if you find it
inadequate, then perhaps it could be improved.

> In contrast, the gcc builtins probably match some standard that is not
> only harder to find, but also has some _other_ definition for what happens
> for the zero case, so the builtins automatically end up having problems
> due to semantic mis-match between the CPU and the standard.

GCC should know the semantics of underlying CPU instructions used and be
able to optimize expressions for common cases, e.g. like:

return x == 0 ? 32 : __builtin_ctz(x);

when the CPU provides a "ctz" operation that returns 32 for 0.

> Basic rule: inline assembly is _better_ than random compiler extensions.
> It's better to have _one_ well-documented extension that is very generic
> than it is to have a thousand specialized extensions.

It depends on how many submodel-specific variants of inline assembly you
need, how many reloads are required for constraints, possibly defeating
the gain, etc.

In this particular case "bsf" and "bsr" are notoriously slow for some
i386 submodels, so using the generic O(log n) algorithm may result in
better performance for them. E.g. the execution time for "bsf" for the
original i386 is 11 + 3n clock ticks (n refers to the resulting bit
index), "bsr" for the i486 is 7 - 104 ticks (so again 3n), for Pentium --
7 - 72 ticks (2n, then). It does not immediately mean they are worse, but
they are slow enough for the pessimistic case checking alternatives is not
unreasonable.

Maciej
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/