Re: [PATCH 2/3] lib/int_sqrt: Optimize initial value compute

From: Will Deacon
Date: Tue Jul 25 2017 - 07:50:39 EST


Hi Peter,

On Mon, Jul 24, 2017 at 05:16:32PM +0200, Peter Zijlstra wrote:
> The initial value (@m) compute is:
>
> m = 1UL << (BITS_PER_LONG - 2);
> while (m > x)
> m >>= 2;
>
> Which is a linear search for the highest even bit smaller or equal to @x
> We can implement this using a binary search using __fls() (or better
> when its hardware implemented).
>
> m = 1UL << (__fls(x) & ~1UL);
>
> Especially for small values of @x; which are the more common
> arguments; the linear search is near to worst case, while the binary
> search of __fls() is a constant 6 branches.
>
> cycles: branches: branch-misses:
>
> PRE:
>
> hot: 43.633557 +- 0.034373 45.333132 +- 0.002277 0.023529 +- 0.000681
> cold: 207.438411 +- 0.125840 45.333132 +- 0.002277 6.976486 +- 0.004219
>
> SOFTWARE FLS:
>
> hot: 29.576176 +- 0.028850 26.666730 +- 0.004511 0.019463 +- 0.000663
> cold: 165.947136 +- 0.188406 26.666746 +- 0.004511 6.133897 +- 0.004386
>
> HARDWARE FLS:
>
> hot: 24.720922 +- 0.025161 20.666784 +- 0.004509 0.020836 +- 0.000677
> cold: 132.777197 +- 0.127471 20.666776 +- 0.004509 5.080285 +- 0.003874
>
> Averages computed over all values <128k using a LFSR to generate
> order. Cold numbers have a LFSR based branch trace buffer 'confuser'
> ran between each int_sqrt() invocation.

The hardware fls version works nicely for arm64, where it can be implemented
using the clz instruction (via the __builtin_clzl intrinsic).

Acked-by: Will Deacon <will.deacon@xxxxxxx>

Cheers,

Will