Re: [PATCH] lib/int_sqrt.c: Optimize square root function

From: Peter Zijlstra
Date: Fri Jul 21 2017 - 07:40:59 EST


On Thu, Jul 20, 2017 at 04:24:32PM -0700, Linus Torvalds wrote:

> So mayube something like the attached?
>
> NOTE NOTE NOTE! This may be pure garbage. It's untested. But the code
> generation looks ok even if gcc is being way too clever and turns that
> first loop into a counted loop instead. Whatever, maybe it's the right
> thing to do.

It passes a -DVALIDATE=1 run (up to a billion or so), so it seems to work.


I've made the test thing a bit more complicated in order to address your
concerns for primed branch predictors (and once the branch predictors
get wiped, the predictable input values don't matter anymore either I
think, although I could easily LFSR generate them as well of course).

Find it attached in a tarball.


The results, from running ./test.sh (left SKL, right SNB):

(values <100k)

Cycles:


EVENT=0 -DLINUS=1
event: 29.533080 +- 0.046261 36.800100 +- 0.046067
EVENT=0 -DLINUS=1 -DWIPE_BTB=1
event: 143.513440 +- 0.152651 153.054640 +- 0.099403
EVENT=0 -DNEW=1 -DANSHUL=1
event: 41.457300 +- 0.048409 55.046100 +- 0.044968
EVENT=0 -DNEW=1 -DANSHUL=1 -DWIPE_BTB=1
event: 128.351400 +- 0.366873 161.183690 +- 0.132301
EVENT=0 -DNEW=1 -DANSHUL=1 -DFLS=1
event: 21.433170 +- 0.043859 27.672700 +- 0.063982
EVENT=0 -DNEW=1 -DANSHUL=1 -DFLS=1 -DWIPE_BTB=1
event: 79.850210 +- 0.133646 112.422470 +- 0.101465
EVENT=0 -DNEW=1 -DANSHUL=1 -DFLS=1 -DSOFTFLS=1
event: 31.771680 +- 0.037655 37.515160 +- 0.080305
EVENT=0 -DNEW=1 -DANSHUL=1 -DFLS=1 -DSOFTFLS=1 -DWIPE_BTB=1
event: 129.673440 +- 0.430308 125.514390 +- 0.117590


Branches:


EVENT=4 -DLINUS=1
event: 31.507740 +- 0.010470 31.507710 +- 0.010470
EVENT=4 -DLINUS=1 -DWIPE_BTB=1
event: 31.507720 +- 0.010470 31.507760 +- 0.010470
EVENT=4 -DNEW=1 -DANSHUL=1
event: 45.125510 +- 0.002696 45.125510 +- 0.002696
EVENT=4 -DNEW=1 -DANSHUL=1 -DWIPE_BTB=1
event: 45.125490 +- 0.002696 45.125540 +- 0.002697
EVENT=4 -DNEW=1 -DANSHUL=1 -DFLS=1
event: 21.252320 +- 0.005266 21.252330 +- 0.005266
EVENT=4 -DNEW=1 -DANSHUL=1 -DFLS=1 -DWIPE_BTB=1
event: 21.252320 +- 0.005266 21.252340 +- 0.005266
EVENT=4 -DNEW=1 -DANSHUL=1 -DFLS=1 -DSOFTFLS=1
event: 25.727940 +- 0.005975 25.727930 +- 0.005975
EVENT=4 -DNEW=1 -DANSHUL=1 -DFLS=1 -DSOFTFLS=1 -DWIPE_BTB=1
event: 25.727940 +- 0.005975 25.727930 +- 0.005975


Branch-Misses:


EVENT=5 -DLINUS=1
event: 0.022670 +- 0.000789 0.020920 +- 0.000812
EVENT=5 -DLINUS=1 -DWIPE_BTB=1
event: 5.481750 +- 0.006930 5.955130 +- 0.005228
EVENT=5 -DNEW=1 -DANSHUL=1
event: 0.025990 +- 0.000798 0.017170 +- 0.000690
EVENT=5 -DNEW=1 -DANSHUL=1 -DWIPE_BTB=1
event: 3.343600 +- 0.004249 5.723570 +- 0.004876
EVENT=5 -DNEW=1 -DANSHUL=1 -DFLS=1
event: 0.019040 +- 0.000731 0.022570 +- 0.000829
EVENT=5 -DNEW=1 -DANSHUL=1 -DFLS=1 -DWIPE_BTB=1
event: 2.156350 +- 0.004643 4.297650 +- 0.004910
EVENT=5 -DNEW=1 -DANSHUL=1 -DFLS=1 -DSOFTFLS=1
event: 0.019800 +- 0.000747 0.016780 +- 0.000688
EVENT=5 -DNEW=1 -DANSHUL=1 -DFLS=1 -DSOFTFLS=1 -DWIPE_BTB=1
event: 4.481360 +- 0.005505 4.518300 +- 0.004877




Which still doesn't explain your hatred of fls(), as even with cold
branches and a software fls, its faster than your latest proposal (as
can be explained by the lower total number of branches for the software
fls variant compared to your proposal).


And when we start to look at larger values, the fls() one pulls even
further ahead.

So I'll stick with my proposal... I can write up a proper Changelog and
maybe add my test to tools/testing/sqrt/ if you think its worth it.


---
lib/int_sqrt.c | 11 ++++++++---
1 file changed, 8 insertions(+), 3 deletions(-)

diff --git a/lib/int_sqrt.c b/lib/int_sqrt.c
index 1ef4cc344977..611933760225 100644
--- a/lib/int_sqrt.c
+++ b/lib/int_sqrt.c
@@ -7,12 +7,13 @@

#include <linux/kernel.h>
#include <linux/export.h>
+#include <linux/bitops.h>

/**
- * int_sqrt - rough approximation to sqrt
+ * int_sqrt - Computes the integer square root
* @x: integer of which to calculate the sqrt
*
- * A very rough approximation to the sqrt() function.
+ * Computes: floor(sqrt(@x))
*/
unsigned long int_sqrt(unsigned long x)
{
@@ -21,7 +22,11 @@ unsigned long int_sqrt(unsigned long x)
if (x <= 1)
return x;

- m = 1UL << (BITS_PER_LONG - 2);
+ m = 1UL << (__fls(x) & ~1U);
+
+ while (m > x)
+ m >>= 2;
+
while (m != 0) {
b = y + m;
y >>= 1;

Attachment: sqrt.tar
Description: Unix tar archive