Re: [RFC][PATCH] Faster generic_fls

From: Willy TARREAU (willy@w.ods.org)
Date: Thu May 01 2003 - 12:16:27 EST


On Thu, May 01, 2003 at 10:53:21PM +0800, hugang wrote:
 
> Here is test in my machine, The faster is still table.

because, as Falk said, the tests are incremental and the branch prediction
works very well. I proposed a simple scrambling function based on bswap. Please
consider this function :

      f(i) = i ^ htonl(i) ^ htonl(i<<7)

It returns such values :

0x00000001 => 0x81000001
0x00000002 => 0x02010002
0x00000003 => 0x83010003
0x00000004 => 0x04020004
0x00000005 => 0x85020005
0x00000006 => 0x06030006
0x00000007 => 0x87030007
0x00000008 => 0x08040008
0x00000009 => 0x89040009
0x0000000a => 0x0a05000a
0x0000000b => 0x8b05000b

etc...

As you see, high bits move fast enough to shot a predictor.

The tree function as well as Daniel's "new" resist better to non linear suites.
BTW, the latest changes I did show that the convergence should be between
Daniel's function and mine, because there are common concepts. I noticed that
the expression used in Daniel's function is too complex for gcc to optimize it
well enough. In my case, gcc 3.2.3 coded too many jumps instead of conditional
moves. I saw that playing with -mbranch-cost changes the code. A mix between
the two is used here (and listed after). It's still not optimial, reading the
code, because there's always one useless jump and move. But in the worst case,
it gives exactly the same result as Daniel's. But in other cases, it is even
slightly faster. Honnestly, now it's just a matter of taste. Daniel's easier to
read, mine produces smaller code. This time, it's faster than others Athlon,
Alpha and Sparc. I Don't know about the PentiumIII nor the P4.

Here are the results on Athlon, gcc-3.2.3, then Alpha and Sparc.

Willy

====

4294967295 iterations of bswap/nil... checksum = 4294967295

real 0m53.133s
user 0m53.114s
sys 0m0.005s

4294967295 iterations of bswap/fls_table... checksum = 4294967262

real 1m44.686s
user 1m44.163s
sys 0m0.487s

4294967295 iterations of bswap/new... checksum = 4294967262

real 1m16.463s
user 1m16.144s
sys 0m0.314s

4294967295 iterations of bswap/fls_tree... checksum = 4294967262

real 1m16.511s
user 1m16.169s
sys 0m0.296s

== alpha 466 MHz , gcc-3.2.3
   gcc -O3 -o testfls -fomit-frame-pointer -mcpu=ev67 ===

4294967295 iterations of bswap/fls_tree... checksum = 4294967262

real 4m14.432s
user 4m13.540s
sys 0m0.038s

4294967295 iterations of bswap/fls_table... checksum = 4294967262

real 4m36.204s
user 4m35.368s
sys 0m0.030s

== ultra-sparc 333 MHz, gcc 3.1.1 (linear only)
   gcc -O3 -mcpu=v9 -fomit-frame-pointer ===

4294967295 iterations of fls_tree... checksum = 4294967265

real 1m52.680s
user 1m52.640s
sys 0m0.010s

4294967295 iterations of fls_table... checksum = 4294967265

real 3m24.514s
user 3m24.430s
sys 0m0.010s

======= here is the function :

unsigned fls_tree_fls(unsigned n) {
    register t = 0;

    if (n >= (1<<16)) {
        n >>= 16;
        t = 16;
    }

    if (n < (1<<8))
        if (n < (1<<4))
            if (n < 4)
                return t + n - ((n + 1) >> 2);
            else
                return t + 3 + (n>>3);
        else
            if (n < (1<<6))
                return t + 5 + (n>>5);
            else
                return t + 7 + (n>>7);
    else
        if (n < (1<<12))
            if (n < (1<<10))
                return t + 9 + (n>>9);
            else
                return t + 11 + (n>>11);
        else
            if (n < (1<<14))
                return t + 13 + (n>>13);
            else
                return t + 15 + (n>>15);
}

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/



This archive was generated by hypermail 2b29 : Wed May 07 2003 - 22:00:13 EST