Re: [RFC][PATCH] Faster generic_fls

From: Willy Tarreau (willy@w.ods.org)
Date: Wed Apr 30 2003 - 02:19:07 EST


Hi Daniel,

On Wed, Apr 30, 2003 at 04:46:23AM +0200, Daniel Phillips wrote:
> Here's a faster implementation of generic_fls, that I discovered accidently,
> by not noticing 2.5 already had a generic_fls, and so I rolled my own. Like
> the incumbent, it's O log2(bits), but there's not a lot of resemblance beyond
> that. I think the new algorithm is inherently more parallelizable than the
> traditional approach. A processor that can speculatively evaluate both sides
> of a conditional would benefit even more than the PIII I tested on.

I did some tests on an Athlon MP2200 (1.8 GHz) and a Pentium 4/2.0 GHz, with
both gcc 2.95.3 and gcc 3.2.3.

The results are interesting, because they show that the compiler used is far
more important than the code ! To resume quickly, your code is faster than the
old one on an athlon, gcc-2.95.2 -O3, and in any combination of gcc-3.2.3.
On a P4, new code compiled with 3.2.3 is still slower than old code with 2.95.3.

But gcc's optimizer seems to be even worse with each new version. The NIL
checksum takes 3 times more to complete on gcc3/athlon than gcc2/athlon !!!
And on some test, the P4 goes faster when the code is optimized for athlon than
for P4 !

Oh, and the new2 function is another variant I have been using by the past, but
it is slower here. It performed well on older machines with gcc 2.7.2. Its main
problem may be the code size, due to the 32 bits masks. Your code has the
advantage of being short and working on small data sets.

#define new2_fls32(___a) ({ \
    register int ___x, ___bits = 0; \
    if (___a) { \
        ___x = (___a); \
        ++___bits; \
        if (___x & 0xffff0000) { ___x &= 0xffff0000; ___bits += 16;} \
        if (___x & 0xff00ff00) { ___x &= 0xff00ff00; ___bits += 8;} \
        if (___x & 0xf0f0f0f0) { ___x &= 0xf0f0f0f0; ___bits += 4;} \
        if (___x & 0xcccccccc) { ___x &= 0xcccccccc; ___bits += 2;} \
        if (___x & 0xaaaaaaaa) { ___x &= 0xaaaaaaaa; ___bits += 1;} \
    } \
    ___bits; \
    })

Here are the results.

Cheers,
Willy

===== gcc-2.95.3 -march=i686 -O2 / athlon MP/2200 (1.8 GHz) =====

4294967295 iterations of nil... checksum = 4294967295

real 0m4.778s
user 0m4.770s
sys 0m0.010s
4294967295 iterations of old... checksum = 4294967265

real 0m37.923s
user 0m37.920s
sys 0m0.000s
4294967295 iterations of new... checksum = 4294967265

real 0m47.956s
user 0m47.920s
sys 0m0.030s
4294967295 iterations of new2... checksum = 4294967295

real 0m43.884s
user 0m43.860s
sys 0m0.020s

===== gcc-2.95.3 -march=i686 -O3 / athlon MP/2200 =====

4294967295 iterations of nil... checksum = 4294967295

real 0m4.779s
user 0m4.770s
sys 0m0.010s
4294967295 iterations of old... checksum = 4294967265

real 0m39.235s
user 0m39.180s
sys 0m0.050s
4294967295 iterations of new... checksum = 4294967265

real 0m32.175s
user 0m32.170s
sys 0m0.000s
4294967295 iterations of new2... checksum = 4294967295

real 0m37.560s
user 0m37.520s
sys 0m0.020s

===== gcc-2.95.3 -march=i686 -O3 / Pentium4 2.0 GHz =====

4294967295 iterations of nil... checksum = 4294967295

real 0m4.370s
user 0m4.370s
sys 0m0.000s
4294967295 iterations of old... checksum = 4294967265

real 0m22.148s
user 0m22.150s
sys 0m0.000s
4294967295 iterations of new... checksum = 4294967265

real 0m25.854s
user 0m25.850s
sys 0m0.000s
4294967295 iterations of new2... checksum = 4294967295

real 0m30.743s
user 0m28.930s
sys 0m0.160s

==== gcc 3.2.3 -march=i686 -O3 / Pentium 4 / 2.0 GHz ====

4294967295 iterations of nil... checksum = 4294967295

real 0m6.719s
user 0m6.710s
sys 0m0.000s
4294967295 iterations of old... checksum = 4294967265

real 0m45.571s
user 0m43.510s
sys 0m0.180s
4294967295 iterations of new... checksum = 4294967265

real 0m28.251s
user 0m26.420s
sys 0m0.010s
4294967295 iterations of new2... checksum = 4294967265

real 0m48.881s
user 0m47.080s
sys 0m0.000s
-> awful, 60% more time than with gcc-2.95.3 !

==== gcc 3.2.3 -march=pentium4 -O3 / Pentium 4 / 2.0 GHz ====
(yes, gcc 3.2 may be producing better code... for the eyes, not
for the CPU)

4294967295 iterations of nil... checksum = 4294967295

real 0m4.422s
user 0m4.420s
sys 0m0.000s
4294967295 iterations of old... checksum = 4294967265

real 0m34.950s
user 0m34.950s
sys 0m0.000s
4294967295 iterations of new... checksum = 4294967265

real 0m27.667s
user 0m25.810s
sys 0m0.000s
4294967295 iterations of new2... checksum = 4294967295

real 0m42.945s
user 0m42.760s
sys 0m0.160s

==== gcc 3.2.3 -march=athlon-mp -O3 / Pentium 4 / 2.0 GHz ====

4294967295 iterations of nil... checksum = 4294967295

real 0m6.486s
user 0m6.490s
sys 0m0.000s
4294967295 iterations of old... checksum = 4294967265

real 0m41.240s
user 0m39.220s
sys 0m0.160s
4294967295 iterations of new... checksum = 4294967265

real 0m25.780s
user 0m25.780s
sys 0m0.000s
4294967295 iterations of new2... checksum = 4294967295

real 0m51.789s
user 0m49.750s
sys 0m0.010s

==== gcc 3.2.3 -march=athlon-mp -O3 / Athlon MP2200 / 1.8 GHz ====

4294967295 iterations of nil... checksum = 4294967295

real 0m14.544s
user 0m14.540s
sys 0m0.000s
4294967295 iterations of old... checksum = 4294967265

real 0m44.609s
user 0m44.600s
sys 0m0.000s
4294967295 iterations of new... checksum = 4294967265

real 0m26.765s
user 0m26.760s
sys 0m0.000s
4294967295 iterations of new2... checksum = 4294967295

real 1m6.066s
user 1m6.030s
sys 0m0.030s

==== gcc 3.2.3 -march=pentium4 -O3 / Athlon MP2200 / 1.8 GHz ====

4294967295 iterations of nil... checksum = 4294967295

real 0m10.775s
user 0m10.750s
sys 0m0.030s
4294967295 iterations of old... checksum = 4294967265

real 0m45.837s
user 0m45.830s
sys 0m0.000s
4294967295 iterations of new... checksum = 4294967265

real 0m24.414s
user 0m24.410s
sys 0m0.010s
4294967295 iterations of new2... checksum = 4294967295

real 1m7.299s
user 1m7.280s
sys 0m0.010s

==== gcc 3.2.3 -march=i686 -O3 / Athlon MP2200 / 1.8 GHz ====

4294967295 iterations of nil... checksum = 4294967295

real 0m13.010s
user 0m13.000s
sys 0m0.010s

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/



This archive was generated by hypermail 2b29 : Wed Apr 30 2003 - 22:00:33 EST