Re: [patch 2/7] lib/hashmod: Add modulo based hash mechanism

From: George Spelvin
Date: Sun May 01 2016 - 05:44:06 EST

> Sorry I did not express myself clear enough.

> hash64 (the single multiply with the adjusted golden ratio) is
> slightly faster than the modulo one which has two mutiplications.

Yes, I figured out that was probably what you were talking about,
and benchmarked it similarly.

But I noticed a much greater difference.

Wang * PHI % 4093 Wang * PHI % 4093
13599486 3494816 5238442 13584166 3460266 5239463
13589552 3466764 5237327 13582381 3422304 5276253
13569409 3407317 5236239 13566738 3393215 5267909
13575257 3413736 5237708 13596428 3379811 5280118
13583607 3540416 5325609 13650964 3380301 5265210

At 3.7 GHz, that's

* PHI: 1059 M ops/second
* Modulo: 706 M ops/second
* Wang: 271 M ops/second

Of course, that's a tight loop hashing; I presume your test case
has more overhead.

Anyway, my recommendation (I'll write the patches if you like) is:

* Modify the multiplicative constants to be
#define COLDEN_RATIO_32 0x61C88647
#define COLDEN_RATIO_64 0x61C8864680B583EB

* Change the prototype of hash_64 to return u32.

* Create a separate 32-bit implementation of hash_64() for the
BITS_PER_LONG < 64 case. This should not be Wang's or anything
similar because 64-bit shifts are slow and awkward on 32-bit
One option is something like __jhash_final(), but I think
it will suffice to do:

static __always_inline u32 hash_64(u64 val, unsigned int bits)
u32 hash = (u32)(val >> 32) * GOLDEN_RATIO_32 + (u32)val;
hash *= GOLDEN_RATIO_32;
return hash >> (32 - bits);
(S-o-b on that if you want it, of course.)

(If you want it out of line, make a helper function that
computes the 32-bit hash and have an inline wrapper that
does the shifting.)

* Eliminate the !CONFIG_ARCH_HAS_FAST_MULTIPLIER code path. Once we've
got rid of the 32-bit processors, we're left with 64-bit processors,
and they *all* have hardware multipliers. Even the MIPS R4000 (first
64-bit processor ever) and Alpha 21064 had one.

(Admittedly, the R4000 was 20 cycles, which is slower than Wang's hash,
but 18 of them could overlap other instructions.)

The worst multiply support is SPARCv9, which has 64x64-bit
multiply with a 64-bit result, but no widening multiply.

* If you feel ambitious, add a 32-bit CONFIG_ARCH_HAS_SLOW_MULTIPLIER
exception path.