Re: [patch 2/7] lib/hashmod: Add modulo based hash mechanism

From: George Spelvin
Date: Sat Apr 30 2016 - 16:52:42 EST


Thomas Gleixner wrote:
> I'll send a patch to replace hash_64 and hash_32.

Before you do that, could we look for a way to tweak the constants
in the existing hash?

It seems the basic "take the high bits of x * K" algorithm is actually
a decent hash function if K is chosen properly, and has a significant
speed advantage on machines with half-decent multipliers.

I'm researching how to do the multiply with fewer shifts and adds on
machines that need it. (Or we could use a totally different function
in that case.)

You say that
> hash64 is slightly faster as the modulo prime as it does not have the
> multiplication.

Um... are you sure you benchmarked that right? The hash_64 code you
used (Thomas Wang's 64->32-bit hash) has a critical path consisting of 6
shifts and 7 adds. I can't believe that's faster than a single multiply.

For 1,000,000 iterations on an Ivy Bridge, the multiply is 4x
faster (5x if out of line) for me!

The constants I recommend are
#define GOLDEN_RATIO_64 0x61C8864680B583EBull
#define GOLDEN_RATIO_32 0x61C88647


rdtsc times for 1,000,000 iterations of each of the two.
(The sum of all hashes is printed to prevent dead code elimination.)

hash_64 (sum) * PHI (sum) hash_64 (sum) * PHI (sum)
17552154 a52752df 3431821 2ce5398c 17485381 a52752df 3375535 2ce5398c
17522273 a52752df 3487206 2ce5398c 17551217 a52752df 3374221 2ce5398c
17546242 a52752df 3377355 2ce5398c 17494306 a52752df 3374202 2ce5398c
17495702 a52752df 3409768 2ce5398c 17505839 a52752df 3398205 2ce5398c
17501114 a52752df 3375435 2ce5398c 17539388 a52752df 3374202 2ce5398c
And with hash_64 forced inline:
13596945 a52752df 3374931 2ce5398c 13585916 a52752df 3411107 2ce5398c
13564616 a52752df 3374928 2ce5398c 13573465 a52752df 3425160 2ce5398c
13569712 a52752df 3374915 2ce5398c 13580461 a52752df 3397773 2ce5398c
13577481 a52752df 3374912 2ce5398c 13558708 a52752df 3417456 2ce5398c
13569044 a52752df 3374909 2ce5398c 13557193 a52752df 3407912 2ce5398c

That's 3.5 cycles vs. 13.5.

(I actually have two copies of the inlined code, to show code alignment
issues.)

On a Phenom, it's worse, 4 cycles vs. 35.
35083119 a52752df 4020754 2ce5398c 35068116 a52752df 4015659 2ce5398c
35074377 a52752df 4000819 2ce5398c 35068735 a52752df 4016943 2ce5398c
35067596 a52752df 4025397 2ce5398c 35074365 a52752df 4000108 2ce5398c
35071050 a52752df 4016190 2ce5398c 35058775 a52752df 4017988 2ce5398c
35055091 a52752df 4000066 2ce5398c 35201158 a52752df 4000094 2ce5398c




My simple test code appended for anyone who cares...

#include <stdint.h>
#include <stdio.h>

/* Phi = 0x0.9E3779B97F4A7C15F... */
/* -Phi = 0x0.61C8864680B583EA1... */
#define K 0x61C8864680B583EBull

static inline uint32_t hash1(uint64_t key)
{
key = ~key + (key << 18);
key ^= key >> 31;
key += (key << 2) + (key << 4);
key ^= key >> 11;
key += key << 6;
key ^= key >> 22;
return (uint32_t)key;
}

static inline uint32_t hash2(uint64_t key)
{
return (uint32_t)(key * K >> 32);
}

static inline uint64_t rdtsc(void)
{
uint32_t lo, hi;
asm volatile("rdtsc" : "=a" (lo), "=d" (hi));
return (uint64_t)hi << 32 | lo;
}

int
main(void)
{
int i, j;
uint32_t sum, sums[20];
uint64_t start, times[20];

for (i = 0; i < 20; i += 4) {
sum = 0;
start = rdtsc();
for (j = 0; j < 1000000; j++)
sum += hash1(j+0xdeadbeef);
times[i] = rdtsc() - start;
sums[i] = sum;

sum = 0;
start = rdtsc();
for (j = 0; j < 1000000; j++)
sum += hash2(j+0xdeadbeef);
times[i+1] = rdtsc() - start;
sums[i+1] = sum;

sum = 0;
start = rdtsc();
for (j = 0; j < 1000000; j++)
sum += hash1(j+0xdeadbeef);
times[i+2] = rdtsc() - start;
sums[i+2] = sum;

sum = 0;
start = rdtsc();
for (j = 0; j < 1000000; j++)
sum += hash2(j+0xdeadbeef);
times[i+3] = rdtsc() - start;
sums[i+3] = sum;
}
for (i = 0; i < 20; i++)
printf(" %llu %08x%c",
times[i], sums[i], (~i & 3) ? ' ' : '\n');
return 0;
}