Re: [PATCH v3 06/10] fs/namei.c: Improve dcache hash function

From: George Spelvin
Date: Mon May 30 2016 - 14:10:29 EST

On Mon, 30 May 2016 at 18:27:21 +0200, Peter Zijlstra wrote:
> On Mon, May 30, 2016 at 12:06:18PM -0400, George Spelvin wrote:
> Right; as stated performance really isn't a goal here.

I understand, but a 64x64-bit multiply on 32-bit is pretty annoyingly
expensive: in time, in code size, and in register pressure, which bloats
surrounding code.
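
To make the cost concrete, here is a minimal userspace sketch of what the
low 64 bits of a 64x64-bit product decompose into when only 32-bit
multiplies are available. The helper name mul64_via_32() is made up for
this illustration, and a real compiler may schedule the work differently.

```c
#include <stdint.h>

/* Hypothetical helper: low 64 bits of a * b using only 32-bit multiplies. */
static uint64_t mul64_via_32(uint64_t a, uint64_t b)
{
	uint32_t a_lo = (uint32_t)a, a_hi = (uint32_t)(a >> 32);
	uint32_t b_lo = (uint32_t)b, b_hi = (uint32_t)(b >> 32);

	/* One widening 32x32->64 multiply for the two low halves. */
	uint64_t lo = (uint64_t)a_lo * b_lo;

	/* Two truncating 32x32->32 multiplies for the cross terms. */
	uint32_t cross = a_lo * b_hi + a_hi * b_lo;

	/* a_hi * b_hi only affects bits 64 and up, so it is dropped. */
	return lo + ((uint64_t)cross << 32);
}
```

That is three multiplies plus the adds, with four 32-bit input halves and
a 64-bit intermediate all live at once, which is where the register
pressure comes from.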

>> If performance mattered, I'd be inclined to use one or two iterations
>> of the 32-bit HASH_MIX() function, which is specifically designed
>> to add 32 bits to a 64-bit hash value.
> Ah, I missed that HASH_MIX() had 64 bit state, so much for being able to
> read it seems. Also; should we not move that entire section of
> fs/namei.c into linux/hash.h ?
> These two primitives seem generally useful.

Actually, the state is 2*sizeof(long), which is 128 bits on 64-bit.

I thought about moving it out to <linux/hash.h> as you suggest, but given
the tight coupling to the dcache hash, I decided not to until another
user showed up.

Remember, HASH_MIX() is *heavily* optimized for speed and just-barely-
adequate hash mixing for the dcache use case. Other users should think
carefully about using it.

In particular, it's designed for 32 bits of output. It does *not* achieve
full-width mixing, but rather achieves mixing to at least 32 bits of
output in the two rounds it has before cancellation can occur. If you
want 64 bits of hash, as in your application, it's kind of marginal.

>> A more thorough mixing would be achieved by __jhash_mix(). Basically:
>> static inline u64 iterate_chain_key(u64 key, u32 idx)
>> {
>> 	u32 k0 = key, k1 = key >> 32;
>> 	__jhash_mix(idx, k0, k1); /* Macro that modifies arguments! */
>> 	return k0 | (u64)k1 << 32;
>> }
>> (The order of arguments is chosen to preserve the two "most-hashed" values.)
> (I'd never have managed to deduce that property given the information in
> jhash.h)

The last line of __jhash_mix(a,b,c) is
c -= b; c ^= rol32(b, 4); b += a;

Thus, b and c are the last variables assigned to. If you had dropped
one of them and returned a instead, you'd have created dead code.
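
For anyone who wants to experiment outside the kernel, the following is a
userspace sketch of the suggestion above. The rol32() here is a local
stand-in for the kernel helper, and JHASH_MIX() is a local copy of the
mixing steps written from the lookup3-style mix; treat the rotation
constants as an assumption and check them against <linux/jhash.h> before
relying on them.

```c
#include <stdint.h>

/* Userspace stand-in for the kernel's rol32(); s is always 1..31 here. */
static inline uint32_t rol32(uint32_t w, unsigned int s)
{
	return (w << s) | (w >> (32 - s));
}

/* Local copy of the mixing steps; modifies all three arguments. */
#define JHASH_MIX(a, b, c) do {			\
	a -= c;  a ^= rol32(c, 4);  c += b;	\
	b -= a;  b ^= rol32(a, 6);  a += c;	\
	c -= b;  c ^= rol32(b, 8);  b += a;	\
	a -= c;  a ^= rol32(c, 16); c += b;	\
	b -= a;  b ^= rol32(a, 19); a += c;	\
	c -= b;  c ^= rol32(b, 4);  b += a;	\
} while (0)

/* Fold a 32-bit idx into a 64-bit key, keeping the macro's b and c. */
static inline uint64_t iterate_chain_key(uint64_t key, uint32_t idx)
{
	uint32_t k0 = (uint32_t)key, k1 = (uint32_t)(key >> 32);

	JHASH_MIX(idx, k0, k1);
	return k0 | (uint64_t)k1 << 32;
}
```

Passing idx as a means the one value thrown away is the one the final
line only reads; k0 and k1 are the two it writes.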

>> Also, I just had contact from the hppa folks who have brought to my
>> attention that it's an example of an out-of-order superscalar CPU that
>> *doesn't* have a good integer multiplier. For general multiplies,
>> you have to move values to the FPU and the code is a pain.
> Egads, that's horrible, but sounds exactly like the thing you 'like'
> given these patches :-) Good luck with that.

Well, low-level bit-twiddling can be kind of fun.

In this case, that level of effort was required to improve the hash
mixing from "embarrassingly bad" (did you *see* what it was before
0fed3ac866?) without adding delay to a scorchingly hot code path that
Linus watches like a hawk.