Re: [PATCH v2] tile: avoid using clocksource_cyc2ns with absolute cycle count

From: Chris Metcalf
Date: Thu Nov 17 2016 - 15:16:29 EST

On 11/17/2016 4:53 AM, Peter Zijlstra wrote:
On Wed, Nov 16, 2016 at 03:16:59PM -0500, Chris Metcalf wrote:
PeterZ (cc'ed) then improved it to use __int128 math via
mul_u64_u32_shr(), but that doesn't help tile; we only do one multiply
instead of two, but the multiply is handled by an out-of-line call to
__multi3, and the sched_clock() function ends up about 2.5x slower as
a result.
Well, only if you set CONFIG_ARCH_SUPPORTS_INT128, otherwise it reduces
to 2 32x23->64 multiplications, of which one if conditional on there
actually being bits set in the high word of the u64 argument.

I didn't notice that. It took me down an interesting rathole.

Obviously the branch optimization won't help on cycle counter values,
since we blow out of the low 32 bits in the first few seconds of
uptime. So the conditional test won't help, but the 32x32
multiply optimizations should.

However, I was surprised to discover that the compiler doesn't always
catch the 32x32 case. It does for simple cases on gcc 4.4, but if
you change the compiler version or the complexity of the code, it can
lose sight of the optimization opportunity, and in fact that happens
in mul_u64_u32_shr(), and we get 64x64 multiplies. I passed this
along to our compiler team as an optimization bug.

Given that, it turns out it's always faster to do the unconditional
path on tilegx. The basic machine instruction is a 32x32
multiply-accumulate, but unlike most tilegx instructions, it causes a
1-cycle RAW hazard stall if you try to use the result in the next
instruction. Similarly, mispredicted branches take a 1-cycle stall.
The unconditional code pipelines the multiplies completely and runs in
10 cycles; the conditional code has two RAW hazard stalls and a branch
stall, so it takes 12 cycles even when it skips the second multiply.

Working around the missed compiler optimization by taking the existing
mul_u64_u32_shr() and replacing "*" with calls to __insn_mul_lu_lu()
to use the compiler builtin gives a 10-cycle function (assuming we
have to do both multiplies). So this is the same performance as the
pipelined mult_frac() that does the overlapping 64x64 multiplies.

We can do better by providing a completely hand-rolled version of the
function, either using "*" if the compiler optimization is fixed, or
__insn_mul_lu_lu() if it isn't, that doesn't do a conditional branch:

static inline u64 mul_u64_u32_shr(u64 a, u64 mul, unsigned int shift)
return (__insn_mul_lu_lu(a, mul) >> shift) +
(__insn_mul_lu_lu(a >> 32, mul) << (32 - shift));

This compiles down to 5 cycles with no hazard stalls. It's not
completely clear where I'd put this to override the <linux/math64.h>
version; presumably in <asm/div64.h>? Of course I'd then also have to
make it conditional on __tilegx__, since tilepro has a different set
of multiply instructions, as it's an ILP32 ISA.

I'm a little dubious that it's worth the investment in build
scaffolding to do this to save 5 cycles, so I think for now I will
just keep the mult_frac() version, and push it to stable to fix the
bug with the cycle counter overflowing. Depending on
what/when I hear back from the compiler team, I will think about
saving those few extra cycles with a custom mul_u64_u32_shr().

Chris Metcalf, Mellanox Technologies