Re: [RFC][PATCH 0/3] gcc work-around and math128

From: Andy Lutomirski
Date: Tue Apr 24 2012 - 17:36:11 EST


On Tue, Apr 24, 2012 at 2:32 PM, Peter Zijlstra <a.p.zijlstra@xxxxxxxxx> wrote:
> On Tue, 2012-04-24 at 14:15 -0700, Andy Lutomirski wrote:
>> > The second two implement a few u128 operations so we can do 128-bit math. I
>> > know a few people will die a little inside, but having nanosecond granularity
>> > time accounting leads to very big numbers very quickly, and when you need to
>> > multiply them, 64 bits really isn't that much.
>>
>> I played with some of this stuff a while ago, and for timekeeping, it
>> seemed like a 64x32->96 bit multiply followed by a right shift was
>> enough, and that operation is a lot faster on 32-bit architectures than
>> a full 64x64->128 multiply.
>
> The SCHED_DEADLINE use case is not that, it multiplies two time
> intervals. Basically it needs to evaluate if a task activation still
> fits in the old period or if it needs to shift the deadline and start a
> new period.
>
> It needs to do: runtime / (deadline - t) < budget / period
> which transforms into: runtime * period < budget * (deadline - t)
>
> hence the 64x64->128 mult and 128 compare.

Fair enough.
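
For reference, a minimal userspace sketch of that kind of comparison,
assuming gcc's unsigned __int128 is available. With both denominators
positive, runtime / (deadline - t) < budget / period is equivalent to
runtime * period < budget * (deadline - t), and each side needs a
64x64->128 product. The name dl_ratio_less and its parameters are made
up for illustration; this is not the code from the patch set.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef unsigned __int128 u128;

/*
 * Hypothetical helper: returns true when
 *   runtime / delta < budget / period
 * where delta stands for (deadline - t). The test is done by
 * cross-multiplying, so no division and no precision loss.
 */
static bool dl_ratio_less(uint64_t runtime, uint64_t delta,
                          uint64_t budget, uint64_t period)
{
	/* Cross-multiplication is only valid for positive denominators. */
	assert(delta > 0 && period > 0);

	u128 lhs = (u128)runtime * period;	/* 64x64->128 */
	u128 rhs = (u128)budget * delta;	/* 64x64->128 */

	return lhs < rhs;			/* 128-bit compare */
}
```

With nanosecond values around 2^40 on each side, the products are
around 2^80, so doing the same thing in plain 64-bit arithmetic would
silently wrap and could give the wrong answer.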

>
>> Something like:
>>
>> uint64_t mul_64_32_shift(uint64_t a, uint32_t mult, uint32_t shift)
>> {
>>   return (uint64_t)( ((__uint128_t)a * (__uint128_t)mult) >> shift );
>> }
>
> That looks a lot like what we grew mult_frac() for; it does:
>
> /*
>  * Multiplies an integer by a fraction, while avoiding unnecessary
>  * overflow or loss of precision.
>  */
> #define mult_frac(x, numer, denom)(                     \
> {                                                       \
>        typeof(x) quot = (x) / (denom);                 \
>        typeof(x) rem  = (x) % (denom);                 \
>        (quot * (numer)) + ((rem * (numer)) / (denom)); \
> }                                                       \
> )
>
>
> and is used in __cycles_2_ns() and friends.

Yeesh. That looks way slower, and IIRC __cycles_2_ns overflows every
few seconds on modern machines.
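
To make the comparison concrete, here is a small userspace sketch,
assuming gcc's __uint128_t. When the denominator is a power of two
(denom == 1 << shift) and nothing overflows, both approaches compute
floor(x * numer / denom); the difference is that the mult_frac form
pays for a divide and a modulo. mult_frac_u64 is a hypothetical
function-form restatement of the macro quoted above, not a kernel
interface.

```c
#include <stdint.h>

/* Function-form restatement of the quoted mult_frac() macro for u64. */
static uint64_t mult_frac_u64(uint64_t x, uint64_t numer, uint64_t denom)
{
	uint64_t quot = x / denom;
	uint64_t rem  = x % denom;

	/* quot * numer can still overflow for large x and numer. */
	return quot * numer + (rem * numer) / denom;
}

/* Wide multiply followed by a right shift, as proposed earlier. */
static uint64_t mul_64_32_shift(uint64_t a, uint32_t mult, uint32_t shift)
{
	return (uint64_t)(((__uint128_t)a * (__uint128_t)mult) >> shift);
}
```

For example, mult_frac_u64(1000003, 7, 8) and
mul_64_32_shift(1000003, 7, 3) both give 875002.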

gcc 4.6 generates this code:

mul_64_32_shift:
	pushq	%rbp
	movq	%rsp, %rbp
	movl	%edx, %ecx
	movl	%esi, %eax
	mulq	%rdi
	movq	%rdx, %rsi
	shrq	%cl, %rsi
	shrdq	%cl, %rdx, %rax
	testb	$64, %cl
	cmovneq	%rsi, %rax
	popq	%rbp
	ret

which is a bit dumb if you can make assumptions about the shift: the
testb $64 / cmovneq pair is only there to handle shift counts of 64 or
more. See http://gcc.gnu.org/PR46514. Some use cases might be able to
guarantee that the shift is less than 32 bits, in which case
hand-written assembly would be a few cycles faster.
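
This is not the hand-written assembly, but a portable C sketch of the
same idea for the case where the caller guarantees shift < 32. It uses
only 32x32->64 multiplies and 64-bit shifts, so it needs no 128-bit
type and should map reasonably onto 32-bit architectures. The name
mul_64_32_shift_small is made up for illustration, and the assert
stands in for whatever guarantee the caller actually has.

```c
#include <assert.h>
#include <stdint.h>

/*
 * 64x32->96 bit multiply followed by a right shift of fewer than
 * 32 bits, truncated to 64 bits.
 */
static uint64_t mul_64_32_shift_small(uint64_t a, uint32_t mult,
                                      uint32_t shift)
{
	assert(shift < 32);

	/* Two 32x32->64 partial products. */
	uint64_t lo = (uint64_t)(uint32_t)a * mult;	/* weight 2^0  */
	uint64_t hi = (a >> 32) * mult;			/* weight 2^32 */

	/* Fold the carry out of lo into hi; this cannot overflow 64 bits. */
	hi += lo >> 32;
	lo &= 0xffffffffu;

	/*
	 * The product is now hi * 2^32 + lo with lo < 2^32. Shifting
	 * right by shift < 32 leaves the two parts non-overlapping.
	 */
	return (hi << (32 - shift)) | (lo >> shift);
}
```

For example, mul_64_32_shift_small(1000003, 7, 3) returns 875002, the
same value the __uint128_t version produces.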

--Andy