Re: [RFC][PATCH 0/3] gcc work-around and math128

From: Andy Lutomirski
Date: Tue Apr 24 2012 - 17:36:11 EST

Next message: Glauber Costa: "Re: [PATCH 17/23] kmem controller charge/uncharge infrastructure"
Previous message: Greg KH: "Re: linux-next: allyesconfig build failure in drivers/usb/gadget"
In reply to: Peter Zijlstra: "Re: [RFC][PATCH 0/3] gcc work-around and math128"
Next in thread: Peter Zijlstra: "Re: [RFC][PATCH 0/3] gcc work-around and math128"

On Tue, Apr 24, 2012 at 2:32 PM, Peter Zijlstra <a.p.zijlstra@xxxxxxxxx> wrote:
> On Tue, 2012-04-24 at 14:15 -0700, Andy Lutomirski wrote:
>> > The second two implement a few u128 operations so we can do 128bit math.. I
>> > know a few people will die a little inside, but having nanosecond granularity
>> > time accounting leads to very big numbers very quickly and when you need to
>> > multiply them 64bit really isn't that much.
>>
>> I played with some of this stuff a while ago, and for timekeeping, it
>> seemed like a 64x32->96 bit multiply followed by a right shift was
>> enough, and that operation is a lot faster on 32-bit architectures than
>> a full 64x64->128 multiply.
>
> The SCHED_DEADLINE use case is not that, it multiplies two time
> intervals. Basically it needs to evaluate if a task activation still
> fits in the old period or if it needs to shift the deadline and start a
> new period.
>
> It needs to do: runtime / (deadline - t) < budget / period
> which transforms into: runtime * period < budget * (deadline - t)
>
> hence the 64x64->128 mult and 128 compare.

Fair enough.
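
For reference, a minimal sketch of that kind of check in plain C. This is
not the SCHED_DEADLINE code: fits_in_period() is a made-up name, and it
assumes deadline > t and period > 0 so that cross-multiplying
runtime / (deadline - t) < budget / period keeps the direction of the
inequality. It also relies on the gcc/clang __uint128_t extension.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: true when
 *   runtime / (deadline - t) < budget / period
 * evaluated without division as
 *   runtime * period < budget * (deadline - t).
 * Caller must guarantee deadline > t and period > 0.
 */
static bool fits_in_period(uint64_t runtime, uint64_t deadline, uint64_t t,
			   uint64_t budget, uint64_t period)
{
	/* Each product can need up to 128 bits, hence the widening casts. */
	__uint128_t lhs = (__uint128_t)runtime * period;
	__uint128_t rhs = (__uint128_t)budget * (deadline - t);

	return lhs < rhs;
}
```

With runtime = period = 1 << 32 and budget = deadline - t = 1, the left
product is 2^64, which would wrap to 0 in 64-bit arithmetic and make the
comparison come out the wrong way.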

>
>> Something like:
>>
>> uint64_t mul_64_32_shift(uint64_t a, uint32_t mult, uint32_t shift)
>> {
>> 	return (uint64_t)(((__uint128_t)a * (__uint128_t)mult) >> shift);
>> }
>
> That looks a lot like what we grew mult_frac() for, it does:
>
> /*
>  * Multiplies an integer by a fraction, while avoiding unnecessary
>  * overflow or loss of precision.
>  */
> #define mult_frac(x, numer, denom)( \
> { \
> 	typeof(x) quot = (x) / (denom); \
> 	typeof(x) rem = (x) % (denom); \
> 	(quot * (numer)) + ((rem * (numer)) / (denom)); \
> } \
> )
>
>
> and is used in __cycles_2_ns() and friends.

Yeesh. That looks way slower, and IIRC __cycles_2_ns overflows every
few seconds on modern machines.
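
To make the cost concrete, here is roughly what mult_frac() does when
written out as a plain function for uint64_t. This is only a sketch:
mult_frac_u64() is a made-up name, not a kernel function. It does two
divisions by denom per call, where mul_64_32_shift does one multiply and
one shift.

```c
#include <stdint.h>

/*
 * Hypothetical function form of the mult_frac() macro for uint64_t.
 * Computes x * numer / denom without forming x * numer directly.
 * Caller must guarantee denom != 0.
 */
static uint64_t mult_frac_u64(uint64_t x, uint64_t numer, uint64_t denom)
{
	uint64_t quot = x / denom;	/* whole multiples of denom in x */
	uint64_t rem = x % denom;	/* leftover, always < denom */

	/*
	 * rem * numer can still overflow if numer and denom are both
	 * large, since rem is only bounded by denom.
	 */
	return quot * numer + (rem * numer) / denom;
}
```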

For mul_64_32_shift, gcc 4.6 generates this code:

mul_64_32_shift:
	pushq	%rbp
	movq	%rsp, %rbp
	movl	%edx, %ecx
	movl	%esi, %eax
	mulq	%rdi
	movq	%rdx, %rsi
	shrq	%cl, %rsi
	shrdq	%cl, %rdx, %rax
	testb	$64, %cl
	cmovneq	%rsi, %rax
	popq	%rbp
	ret

which is a bit dumb if you can make assumptions about the shift: the
shrq/testb/cmovneq sequence is only there to handle shifts of 64 or
more. See http://gcc.gnu.org/PR46514. Some use cases might be able to
guarantee that the shift is less than 32, in which case hand-written
assembly would be a few cycles faster.
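
As an illustration of the restricted case (portable C, not the
hand-written assembly), something like the following should do when the
caller can guarantee shift <= 32. The name mul_64_32_shift_small is made
up, and this is a sketch rather than tested kernel code. It uses only two
32x32->64 multiplies, which is the shape a 32-bit architecture wants.

```c
#include <stdint.h>

/*
 * Hypothetical variant of mul_64_32_shift for 0 <= shift <= 32.
 * Returns the low 64 bits of (a * mult) >> shift without any
 * 128-bit arithmetic.
 */
static uint64_t mul_64_32_shift_small(uint64_t a, uint32_t mult,
				      uint32_t shift)
{
	/* a * mult == hi * 2^32 + lo, with hi and lo each up to 64 bits. */
	uint64_t lo = (uint64_t)(uint32_t)a * mult;
	uint64_t hi = (a >> 32) * mult;

	/*
	 * hi * 2^32 has no bits below bit 32, so for shift <= 32 the
	 * shift can be applied to each partial product separately.
	 * The addition wraps mod 2^64, matching the truncating cast
	 * in the __uint128_t version.
	 */
	return (hi << (32 - shift)) + (lo >> shift);
}
```

For larger shifts the bits shifted out of the high partial product would
have to be carried into the low one, so this does not generalize as
written.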

--Andy
