As a preliminary step to implementing fast (and free) integer division
routines, I made some performance measurements. If you know the Alpha
architecture, you may know that the EV4 based CPUs have a relatively
slow integer multiplier (23 cycles max latency) and an even slower
floating-point divider (34 cycles max for single precision, 63 cycles
max for double precision).  So the tradeoff is between using the integer
multiplier or the floating point divider to implement integer
division. Neal Crook mentioned in his note on Alpha systems (see
question 17 in the Linux/Alpha FAQ at
http://www.azstarnet.com/~axplinux) that conversions between integer
and floating point registers are relatively slow because the conversion
has to be done through a store/load pair (i.e., through the memory
system). Thus, it is not entirely clear whether:
2 integer->float conversions + 1 fp divide + 1 float->integer conversion
is any faster than a division routine using the integer multiplier and
table-lookups.
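To make this concrete, the floating point approach is essentially the
following (a minimal sketch of mine, not the actual tdiv.c code; it
also ignores the operand-size issue discussed below):

/* Integer division via the FP divider: convert both operands to
   double, divide, and truncate the result back to an integer.
   This is exact only while both operands fit in the 53-bit double
   mantissa. */
static long
fp_div(long a, long b)
{
    return (long) ((double) a / (double) b);
}

For example, fp_div(914292736, 13951) yields 65536.  Note that on
Alpha the (double) casts themselves compile to a store/load pair plus
a convert instruction, which is exactly what the listings below show.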
Measurements yield the following (all times in CPU cycles with warm
cache):
                    (a)         (b)          (c)
operation:       int mult:   float div:  double div:
---------------  ----------  ----------  -----------
35/5                 95         123          152
914292736/13951      160        123          152
Case (a) uses the integer division routine of OSF/1 v3.0, which is
well optimized, as far as I know. It uses table-lookups and the
integer multiplier. Case (b) first converts the integers to floats,
performs the fp single-precision division, then converts the result
back to an integer. Case (c), finally, is the same as (b) except that
it uses double precision.
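Each case was timed by reading the cycle counter (rpcc) before and
after the divide; the harness presumably looks roughly like this (a
reconstruction of mine, not the actual tdiv.c source):

/* Read the Alpha cycle counter; the count proper is in the low
   32 bits of the result. */
static unsigned long
rpcc(void)
{
    unsigned long cc;

    asm volatile ("rpcc %0" : "=r" (cc));
    return cc;
}

static long
time_div(long a, long b)
{
    unsigned long t0, t1;
    volatile long q;

    t0 = rpcc();
    q = a / b;          /* or one of the FP variants */
    t1 = rpcc();
    return (long) ((t1 - t0) & 0xffffffff);  /* counter wraps at 2^32 */
}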
The table shows that for small divides, (a) is faster than either (b)
or (c).  When dividing larger numbers, things look a little better:
(b) is noticeably faster and (c) is about as slow as (a).  This,
unfortunately, isn't good enough because the floating point division
approach can be used only if the integers aren't too big (at most 53
bits, since both operands have to fit exactly in the double mantissa).
So, to be fair, one would have to add this checking overhead to (b)
and (c) (see the sketch below) and, quite likely, (a) would win again.
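Such a check might look as follows (again a sketch with names of my
own; a real routine would branch straight to the integer-multiplier
code instead of letting the compiler expand a / b a second time):

/* Use the FP divider only when both operands fit exactly in the
   53-bit double mantissa; otherwise fall back to the regular
   integer division routine.  (Alpha longs are 64 bits wide, so
   the shift below is well defined.) */
static long
checked_div(long a, long b)
{
    const long limit = 1L << 53;

    if (a > -limit && a < limit && b > -limit && b < limit)
        return (long) ((double) a / (double) b);
    return a / b;       /* table lookups + integer multiplier */
}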
Not happy with the results, I looked at the assembly code for (b) and
(c) and found the following (the rpcc instructions read the cycle
counter register):
[tdiv.c: 149] 0x120001ac4: 603fc000 rpcc t0
[tdiv.c: 151] 0x120001ac8: b71e0048 stq t10, 72(sp) **
[tdiv.c: 151] 0x120001acc: 8d7e0048 ldt $f11, 72(sp) **
[tdiv.c: 153] 0x120001ad0: b73e0048 stq t11, 72(sp)
[tdiv.c: 151] 0x120001ad4: 5beb1781 cvtqs $f11,$f1
[tdiv.c: 153] 0x120001ad8: 8d7e0048 ldt $f11, 72(sp) **
[tdiv.c: 153] 0x120001adc: 5beb178a cvtqs $f11,$f10
[tdiv.c: 155] 0x120001ae0: 582a1061 divs $f1,$f10,$f1 **
[tdiv.c: 156] 0x120001ae4: 5be105e1 cvttqc $f1,$f1
[tdiv.c: 156] 0x120001ae8: 9c3e0048 stt $f1, 72(sp)
[tdiv.c: 150] 0x120001aec: 403f0002 addl t0, zero, t1
[tdiv.c: 156] 0x120001af0: a71e0048 ldq t10, 72(sp)
[tdiv.c: 157] 0x120001af4: 603fc000 rpcc t0
I then forced gcc to generate the following code instead:
[tdiv.c: 149] 0x120001ac4: 603fc000 rpcc t0
[tdiv.c: 151] 0x120001ac8: b71e0048 stq t10, 72(sp) **
[tdiv.c: 151] 0x120001acc: 8d7e0048 ldt $f11, 72(sp) **
[tdiv.c: 150] 0x120001ad0: 403f0002 addl t0, zero, t1
[tdiv.c: 151] 0x120001ad4: 5beb1781 cvtqs $f11,$f1
[tdiv.c: 153] 0x120001ad8: b73e0048 stq t11, 72(sp) **
[tdiv.c: 153] 0x120001adc: 8d7e0048 ldt $f11, 72(sp) **
[tdiv.c: 153] 0x120001ae0: 5beb178a cvtqs $f11,$f10
[tdiv.c: 155] 0x120001ae4: 582a1061 divs $f1,$f10,$f1
[tdiv.c: 156] 0x120001ae8: 5be105e1 cvttqc $f1,$f1
[tdiv.c: 156] 0x120001aec: 9c3e0048 stt $f1, 72(sp)
[tdiv.c: 156] 0x120001af0: a71e0048 ldq t10, 72(sp)
[tdiv.c: 157] 0x120001af4: 603fc000 rpcc t0
Notice that the only important change is that both integer->float
conversions now have their store/load instruction pair back to back
(these instructions are marked by a pair of asterisks).  According to
the book, this is a Really Stupid Thing (TM) to do.  But not so
fast---look at the numbers now:
                    (a)         (b)          (c)
operation:       int mult:   float div:  double div:
---------------  ----------  ----------  -----------
35/5                 95          66           95
914292736/13951      160         66           95
Surprise, surprise---suddenly both cases (b) and (c) are 57 cycles
faster than before!  The above measurements were done on a DEC
3000/600, which has a pretty darn good memory system.  On my
Cabriolet, the difference is about 59 cycles, and on the Noname it is
89 cycles.
How could it be that a store directly followed by a load is so much
faster than a "properly scheduled" instruction sequence?  Well, the
EV4 CPU has a bypass path that allows loads to read data directly from
the write buffer (which is 4 cache blocks deep).  My guess is that in
the back-to-back sequence, the load is able to read the data straight
from the write buffer, while in the "properly scheduled" sequence the
load misses the write buffer and has to go to the bcache instead (this
is just a guess though---it is awfully hard to say exactly what's
going on inside the CPU).
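If someone wants to poke at this, here is the kind of probe I have in
mind (a sketch of mine; the volatile union keeps the store and the
load ordered, so gcc should leave the pair back to back):

/* Move an integer to the FP register file the way gcc does it:
   store the value as a quadword, then reload the same bits as a
   T-float.  With the stq/ldt pair back to back, the load should
   hit the write-buffer bypass. */
static volatile union { long q; double t; } slot;

static double
int_to_fpreg(long x)
{
    slot.q = x;         /* stq */
    return slot.t;      /* ldt: raw bits, no conversion */
}

Timing this once as-is and once with a few unrelated instructions
wedged between the store and the load should show whether the bypass
is really what makes the difference.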
(Here comes the Executive Summary:)
What are the conclusions to draw from this?
(1) Thanks to the write-buffer bypass, an stq/ldt or stt/ldq pair
scheduled back-to-back can execute at CPU speed, rather than at
bcache speed; thus, such instruction pairs should usually be kept
together.  It seems like it might even be worthwhile teaching gcc
about this.  Richard, do you have any comments?
(2) At least for the EV4 CPU, floating point division seems to be
a viable alternative to using the integer multiplier to implement
integer division (at least in the case where the mantissa is large
enough to hold the dividend and divisor). This is particularly
true in the light of the widening gap between CPU and memory speed.
Genuine memory references (such as table lookups) can be extremely
expensive when they miss in the cache (notice that all of the above
measurements were done with warm caches, which favors method (a)
considerably).
If anybody has any comments or if somebody could add some EV5
measurements, I'd greatly appreciate it (just holler if you want the
source code of the test program).
Neal, the 21064A seems to be about 39 cycles faster in performing a
double precision divide compared to a 21064 (or 21066). Is this due
to the FPU improvements you mentioned in your note?
--david