Re: alpha slower than pentium II

Yan-Song Chen (YChen@UH.EDU)
Sat, 23 Dec 1995 14:25:46 -0600 (CST)


> Yan-Song Chen: "Re: alpha slower than pentium II" (Dec 21, 13:21):
> > > I have a little bit of mathematical background and I would suggest using
> > > a Newton algorithm to compute the sqrt of a:
> > > Just find the positive root of the function x^2-a by iterating
> > >
> > > x_{n+1} = 1/2(x_n + a/x_n)
>
> This is what I used at first, but performance is not optimal, due to the
> division. It was also non-trivial to get the correct last-bit rounding
> done.
>
> > based on the algorithm you suggested, I write another dirty sqrt function.
> > I modified your formula a little bit:
> > let r_0 = a+1, s_0 = 2
> >
> > r_{n+1} = r_n*r_n + a * b_n * b_n;
> > s_{n+1} = 2 * r_n * b_n;
> >
> > x_n = r_n/s_n;
> > Now I only need to run the division once at the end of iteration.
> > But my code still runs 30% slower than osf/1 libm
>
> Have you checked your results? Even a last-bit rounding error is a
> no-no, and the above does not look stable in that respect. I also
> suspect it overflows _very_ easily.
Hi,guys, Merry Christmas and Happy New Year! :-)
My implementation is a little bit difference, I let 1 <= a < 2, and the
algorithm can get almost correct result in at most 4 iterations. (only
the last-bit has rounding error) for the a >= 2 or a < 1, I use the
integer operation to deal with the exponential factor. Yeah, the last-bit
rounding error is still a problem, but with this algorithm, we can at
least get comparable performance with osf/1's libm.
>
> This is a simple test-program that can be used to check how well you're
> doing the square root:
>
> #include <math.h>
>
> unsigned long _a = 0x7fefffffffffffff; /* max-double bit representation */
> #define a (*(double *)&_a)
>
> int main(int argc, char ** argv)
> {
> double x, sum = 0.0;
>
> printf("%016lx\n", _a);
> printf("%g\n", a);
> a = sqrt(a);
> printf("%016lx\n", _a);
> printf("%g\n", a);
> x = 10000000.0;
> while (x < 10001000.0) {
> double error = sqrt(x);
> error = fabs(x - error*error);
> sum += error;
> x += 0.001;
> }
> printf("%g\n", sum);
> }
>
> Now, the above _should_ print out
>
> 7fefffffffffffff
> 1.79769e+308
> 5fefffffffffffff
> 1.34078e+154
> 0.000655638
>
> and if it doesn't, your sqrt() doesn't work.. (note - this is just a
> very stupid program I used to test performance and accuracy: it's by no
> means proof of anything, but it's a good way to do the first tests).
Let me test my sqrt() with your program after christmas.:) Btw, I have found
the the c compiler insert trapb instruction after every arithmetic
instruction, is that absolutely necessary? I think it will hurt the
performance of math function a lot.
>
> > Don't know how can the osf/1 libm calculat sqrt, sb. said it does
> > not need division, anybody know the algorithm?
>
> I posted this earlier, but maybe it got lost. It's not as fast as the
> OSF/1 routines, but the difference is not large, and as far as I can
> till it handles all the special cases and rounding correctly. No
> guarantees.
>
> I didn't come up with the algorithm, I just coded it..
>
> Linus
>
> ----------
> /* Copyright 1995 Linus Torvalds */
> #include <errno.h>
>
> static int T2[64]= {
> 0x1500, 0x2ef8, 0x4d67, 0x6b02, 0x87be, 0xa395, 0xbe7a, 0xd866,
> 0xf14a, 0x1091b,0x11fcd,0x13552,0x14999,0x15c98,0x16e34,0x17e5f,
> 0x18d03,0x19a01,0x1a545,0x1ae8a,0x1b5c4,0x1bb01,0x1bfde,0x1c28d,
> 0x1c2de,0x1c0db,0x1ba73,0x1b11c,0x1a4b5,0x1953d,0x18266,0x16be0,
> 0x1683e,0x179d8,0x18a4d,0x19992,0x1a789,0x1b445,0x1bf61,0x1c989,
> 0x1d16d,0x1d77b,0x1dddf,0x1e2ad,0x1e5bf,0x1e6e8,0x1e654,0x1e3cd,
> 0x1df2a,0x1d635,0x1cb16,0x1be2c,0x1ae4e,0x19bde,0x1868e,0x16e2e,
> 0x1527f,0x1334a,0x11051,0xe951, 0xbe01, 0x8e0d, 0x5924, 0x1edd
> };
>
> #define lobits(x) (((unsigned int *)&x)[0])
> #define hibits(x) (((unsigned int *)&x)[1])
>
> static inline double initial_guess(double x, unsigned int k)
> {
> double ret = 0.0;
>
> k = 0x5fe80000 - (k >> 1);
> k = k - T2[63&(k>>14)];
> hibits(ret) = k;
> return ret;
> }
>
> #define two_to_minus_5 (0.5*0.5*0.5*0.5*0.5)
> #define two_to_minus_10 (two_to_minus_5*two_to_minus_5)
> #define two_to_minus_30 (two_to_minus_10*two_to_minus_10*two_to_minus_10)
>
> /* up = nextafter(1,+Inf), dn = nextafter(1,-Inf) */
> static unsigned long __up = 0x3ff0000000000001;
> static unsigned long __dn = 0x3fefffffffffffff;
>
> #define dn (*(double *)&__dn)
> #define up (*(double *)&__up)
>
> /* Multiply with chopping rounding.. */
> static inline double choppedmul(double a, double b)
> {
> __asm__("multc %1,%2,%0"
> :"=f" (a)
> :"f" (a), "f" (b));
> return a;
> }
>
> double sqrt(double x)
> {
> unsigned long k;
> double y, z, zp, zn;
>
> *(double *)&k = x;
>
> /* Negative or NaN or Inf */
> if ((k >> 52) >= 0x7ff)
> goto special;
> y = initial_guess(x, k >> 32);
> y = y*(1.5 - 0.5*x*y*y);
> y = y*((1.5 - two_to_minus_30) - 0.5*x*y*y);
> z = x*y;
> z = z + 0.5*z*(1-z*y);
> zp = choppedmul(z,dn);
> zn = choppedmul(z,up);
> y = choppedmul(z, zp) - x;
> x = choppedmul(z, zn) - x;
> /* I can't get gcc to use fcmov's.. */
> __asm__("fcmovge %2,%3,%0"
> :"=f" (z)
> :"0" (z), "f" (y), "f" (zp));
> __asm__("fcmovlt %2,%3,%0"
> :"=f" (z)
> :"0" (z), "f" (x), "f" (zn));
> return z;
> special:
> /* throw away sign bit */
> k <<= 1;
> /* -0 */
> if (!k)
> return x;
> /* special? */
> if ((k >> 53) == 0x7ff) {
> /* NaN? */
> if (k << 11)
> return x;
> /* sqrt(+Inf) = +Inf */
> if (x > 0)
> return x;
> }
> /* -ve or -Inf */
> errno = EDOM;
> k = ~0UL;
> return *(double *)&k;
> }
> ----------
>

-- 
yansong chen
--
YChen@uh.edu