>>>>> On 10 Dec 1995 16:48:36 GMT, phys6k@menudo.uh.edu
>>>>> (Yan-Song Chen) said:
Yan-Song> I am looking for a small machine with good FP performance,
Yan-Song> I don't care the software stuff, since the only software I
Yan-Song> use is gcc. I think the intel is better for integer
Yan-Song> performance and x86 compability which means more
Yan-Song> software. But finally I find that I have to go with intel
Yan-Song> only for floating point performance. Seems the alpha does
Yan-Song> not have any advantage if your budget is less than 20k.
OK, I ran your code on a number of different machines in different
configurations. Here is the summary:
                                    with
                                    current   using
                                    Linux     OSF/1
                                    libm      libm      OS            compiler
                                    [secs]    [secs]
Alpha 21064A 275MHz (Cabriolet):    222.1     43.2      Linux 1.3.45  gcc 2.7.1
Alpha 21064A 233MHz (AS 200):                 51.2      OSF/1 v3.x    cc -migrate -O4
Alpha 21064A 233MHz (AS 200):                 48.8      OSF/1 v3.x    gcc 2.6.3
Pentium 90/100 (Intel system):      109.5               Linux 1.3.x   gcc 2.6.3
Alpha 21066 166MHz (Noname):        706.6     214.5     Linux 1.3.45  gcc 2.7.1
All machines had at least 16MB of memory, and the memory demand of
the test program is small.
First off, why are things so slow on the Alpha when using the libm
that comes with libc-linux-0.39? Well, libm hasn't been optimized at
all, and it turns out that the test program spends almost all of its
time in sqrt(), which is about as badly implemented as it could be (it
computes one bit at a time!). Just to verify this point, I
replaced it with a quick & dirty sqrt algorithm that's a little more
sane (but doesn't care to get all those inexact and lsb things right).
Just by replacing sqrt(), time on my Cabriolet dropped down from 222.1
to 79.04 seconds. How is that for an improvement? :) I don't have the
mathematical background (or the desire, for that matter) to optimize
libm, but it sure looks like there is a lot to be gained by spending a
little time on it (hint, hint... :)
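
For the curious, here is a minimal sketch of the kind of "quick &
dirty" sqrt meant above. It is not the routine I actually dropped in,
and the name quick_sqrt is made up for illustration. It assumes
Newton's iteration on a range-reduced argument and, like the
replacement described above, makes no attempt to get the inexact flag
or the last bit right. The point is only that each Newton step roughly
doubles the number of correct bits, whereas a bit-at-a-time method
needs one step per result bit.

```c
#include <math.h>   /* frexp, ldexp, isinf, NAN */

/* Hypothetical helper, not a libm routine: approximate sqrt(x) with
 * Newton's iteration. Does not set the inexact flag and does not
 * guarantee a correctly rounded last bit. */
double quick_sqrt(double x)
{
    int e, i;
    double m, y;

    if (x != x || x == 0.0)     /* NaN or zero: return unchanged */
        return x;
    if (x < 0.0)                /* negative input: not a number */
        return NAN;
    if (isinf(x))               /* +infinity: return unchanged */
        return x;

    m = frexp(x, &e);           /* x = m * 2^e with 0.5 <= m < 1 */
    if (e & 1) {                /* make the exponent even ... */
        m *= 2.0;               /* ... so that now 0.5 <= m < 2 */
        e--;
    }

    y = 0.5 * (1.0 + m);        /* crude initial guess for sqrt(m) */
    for (i = 0; i < 5; i++)     /* each step about doubles the correct bits */
        y = 0.5 * (y + m / y);

    return ldexp(y, e / 2);     /* sqrt(x) = sqrt(m) * 2^(e/2) */
}
```

Five iterations is a conservative choice for the reduced range; a
better initial guess would allow fewer.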
Second, the Noname is doing really quite badly on this test program.
With the OSF/1 libm, it's almost 5 times slower than the Cabriolet
that I have. The clock rate accounts only for a factor of 1.65. So
there still is a factor of 3 slowdown that needs to be explained. I'm
not certain what it is but the
Cabriolet has both bigger primary caches (16KB vs. 8KB) and the
secondary cache is also a lot bigger (2MB vs 256KB). So my *guess* is
that for some reason the CPU simply doesn't get to execute because it
spends all the time waiting for memory accesses to complete. For
example, it may be that the program's access pattern is such that with
the given cache sizes there are a lot of conflict misses. Therefore,
it would be interesting to see how the Noname would do on a machine
with a 1MB cache and/or with a 233MHz CPU (that CPU also has the 16KB
primary caches). Any takers?
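
The arithmetic behind those factors can be spelled out as a small
sketch. The function names are made up for illustration; the numbers
come from the summary table. With the OSF/1 libm times, 214.5/43.2 is
about 4.97, the clock ratio 275/166 is about 1.66, and the remainder
is about 3.0. With the Linux libm times the gap is smaller (706.6/222.1
is about 3.2), presumably because both machines then spend most of
their time in the slow sqrt().

```c
/* How many times slower the slow machine is than the fast one. */
double slowdown(double slow_secs, double fast_secs)
{
    return slow_secs / fast_secs;
}

/* The part of the slowdown that the clock-rate ratio does not explain. */
double unexplained_factor(double slow_secs, double fast_secs,
                          double fast_mhz, double slow_mhz)
{
    return slowdown(slow_secs, fast_secs) / (fast_mhz / slow_mhz);
}
```

For example, unexplained_factor(214.5, 43.2, 275.0, 166.0) gives the
factor of roughly 3 for Noname vs. Cabriolet.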
The lessons I take from this little experiment:
(a) your mileage may vary---measure what you care about
(b) for this particular program, gcc actually does a
better job than cc -O4 -migrate (quite a surprise!)
(c) there is a lot of room for optimization in libm (and
libc probably as well---don't hesitate to contribute)
(d) Linux doesn't seem to get in the way of the test program,
as one would expect/hope.
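
As for (a), measuring is cheap. Below is a minimal sketch of a timing
helper, assuming CPU time from the standard clock() call is good
enough for the purpose. The names time_calls and sample_work are made
up for illustration; this is not the harness used for the numbers
above.

```c
#include <time.h>   /* clock, clock_t, CLOCKS_PER_SEC */

/* Hypothetical stand-in workload; substitute the routine you care about. */
double sample_work(double x)
{
    return 0.5 * x + 1.0;
}

/* Call fn(x) n times and return the CPU time used, in seconds.
 * The results are summed into a volatile so the loop is not optimized
 * away entirely; an optimizer may still hoist or inline a trivial fn. */
double time_calls(double (*fn)(double), double x, long n)
{
    volatile double sink = 0.0;
    clock_t start, stop;
    long i;

    start = clock();
    for (i = 0; i < n; i++)
        sink += fn(x);
    stop = clock();

    return (double)(stop - start) / CLOCKS_PER_SEC;
}
```

Pass your own function in place of sample_work and pick n large enough
that the run takes at least a few seconds, since clock() is fairly
coarse on many systems.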
Enjoy,
--david