Re: Winmodem support, some performance tradeoff estimates

Oliver Xymoron (oxymoron@waste.org)
Mon, 17 Aug 1998 10:11:44 -0500 (CDT)


On Mon, 17 Aug 1998, Thomas Sailer wrote:

> Oliver Xymoron wrote:
>
> > An unrolled multiply accumulate _can_ be done in 2 clocks per argument on
> > a Pentium, however (hint: the fxch instruction can be made to take 0(!!)
> > clocks if ordered properly). I put together a signal processing app that
> > did dot products at 45 mflops on a P90 last year. But this was only if its
> > working set fit within the L1 cache.
>
> Hm? Let's see: Add throughput is 1 per cycle, Mul throughput is 1 per
> cycle, but when do you fetch the arguments from L1 cache? Or are they
> already in registers when you start your algorithm? Care to post your
> actual code?

Hmmm... (digging around for code)... Oops, I did indeed forget about the
fld's. I did remember the 45 mflops number right; I was just originally
counting each multiply and add as a flop. Or, rather, each operand as a
flop. If you're still curious, the innermost loop looks something like
this (forgive the Intel syntax):

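; (comments assume the running sum is already in st(0) on entry)
loop2:
        fld     dword ptr [eax]         ; load first operand
        fmul    dword ptr [ebx]         ; st0 = [eax]*[ebx]
        fld     dword ptr [eax+4]
        fmul    dword ptr [ebx+4]       ; st0 = [eax+4]*[ebx+4]
        fxch    st(2)                   ; bring the running sum to st0
        faddp   st(1), st               ; sum += first product
        fld     dword ptr [eax+8]
        fmul    dword ptr [ebx+8]       ; st0 = [eax+8]*[ebx+8]
        fxch    st(2)                   ; bring the second product to st0
        faddp   st(1), st               ; sum += second product
        fld     dword ptr [eax+12]
        fmul    dword ptr [ebx+12]      ; st0 = [eax+12]*[ebx+12]
        fxch    st(2)                   ; bring the third product to st0
        faddp   st(1), st               ; sum += third product
        add     eax, 16                 ; advance both pointers by four floats
        add     ebx, 16
        faddp   st(1), st               ; sum += fourth product
        dec     ecx                     ; one block of four done
        jnz     loop2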

It took a couple days of experimenting to find code that pipelined as well
as the above, which is about 4 times faster than what any of the C
compilers on hand came up with.
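
For anyone who would rather read C than x87 stack juggling, here is a rough
sketch of what the loop computes. The function names (dot_unrolled4,
clocks_per_flop) are made up for illustration and are not from the original
app. It assumes the running sum starts at zero and that ecx counts blocks of
four floats. It only mirrors the order of operations; it will not reproduce
the x87 scheduling or its extended-precision intermediates, so rounding may
differ from the assembly.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical C rendering of the assembly loop: a single-precision dot
 * product unrolled by four, with each add issued after the NEXT multiply
 * has been started, as in the fld/fmul/fxch/faddp sequence. */
float dot_unrolled4(const float *a, const float *b, size_t blocks)
{
    assert(a != NULL && b != NULL);

    float acc = 0.0f;               /* assumed initial running sum */
    for (size_t i = 0; i < blocks; i++) {
        float p0 = a[0] * b[0];
        float p1 = a[1] * b[1];
        acc += p0;                  /* add trails the multiply by one step */
        float p2 = a[2] * b[2];
        acc += p1;
        float p3 = a[3] * b[3];
        acc += p2;
        a += 4;                     /* add eax, 16 */
        b += 4;                     /* add ebx, 16 */
        acc += p3;
    }
    return acc;
}

/* Back-of-envelope check: clock rate in MHz divided by mflops gives
 * clocks per flop under whatever flop-counting convention was used. */
double clocks_per_flop(double clock_mhz, double mflops)
{
    return clock_mhz / mflops;
}
```

With the numbers in this thread, clocks_per_flop(90.0, 45.0) comes out to 2,
i.e. 2 clocks per "flop" however you count them. Unlike the dec/jnz loop,
which always runs at least once, the C version simply returns 0 for zero
blocks.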

--
 "Love the dolphins," she advised him. "Write by W.A.S.T.E.." 
