Re: bypass theory: nice try, but no cigar.

David Mosberger-Tang (davidm@AZStarNet.com)
Fri, 1 Sep 1995 09:52:04 -0700


>>>>> On Fri, 1 Sep 95 17:46:42 MET DST, "Neal." <crook@rdgeng.enet.dec.com> said:

Neal> Well, I thought this sounded fishy. My recollection was that a
Neal> miss would cause the write buffer to *flush* (empty
Neal> completely) then honour the read miss from external
Neal> memory/Bcache. I checked the hardware reference manual and it
Neal> says:

Well, I did say it was a guess... :)

Neal> So, the question remains: just exactly *what is* going on?

I tried to track it down a little further and measured three simpler
cases:

(a) stq/ldt alone
(b) a pair of stq/ldt, back to back
(c) same as (b), but the pairs are isolated by a "copy-sign"
(cpys) instruction

Here are the numbers as obtained on a Cabriolet (units are CPU cycles;
each case was run 10 times; the first execution time for each case is
higher due to cache effects):

(a): 13 5 5 5 5 5 5 5 5 5
(b): 73 53 51 51 51 51 51 51 51 41
(c): 65 15 15 15 15 15 15 15 15 15

For my Noname, the numbers are the same, except that the time for (b)
goes up to 68 cycles.
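
To make the comparison concrete, here is a minimal sketch that takes the Cabriolet numbers from the table, drops the first (cold-cache) run of each case, and reports the most frequent remaining value as the steady-state cost. The helper names (steady_state, print_summary) are hypothetical and not part of the test program; picking the most frequent value is just one reasonable way to summarize such a short series.

```c
#include <assert.h>
#include <stdio.h>

#define RUNS 10

/* Cycle counts copied from the Cabriolet table. */
static const unsigned int case_a[RUNS] = {13, 5, 5, 5, 5, 5, 5, 5, 5, 5};
static const unsigned int case_b[RUNS] = {73, 53, 51, 51, 51, 51, 51, 51, 51, 41};
static const unsigned int case_c[RUNS] = {65, 15, 15, 15, 15, 15, 15, 15, 15, 15};

/*
 * Hypothetical helper: most frequent value among runs 1..n-1.
 * Run 0 is skipped because it includes cache warm-up.
 * On a tie, the value that appears first wins.
 */
static unsigned int steady_state(const unsigned int *t, int n)
{
    unsigned int best;
    int best_count = 0;
    int i, j;

    assert(n >= 2);
    best = t[1];
    for (i = 1; i < n; ++i) {
        int count = 0;
        for (j = 1; j < n; ++j)
            if (t[j] == t[i])
                ++count;
        if (count > best_count) {
            best_count = count;
            best = t[i];
        }
    }
    return best;
}

/* Print the steady-state cost of each case and the extra cost over (a). */
void print_summary(void)
{
    unsigned int a = steady_state(case_a, RUNS);
    unsigned int b = steady_state(case_b, RUNS);
    unsigned int c = steady_state(case_c, RUNS);

    printf("steady state: (a) %u  (b) %u  (c) %u\n", a, b, c);
    printf("extra over (a): (b) %u  (c) %u\n", b - a, c - a);
}
```

Calling print_summary() from a main() shows that the second stq/ldt pair adds 46 cycles in case (b) but only 10 cycles in case (c).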

So, how in the world can this be? Comparing the Cabriolet and Noname
results, it seems that cases (a) and (c) are independent of the
b-cache system. Is it really possible to flush the write-buffers,
read a cache-line into the d-cache and return the desired quadword in
5 cycles? Not as far as I know. Just to be sure, here is the exact
assembly code for the three cases:

(a)
0000000120000314 <main+44> 6020c000 18 rpcc t0
0000000120000318 <main+48> 403f0003 10 addl t0, zero, t2
000000012000031c <main+4c> b45e0048 2d stq t1, 72(sp)
0000000120000320 <main+50> 8c3e0048 23 ldt f1, 72(sp)
0000000120000324 <main+54> 6020c000 18 rpcc t0

(b)
00000001200003b4 <main+e4> 6020c000 18 rpcc t0
00000001200003b8 <main+e8> 403f0003 10 addl t0, zero, t2
00000001200003bc <main+ec> b45e0048 2d stq t1, 72(sp)
00000001200003c0 <main+f0> 8c3e0048 23 ldt f1, 72(sp)
00000001200003c4 <main+f4> b45e0050 2d stq t1, 80(sp)
00000001200003c8 <main+f8> 8c3e0050 23 ldt f1, 80(sp)
00000001200003cc <main+fc> 6020c000 18 rpcc t0

(c)
0000000120000474 <main+1a4> 6020c000 18 rpcc t0
0000000120000478 <main+1a8> 403f0003 10 addl t0, zero, t2
000000012000047c <main+1ac> b45e0048 2d stq t1, 72(sp)
0000000120000480 <main+1b0> 8c3e0048 23 ldt f1, 72(sp)
0000000120000484 <main+1b4> 5fe1041f 17 cpys f31, f1, f31
0000000120000488 <main+1b8> b45e0050 2d stq t1, 80(sp)
000000012000048c <main+1bc> 8c3e0050 23 ldt f1, 80(sp)
0000000120000490 <main+1c0> 6020c000 18 rpcc t0

Notice: in (c), it is important that the "cpys" instruction has a
data-dependency on the previous instruction. E.g., if I replace it
with "cpys f31,f31,f31", the execution time jumps back up to around 50
cycles.

Obviously, I must be doing something wrong. So, I appended the test
program to this mail. It should compile with any gcc, both under
OSF/1 and Linux. If somebody could enlighten me, I'd greatly
appreciate it.

--david

---
#include <stdio.h>

static inline unsigned int
read_itimer(void)
{
    unsigned long r;

    /* read the process cycle counter */
    asm volatile("rpcc %0" : "=r"(r) :: "memory");
    return r;    /* return lower 32 bits */
}

int
main(int argc, char **argv)
{
    long l, t1, t2;
    double d;
    int i;
    unsigned int start, stop;
    unsigned int times[10];

    /* case (a): one stq/ldt pair */
    printf("ireg->fpreg:\t");
    for (i = 0; i < 10; ++i) {
        asm volatile ("bis $31,$31,%0" : "=r"(l));
        start = read_itimer();
        asm volatile ("stq %2,%0\n\t"
                      "ldt %1,%0"
                      : "=m"(t1), "=f"(d) : "r"(l));
        stop = read_itimer();
        asm volatile ("# %0" :: "f"(d));    /* keep d live */
        times[i] = stop - start;
    }
    for (i = 0; i < 10; ++i) {
        printf(" %u", times[i]);
    }
    printf("\n");

    /* case (b): two stq/ldt pairs, back to back */
    printf("ireg->fpreg;ireg->fpreg:\t");
    for (i = 0; i < 10; ++i) {
        asm volatile ("bis $31,$31,%0" : "=r"(l));
        start = read_itimer();
        asm volatile ("stq %3,%0\n\t"
                      "ldt %2,%0\n\t"
                      "stq %3,%1\n\t"
                      "ldt %2,%1"
                      : "=m"(t1), "=m"(t2), "=f"(d) : "r"(l));
        stop = read_itimer();
        asm volatile ("# %0" :: "f"(d));
        times[i] = stop - start;
    }
    for (i = 0; i < 10; ++i) {
        printf(" %u", times[i]);
    }
    printf("\n");

    /* case (c): two stq/ldt pairs separated by a dependent cpys */
    printf("ireg->fpreg;cvtqt;ireg->fpreg:\t");
    for (i = 0; i < 10; ++i) {
        asm volatile ("bis $31,$31,%0" : "=r"(l));
        start = read_itimer();
        asm volatile ("stq %3,%0\n\t"
                      "ldt %2,%0\n\t"
                      "cpys $f31,%2,$f31\n\t"
                      "stq %3,%1\n\t"
                      "ldt %2,%1"
                      : "=m"(t1), "=m"(t2), "=f"(d) : "r"(l));
        stop = read_itimer();
        asm volatile ("# %0" :: "f"(d));
        times[i] = stop - start;
    }
    for (i = 0; i < 10; ++i) {
        printf(" %u", times[i]);
    }
    printf("\n");

    return 0;
}
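
A side note on the timing arithmetic: read_itimer() returns only the low 32 bits of the cycle counter, so "stop - start" relies on unsigned wraparound to stay correct when the counter rolls over between the two reads. The sketch below shows this in portable C; elapsed_cycles is a hypothetical helper, not part of the test program, and it assumes the counter wraps at most once between the two samples.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: cycles elapsed between two 32-bit counter samples.
 * Unsigned subtraction is performed modulo 2^32, so the result is still
 * correct if the counter wrapped once between start and stop.
 */
uint32_t elapsed_cycles(uint32_t start, uint32_t stop)
{
    return (uint32_t)(stop - start);
}
```

For example, with start = 0xFFFFFFF0 and stop = 0x00000005 the counter has wrapped, and the helper still returns the 21 cycles that actually elapsed.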