Re: Utilizing rdtsc on i586?

Harald Koenig (koenig@tat.physik.uni-tuebingen.de)
Wed, 30 Apr 1997 13:02:57 +0200


On Apr 29, Joe Fouche wrote:

> I made the following module to try out the Pentium cycle counter,
> but I believe it's broken. For one thing, it seems to cause random
> processes to segfault upon loading (at least, it did this before adding
> the cli()/sti() around the main code... not sure if it would do it
> anymore). For another, it returns 67 cycles
> for a for loop that [according to the assembly output] should take
> about 20. It returns around 5000 cycles if the index starts out as
> 1000. (I have a p100)

which gcc version and which optimisation options did you use?

here is what I get for the user mode program below:

gcc-2.5.8 -O0 : 10000 loops -> 76895 cycles
gcc-2.5.8 -O1 : 10000 loops -> 20038 cycles
gcc-2.5.8 -O2 : 10000 loops -> 50020 cycles

gcc-2.7.2.1 -O0 : 10000 loops -> 76921 cycles
gcc-2.7.2.1 -O1 : 10000 loops -> 20050 cycles

I'm using a Pentium OverDrive PODP-83.
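Dividing by the loop count makes the pattern easier to see: roughly 7.7 cycles per iteration at -O0, 2.0 at -O1 and 5.0 for the gcc-2.5.8 -O2 loop. Here is a small C sketch of that arithmetic, using only the numbers above. The struct and function names are invented for illustration, and the figures include the few setup instructions between the two counter reads.

```c
#include <assert.h>
#include <stdio.h>

/* One row of the measurement table: compiler, flags, loop count, cycles. */
struct measurement {
    const char *compiler;
    const char *flags;
    unsigned long loops;
    unsigned long cycles;
};

static const struct measurement results[] = {
    { "gcc-2.5.8",   "-O0", 10000, 76895 },
    { "gcc-2.5.8",   "-O1", 10000, 20038 },
    { "gcc-2.5.8",   "-O2", 10000, 50020 },
    { "gcc-2.7.2.1", "-O0", 10000, 76921 },
    { "gcc-2.7.2.1", "-O1", 10000, 20050 },
};

/* Average cycles per loop iteration (setup overhead is spread over all loops). */
static double cycles_per_iteration(const struct measurement *m)
{
    assert(m->loops > 0);
    return (double)m->cycles / (double)m->loops;
}

/* Print the table with a cycles-per-iteration column. */
static void print_table(void)
{
    size_t i;

    for (i = 0; i < sizeof results / sizeof results[0]; i++)
        printf("%-12s %s : %.2f cycles/iteration\n",
               results[i].compiler, results[i].flags,
               cycles_per_iteration(&results[i]));
}
```

The interesting row is -O2: its two-instruction loop costs about 5 cycles per iteration, while the three-instruction -O1 loop costs about 2.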

> Hope there's a kind soul out there willing to point out my mistakes,
> or why this has anomalous output.

no mistake, just Intel's broken(?) branch prediction on the Pentium.
it's the same as the "bogomips problem" on the Pentium; try the following
patch to "boost" your bogomips:

--- linux/include/asm-i386/delay.h~ Thu Jan 18 23:50:24 1996
+++ linux/include/asm-i386/delay.h Wed Apr 30 12:54:36 1997
@@ -14,7 +14,7 @@
extern __inline__ void __delay(int loops)
{
__asm__ __volatile__(
- ".align 2,0x90\n1:\tdecl %0\n\tjns 1b"
+ ".align 2,0x90\n1:\tdecl %0\n\tnop\n\tjns 1b"
:/* no outputs */
:"a" (loops)
:"ax");
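To compare the two __delay() loop bodies without rebuilding a kernel, something along these lines should work from user mode. It is a sketch that assumes GCC or clang on x86, where <x86intrin.h> provides `__rdtsc()`. The function names are made up. It uses a "+r" read/write operand instead of the kernel's "a" input plus "ax" clobber, because gcc's inline-asm rules do not allow a clobber to overlap an input operand. On a modern out-of-order CPU, whose TSC usually ticks at a constant rate rather than the core clock, the P5 effect is unlikely to show up.

```c
#include <assert.h>
#include <stdint.h>
#include <x86intrin.h>   /* __rdtsc(), provided by GCC and clang on x86 */

/* Unpatched __delay() body: a two-instruction decl/jns loop. */
static uint64_t time_delay_plain(int loops)
{
    uint64_t t0, t1;

    assert(loops >= 0);
    t0 = __rdtsc();
    __asm__ __volatile__(
        ".p2align 2,0x90\n"
        "1:\tdecl %0\n\t"
        "jns 1b"
        : "+r" (loops)     /* loops is read and decremented in place */
        :
        : "cc");           /* decl changes the flags */
    t1 = __rdtsc();
    return t1 - t0;
}

/* Patched __delay() body: the same loop with a NOP between decl and jns. */
static uint64_t time_delay_nop(int loops)
{
    uint64_t t0, t1;

    assert(loops >= 0);
    t0 = __rdtsc();
    __asm__ __volatile__(
        ".p2align 2,0x90\n"
        "1:\tdecl %0\n\t"
        "nop\n\t"
        "jns 1b"
        : "+r" (loops)
        :
        : "cc");
    t1 = __rdtsc();
    return t1 - t0;
}
```

Calling both functions with the same count (say 100000) and comparing the returned tick counts shows whether the extra NOP helps on a given CPU.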

> #ifdef MODULE
>
> int init_module(void)

it's not necessary to run this in the kernel; user mode works too:

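#include <stdio.h>

int main(void)
{
unsigned int cnt3, cnt2, cnt1;
int i;
printf("cc: started... ");
/* 0x0f 0x31 is RDTSC: low 32 bits of the counter in EAX, high 32 bits in EDX */
__asm__ __volatile__(".byte 0x0f,0x31" : "=a" (cnt1), "=d" (cnt3));
/* put code to time here */
for(i=0;i<10000;i++);
/* end code to time here */
__asm__ __volatile__(".byte 0x0f,0x31" : "=a" (cnt2), "=d" (cnt3));
/* only the low 32 bits are compared; fine for intervals under 2^32 cycles */
printf(" operation took %u cycles\n", cnt2-cnt1);
return 0;
}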
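The program above keeps only EAX, the low 32 bits of the counter, and throws EDX away. That is harmless for short intervals, because the unsigned subtraction still gives the right answer as long as fewer than 2^32 cycles pass (about 43 seconds at 100 MHz). Reading the full 64-bit value is easy, though. Here is a sketch assuming GCC-style inline asm on x86. read_tsc64() and time_empty_loop() are hypothetical names, and the loop counter is made volatile because a modern gcc at -O1 or above would otherwise delete the empty loop. That changes the code being timed compared with the listings below.

```c
#include <assert.h>
#include <stdint.h>

/* Read the 64-bit time-stamp counter: RDTSC (0x0f 0x31) puts the low
 * 32 bits in EAX and the high 32 bits in EDX. */
static inline uint64_t read_tsc64(void)
{
    uint32_t lo, hi;

    /* __volatile__ keeps gcc from merging or hoisting the two reads */
    __asm__ __volatile__(".byte 0x0f,0x31" : "=a" (lo), "=d" (hi));
    return ((uint64_t)hi << 32) | lo;
}

/* Time an empty counting loop of n iterations, in TSC ticks. */
static uint64_t time_empty_loop(int n)
{
    volatile int i;   /* volatile so the compiler cannot remove the loop */
    uint64_t t0, t1;

    t0 = read_tsc64();
    for (i = 0; i < n; i++)
        ;
    t1 = read_tsc64();
    return t1 - t0;
}
```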

and here is the disassembled loop (using `disas main' in gdb) for those
binaries; `(bad)' is the RDTSC timer opcode (0x0f 0x31), which this gdb
cannot decode:

gcc-2.7.2 -O0 10000 loops -> 76921 cycles
0x8048487 <main+19>: (bad)
0x8048489 <main+21>: movl %eax,0xfffffff4(%ebp)
0x804848c <main+24>: movl %edx,0xfffffffc(%ebp)
0x804848f <main+27>: movl $0x0,0xfffffff0(%ebp)

0x8048496 <main+34>: cmpl $0x270f,0xfffffff0(%ebp)
0x804849d <main+41>: jle 0x80484a4 <main+48>
0x804849f <main+43>: jmp 0x80484ac <main+56>
0x80484a1 <main+45>: leal 0x0(%esi),%esi
0x80484a4 <main+48>: incl 0xfffffff0(%ebp)
0x80484a7 <main+51>: jmp 0x8048496 <main+34>
0x80484a9 <main+53>: leal 0x0(%esi),%esi
0x80484ac <main+56>: (bad)

gcc-2.7.2 -O1 10000 loops -> 20050 cycles
0x8048481 <main+13>: (bad)
0x8048483 <main+15>: movl %eax,%ecx
0x8048485 <main+17>: xorl %edx,%edx
0x8048487 <main+19>: addl $0x4,%esp
0x804848a <main+22>: leal (%esi),%esi

0x804848c <main+24>: incl %edx
0x804848d <main+25>: cmpl $0x270f,%edx
0x8048493 <main+31>: jle 0x804848c <main+24>
0x8048495 <main+33>: (bad)

gcc-2.5.8 -O0 10000 loops -> 76895 cycles
0x10c0 <main+24>: (bad)
0x10c2 <main+26>: movl %eax,0xfffffff4(%ebp)
0x10c5 <main+29>: movl %edx,0xfffffffc(%ebp)
0x10c8 <main+32>: movl $0x0,0xfffffff0(%ebp)

0x10cf <main+39>: cmpl $0x270f,0xfffffff0(%ebp)
0x10d6 <main+46>: jg 0x10e8 <main+64>
0x10d8 <main+48>: incl 0xfffffff0(%ebp)
0x10db <main+51>: jmp 0x10cf <main+39>
0x10dd <main+53>: nop
0x10de <main+54>: nop
0x10df <main+55>: nop
0x10e0 <main+56>: nop
0x10e1 <main+57>: nop
0x10e2 <main+58>: nop
0x10e3 <main+59>: nop
0x10e4 <main+60>: nop
0x10e5 <main+61>: nop
0x10e6 <main+62>: nop
0x10e7 <main+63>: nop
0x10e8 <main+64>: (bad)

gcc-2.5.8 -O1 10000 loops -> 20038 cycles
0x10ba <main+18>: (bad)
0x10bc <main+20>: movl %eax,%ecx
0x10be <main+22>: xorl %edx,%edx
0x10c0 <main+24>: addl $0x4,%esp
0x10c3 <main+27>: nop

0x10c4 <main+28>: incl %edx
0x10c5 <main+29>: cmpl $0x270f,%edx
0x10cb <main+35>: jle 0x10c4 <main+28>
0x10cd <main+37>: (bad)

gcc-2.5.8 -O2 10000 loops -> 50020 cycles
0x10ba <main+18>: (bad)
0x10bc <main+20>: movl %eax,%ecx
0x10be <main+22>: addl $0x4,%esp
0x10c1 <main+25>: movl $0x270f,%edx
0x10c6 <main+30>: nop
0x10c7 <main+31>: nop

0x10c8 <main+32>: decl %edx
0x10c9 <main+33>: jns 0x10c8 <main+32>
0x10cb <main+35>: (bad)

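note that this two-instruction loop (the same as the udelay/bogomips loop)
is no good on the Pentium: it takes about 5 cycles per iteration, while the
three-instruction -O1 loop takes about 2. inserting a NOP would fix this...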

Harald

--
All SCSI disks will from now on                     ___       _____
be required to send an email notice                0--,|    /OOOOOOO\
24 hours prior to complete hardware failure!      <_/  /  /OOOOOOOOOOO\
                                                    \  \/OOOOOOOOOOOOOOO\
                                                      \ OOOOOOOOOOOOOOOOO|//
Harald Koenig,                                         \/\/\/\/\/\/\/\/\/
Inst.f.Theoret.Astrophysik                              //  /     \\  \
koenig@tat.physik.uni-tuebingen.de                     ^^^^^       ^^^^^