Re: Utilizing rdtsc on i586?

Gabriel Paubert (
Fri, 2 May 1997 12:39:01 +0200 (METDST)

On Tue, 29 Apr 1997, Joe Fouche wrote:

> I made the following module to try out the Pentium cycle counter,
> but I believe it's broken. For one thing, it seems to cause random
> processes to segfault upon loading (at least, it did this before adding
> the cli()/sti() around the main code... not sure if it would do it
> anymore). For another, it returns 67 cycles
> for a for loop that [according to the assembly output] should take
> about 20. It returns around 5000 cycles if the index starts out as
> 1000. (I have a p100)
> Hope there's a kind soul out there willing to point out my mistakes,
> or why this has anomalous output.

This output is not anomalous, first you should take into account that RDTSC
itself takes a significant number of clock cycles. I ran your program in user
mode on a P133 after modifying it like this (note that I don't use cli/sti,
if the results are obviously too large, I simply rerun it):

#include <stdio.h>
int main()
unsigned int cnt3, cnt2[2], cnt1[2];
int i, j;
printf("cc: started... ");
for(j=0; j<2; j++){
__asm__(".byte 0x0f,0x31" : "=a" (cnt1[j]), "=d" (cnt3));
/* put code to time here */
/* end code to time here */
__asm__(".byte 0x0f,0x31" : "=a" (cnt2[j]), "=d" (cnt3));
printf(" execution took %d then %d cycles\n",
cnt2[0]-cnt1[0], cnt2[1]-cnt1[1]);
return 0;

If I comment out the loop, I get 13 clock cycles on both executions, it
should be somewhat faster at ring 0, but still take 8 clocks. That't the
bare minimum overhead between two rdtsc, with an instruction in between to
save the result of the first.

If I run the loop 10 times, I get 53/38 clocks, which are actually 40/25
when taking into account the RDTSC overhead. On the first iteration
there are significant additional penalties because the code is
being loaded into the cache and in this case processor is limited both
by bus (the time it takes to get instructions from main memmory) and decode
bandwidth. x86 instructions can be between 1 and 15 bytes long, which
makes decoding of two instructions in parallel extremely complex.
Actually on the first iteration, the decoder finds the length of the next
instruction to execute and then writes additional bits into the cache
to mark instruction boundaries, but only exceptionally decodes and executes
two instructions in parallel.

Then you have branch prediction problems: this is decribed in detail at

and the conclusion I would draw is that branch prediction only settles
correctly at the 3rd iteration of the loop. Why:
- on first iteration, branch will go to U pipe, misprediction
- on second it will go to V, misprediction but BTB correctly written
- on third prediction will finally be correct
- on last iteration, unavoidable misprediction

that's a 30 percent misprediction rate for 10 iterations, which
account for 11 cycles. On the second excution, there will be only
the unavoidable 4 clocks misprediction when exiting the loop. Adding the
loop initialization accounts for the 25 cycles.

Note that if I change the loop count to 20, I get 73/58, exactly 20 more
than with 10 on both executions, which shows that additional iterations take
the expected two clocks.

Conclusion, that your processor works as expected! And that innermost loops
which are executed between 3 and 10 times are very bad for branch prediction.