CAUGHT IN THE ACT: evidence of clock skew on Intel SMP

Colin Plumb (colin@nyx.net)
Fri, 4 Dec 1998 23:18:32 -0700 (MST)


Thanks to everyone who has been sending me SMP skew-testing results.
I have some cleaner V2 code below which tries to reduce some sources
of skew in the code by forcing both master and slave to use the
same message-sending code, but I have found a smoking gun in a
dual PPro 180 MHz that someone sent me the results from:

> -8960 9522
> -9098 9393
> -9092 9390
> -9092 9390
> -9098 9393
> -9092 9390
> -9092 9393
> -9092 9390
> -9092 9456
> -9089 9390
> -9092 9390
> -9092 9393
> -9098 9390
> -9092 9390
> -9083 9390
> -9092 9390
> -9092 9456
> -9089 9390
> -9092 9390
> -9092 9393
> -9098 9390
> -9092 9390
> -9089 9390
> -9092 9390
> -9092 9456
> min1 = -9098, min2 = 9390, skew = -18488

These two processors have their TSCs offset by 9244 cycles from each
other (half the measured skew of 18488), a difference of 51 usec at
180 MHz. That is definitely enough to screw up timekeeping.
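
The arithmetic behind those figures can be sketched as follows. This is a minimal illustration, not part of the test program: the struct and function names are hypothetical, and it assumes the message latency is the same in both directions, so that min1 is latency plus offset and min2 is latency minus offset.

```c
/* Hypothetical helper type, for illustration only. */
struct tsc_estimate {
	int offset_cycles;	/* slave TSC minus master TSC */
	int latency_cycles;	/* one-way message latency */
	double offset_usec;	/* offset converted to microseconds */
};

/*
 * min1 = latency + offset  (master -> slave direction)
 * min2 = latency - offset  (slave -> master direction)
 * Assumption: the latency is symmetric, so it cancels in min1 - min2.
 */
struct tsc_estimate estimate_tsc_offset(int min1, int min2, double cpu_mhz)
{
	struct tsc_estimate e;

	e.offset_cycles = (min1 - min2) / 2;
	e.latency_cycles = (min1 + min2) / 2;
	/* cpu_mhz is cycles per microsecond */
	e.offset_usec = e.offset_cycles / cpu_mhz;
	return e;
}
```

Calling estimate_tsc_offset(-9098, 9390, 180.0) gives an offset of -9244 cycles (about 51 usec, with the slave behind the master) and a one-way latency of 146 cycles.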

Thus, I think we need to insert some skew-removal code into the SMP
boot process. (Are any SMP wizards interested in helping?)
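
One possible shape for such skew removal is sketched below, purely as an assumption: instead of rewriting the counters, it records a per-CPU offset once and subtracts it from every raw reading. The names MAX_CPUS, tsc_offset, record_tsc_offset and corrected_tsc are hypothetical, and real boot-time code would look different.

```c
#define MAX_CPUS 2	/* hypothetical: a dual-processor box */

/* Cycles by which each CPU's TSC is ahead of CPU 0 (negative = behind). */
static long long tsc_offset[MAX_CPUS];

/*
 * Store the offset measured for one CPU.  min1 and min2 are the minimum
 * deltas printed by the test program, with CPU 0 acting as master.
 */
void record_tsc_offset(int cpu, int min1, int min2)
{
	tsc_offset[cpu] = (min1 - min2) / 2;
}

/* Map a raw TSC value read on the given CPU onto CPU 0's timeline. */
unsigned long long corrected_tsc(int cpu, unsigned long long raw)
{
	/* Unsigned arithmetic wraps, so a negative offset adds cycles. */
	return raw - (unsigned long long)tsc_offset[cpu];
}
```

With the dual PPro numbers above, record_tsc_offset(1, -9098, 9390) stores -9244, and every later reading taken on CPU 1 is advanced by 9244 cycles.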

-- 
	-Colin

/*
 * Test code - please run this on your SMP PC system.
 * Thanks to everyone who has sent me results. (colin@nyx.net)
 *
 * This measures the skew between the time stamp counters on
 * two pentium-type processors.  The purpose is to design kernel
 * timekeeping code for SMP systems.  If the TSC registers are
 * reliably in sync, then things can be simplified considerably,
 * but verifying "reliably" requires testing on a variety of systems.
 *
 * This should be run on a mostly idle system, so that the two threads
 * can run simultaneously on two processors without getting interrupted.
 * An interrupt is easy to see (it looks like a number > 1000), so you
 * can just run it and hope before shutting things down.  One or two
 * interrupts don't hurt.
 *
 * What *does* hurt the results is DMA activity.  If you see significant
 * skew (+/-10 or more) or "noisy" inconsistent data, please try unplugging
 * the network briefly.  If that doesn't fix it, please tell me a lot
 * about your system, so I can try to figure out what's happening.
 * (Motherboard, processor speed, bus speed, BIOS brand, PCI cards.)
 * /proc/cpuinfo has lots of useful data.
 *
 * NEGATIVE NUMBERS mean that there is definitely significant skew.
 * If you see any number except the final skew *ever* being negative,
 * that is a very important piece of data.
 *
 * PLEASE COMPILE -O2.  It's simple code which should be all in registers
 * except for the required memory references.  Too much memory access
 * screws up the timing.
 *
 * LIBC5 instructions: you'll need to include <linux/sched.h>
 * for the CLONE_* flags, and link -lpthread to get the clone()
 * function.
 *
 *
 * Typical results look like this:
 *        50        54
 *        46        46
 *        46        46
 *        46        76
 *        46        58
 *        46        46
 *        46        80
 *        46        46
 *        46        46
 *        46        46
 *        48        46
 *        46        46
 *        48        46
 *        46        46
 *        48        46
 *        46        46
 *        48        46
 *        46        46
 *        48        46
 *        50        50
 *        44        58
 *        46        76
 *        46        76
 *        46        84
 *        46        46
 * min1 = 44, min2 = 46, skew = -2
 *
 * The numbers might vary from 40 to 350.  The larger your bus multiplier,
 * the larger the numbers.  400 MHz systems with 66 MHz buses will produce
 * large numbers.
 */
#include <stdio.h>
#include <sched.h>	/* linux/sched.h on libc5 */
#include <unistd.h>	/* _exit() */

#ifndef __OPTIMIZE__
#error Please compile with optimization on.
#endif

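/*
 * These are defined at the end of the file to make sure they
 * don't get inlined.
 */
static unsigned send(void);
static unsigned receive(void);

/*
 * Forward declarations of the handshake flags, which master() and
 * slave() use before the padded definitions at the end of the file.
 */
static volatile int receive_ready;
static volatile int signal_sent;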

#define NSAMPLES 25

unsigned master_send[NSAMPLES], master_receive[NSAMPLES];
unsigned slave_send[NSAMPLES], slave_receive[NSAMPLES];

/* The master sends first. */
static int master(void *arg)
{
	int i;

	(void)arg;

	for (i = 0; i < NSAMPLES; i++) {
		master_send[i] = send();
		master_receive[i] = receive();
	}

	/* Wait for slave to store last datum */
	while (!receive_ready)
		;

	return 0;
}

/* The slave receives first. */
static int slave(void *arg)
{
	int i;

	(void)arg;

	for (i = 0; i < NSAMPLES; i++) {
		slave_receive[i] = receive();
		slave_send[i] = send();
	}

	/* Tell master we're done */
	receive_ready = 1;
	_exit(0);
}

int main(void)
{
	int i;
	int min1, min2;		/* signed, so negative deltas compare correctly */
	int delta1[NSAMPLES], delta2[NSAMPLES];
	/* Stack for second process */
	static char child_stack[100000];

#define CLONE_ALL (CLONE_VM | CLONE_FS | CLONE_FILES | CLONE_SIGHAND | CLONE_PID)

	/*
	 * We want master and slave running simultaneously, each on a
	 * different processor.  Unfortunately, user space offers no
	 * guarantees, but on an idle machine, it should work.
	 */
	clone(slave, child_stack + sizeof(child_stack), CLONE_ALL, 0);
	master(0);

	for (i = 0; i < NSAMPLES; i++) {
		delta1[i] = min1 = (int)(slave_receive[i] - master_send[i]);
		delta2[i] = min2 = (int)(master_receive[i] - slave_send[i]);
		printf("%9d %9d\n", delta1[i], delta2[i]);
	}

	/*
	 * min1 and min2 start out holding the last sample's deltas;
	 * this loop skips sample 0 and covers the rest.
	 */
	for (i = 1; i < NSAMPLES-1; i++) {
		if (min1 > delta1[i])
			min1 = delta1[i];
		if (min2 > delta2[i])
			min2 = delta2[i];
	}
	printf("V2: min1 = %d, min2 = %d, diff = %d\n",
	       min1, min2, min1 - min2);
	return 0;
}

/*
 * Variables for interprocessor communications.
 * Padded to make sure there is no cache line interference.
 */
static volatile int pad0[16];
static volatile int receive_ready;
static volatile int pad1[15];
static volatile int signal_sent;
static volatile int pad2[15];

#define rdtsc(hi,lo) asm volatile("rdtsc" : "=a" (lo), "=d" (hi))

/* Send a signal, returning the TSC just before it is sent. */
static unsigned send(void)
{
	unsigned hi, lo;

	/* Send to slave */
	signal_sent = 0;
	while (!receive_ready)
		;
	rdtsc(hi,lo);	/* Waste time */
	receive_ready = 0;
	rdtsc(hi,lo);
	signal_sent = 1;

	return lo;
}

/* Receive a signal, returning the TSC just after it is received. */
static unsigned receive(void)
{
	unsigned hi, lo;

	while (signal_sent)
		;
	receive_ready = 1;
	while (!signal_sent)
		;
	rdtsc(hi,lo);
	return lo;
}
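
The program keeps only the low 32 bits of each TSC reading, so the deltas depend on unsigned subtraction wrapping modulo 2^32. The sketch below shows why that still gives the right signed difference. It uses a hypothetical helper, tsc_delta, and assumes 32-bit unsigned and int with two's-complement conversion, as on the x86 compilers this code targets.

```c
/*
 * Hypothetical helper: signed difference between two 32-bit TSC
 * low words.  The subtraction wraps modulo 2^32, so the result is
 * correct whenever the true difference fits in a signed 32-bit int,
 * even if the counter's low word rolled over between the readings.
 */
int tsc_delta(unsigned later, unsigned earlier)
{
	return (int)(later - earlier);
}
```

For example, tsc_delta(5u, 0xFFFFFFF0u) is 21 even though the low word wrapped between the two readings, and tsc_delta(100u, 9198u) is -9098, the kind of negative delta seen on the skewed machine.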
