Re: do_gettimeofday vs. rdtsc in the scheduler

From: Andrea Arcangeli
Date: Thu Sep 19 2002 - 13:02:29 EST

On Tue, Sep 17, 2002 at 06:04:33PM -0700, James Cleverdon wrote:
> have a separate clock input for it that runs at 1 MHz so skew and

The clock input should be the same, or the counters can always drift out
of sync if you leave them running long enough. The timer generation is
an analog thing, the reception is digital, so having a single timer
guarantees no counter skew.

If the precision we needed from the timer driving gettimeofday were
1 Hz, i.e. one tick per second, you could make it scale perfectly
without oscillations on a 256G box.

You simply can't do that with a < 1 nanosecond tick period on more than
a few cpus, because of physics. Otherwise you get what's been mentioned
a number of times on this thread: oscillations generated by the latency
of the signal delivery, or a further slowdown in accessing the
information because of overhead in the interconnects.
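
To put rough numbers on the "because of physics" point, here is a
back-of-envelope sketch. It assumes the timer edge travels at the vacuum
speed of light, which is only an upper bound (real wires and
interconnects are slower), and the frequencies listed are just
illustrative picks, not measurements of any machine.

```python
# Back-of-envelope sketch: how far can a timer edge travel in one tick?
# Assumption: propagation at the vacuum speed of light (an upper bound).
C_M_PER_S = 299_792_458  # speed of light in vacuum, metres per second


def max_reach_m(tick_hz):
    """Upper bound on the distance a signal covers within one tick period."""
    return C_M_PER_S / tick_hz


# Illustrative frequencies only (hypothetical picks for comparison).
for label, hz in [("2 GHz (TSC-like)", 2e9),
                  ("1 MHz", 1e6),
                  ("100 kHz", 1e5),
                  ("10 kHz", 1e4)]:
    period_usec = 1e6 / hz
    print(f"{label:>18}: period {period_usec:10.4f} usec, "
          f"reach <= {max_reach_m(hz):10.2f} m")
```

At 2 GHz the edge covers less than 15 cm per tick even in the best case,
while at 100 kHz it has kilometres of margin.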

The best hardware solution to this problem is to have two cpu registers
incremented by two timers. One is the regular cpu tick (TSC) that we
have today, which could even go away with asynchronous cpus. The other
would be the new "real time timer", a 10/100 kHz clock delivered to all
the cpus that increments an in-cpu-core counter. It could then be read
from userspace too, inside vgettimeofday, with extremely low latency,
exactly like the current TSC, but driven by this secondary low
frequency timer that tells us about the time changes. 10/100 usec
should be much more than enough margin to deliver this timer to all the
hundred cpus with a very small oscillation. And no software that I'm
aware of needs a time-of-day precision finer than 10/100 usec. An
interrupt itself is going to take some usec. A context switch is going
to take more than 10 usec as well, and that's the important bit to
guarantee gettimeofday is monotonic. Different threads can have a minor
difference in their perception of the time, dominated by the speed of
light delivery of the timer signal; that's not a problem as far as it's

The TSC and also the system clock mentioned by Dave are way too fast to
be kept synchronized on a NUMA machine without introducing significant
drifts and oscillations.
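
The monotonicity argument for a shared low frequency timer can be
sketched with a toy model. Everything here is an assumption made for
illustration: each cpu sees the timer edges late by a fixed per-cpu
delay, the in-core counter is just the number of edges seen so far, and
a thread reads the counter on cpu A, migrates (paying the context switch
cost), then reads it again on cpu B. The function names and the numbers
are hypothetical.

```python
import math


def counter_value(t_usec, delay_usec, period_usec):
    """Edges seen at true time t by a cpu that receives each edge
    delay_usec late (toy model, not real hardware behaviour)."""
    return math.floor((t_usec - delay_usec) / period_usec)


def monotonic_across_migration(delay_a, delay_b, switch_usec, period_usec,
                               samples=4000, step_usec=0.25):
    """Read on cpu A at time t, migrate, read on cpu B at t + switch_usec.
    Return True if the second reading is never smaller than the first."""
    start = max(delay_a, delay_b)
    for i in range(samples):
        t = start + i * step_usec
        before = counter_value(t, delay_a, period_usec)
        after = counter_value(t + switch_usec, delay_b, period_usec)
        if after < before:
            return False
    return True


# 100 kHz timer (10 usec period), 2 usec delivery skew between the cpus.
print(monotonic_across_migration(0.0, 2.0, 10.0, 10.0))  # True
# Same skew, but a hypothetical 0.5 usec "context switch": time goes back.
print(monotonic_across_migration(0.0, 2.0, 0.5, 10.0))   # False
```

In this model the reading stays monotonic whenever the context switch
costs at least as much as the delivery skew between the two cpus, which
is easy to guarantee when the skew is a small fraction of a 10 usec
tick.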

If somebody really needs 1 usec resolution, he will first need vsyscalls
to avoid the enter/exit kernel latencies, and likely he will need to run
under iopl with irqs disabled, so it should be ok to use the TSC in that
case with a specialized hacked kernel config option (with all the
disclaimers that it would break if the cpu clock changes under you
etc...). All mere mortals will be perfectly fine with a 100 kHz clock
for gettimeofday. If Sun did a 1 MHz clock to achieve the design
suggested above, then they did the optimal thing IMHO.
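
The "it would break if the cpu clock changes under you" disclaimer can
be shown with a little arithmetic. This is only a sketch: the helper
name and the frequencies are made up, and it assumes a TSC that ticks at
the current cpu clock while the software keeps converting with the
frequency it calibrated at boot.

```python
def tsc_delta_to_usec(delta_cycles, cpu_hz):
    """Convert a TSC cycle delta to microseconds for a given cpu clock."""
    return delta_cycles * 1_000_000 / cpu_hz


calibrated_hz = 2_000_000_000  # frequency assumed at calibration (2 GHz)
actual_hz = 1_000_000_000      # hypothetical: the cpu later halved its clock

cycles = 4_000                 # cycle delta observed between two reads

reported = tsc_delta_to_usec(cycles, calibrated_hz)  # what software believes
elapsed = tsc_delta_to_usec(cycles, actual_hz)       # what really passed

print(reported)  # 2.0
print(elapsed)   # 4.0
```

With the stale calibration the software reports half of the time that
really elapsed, so this is only safe on a box where the cpu clock is
known never to change.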

Another approach would be to use separate timer sources per cpu and to
resynchronize them at regular intervals, short enough to guarantee the
drift never grows above half the time of the shortest context switch.
But it would need tedious software support with knowledge of very
low-level hardware information, so I'd definitely prefer the previously
mentioned solution, which requires all hardware vendors to get it right
or it won't work. That's like what's happening now with the TSC, with
the difference that the 100 kHz timer would be doable, while the TSC at
2 GHz isn't doable.
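
For the per-cpu timer alternative, the resync interval follows directly
from the drift budget. A minimal sketch, assuming a constant worst-case
relative drift between two timer sources expressed in ppm (the 50 ppm
and 10 usec figures are hypothetical, not taken from any hardware
spec):

```python
def max_resync_interval_sec(shortest_switch_usec, drift_ppm):
    """Longest time between resyncs so that two clocks drifting apart at
    drift_ppm never differ by more than half the shortest context switch."""
    budget_usec = shortest_switch_usec / 2
    # 1 ppm of relative drift accumulates 1 usec of error per second.
    return budget_usec / drift_ppm


# Hypothetical numbers: 10 usec shortest context switch, 50 ppm drift.
print(max_resync_interval_sec(10, 50))  # 0.1
```

So with those assumed numbers the timers would have to be resynchronized
at least ten times per second, and the software would need to know the
real worst-case drift of the hardware to pick the interval safely.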

Of course the cyclone timer and the HPET are the very next best thing
the hardware vendors could provide us on x86, and of course you cannot
do better than the cyclone and HPET without upgrading the cpu too,
because the cpu is simply missing a register to avoid hitting the
southbridge at every vgettimeofday. At least the good thing is that HPET
is mapped in an mmio region, so we don't need to enter the kernel but
only to access the southbridge from userspace, and that saves a number
of usec at every gettimeofday.

All of this assumes gettimeofday is an important operation and that an
additional cpu sequence counter and an additional numa-shared timer
would pay off to make gettimeofday most efficient and most accurate on
all classes of machines. It would also be an option to replace the TSC
with such a new "real time counter" if adding a new counter is too
expensive. The TSC is almost unusable in its current too-high-frequency
form: it is useful only for microbenchmarking, so it's more a debugging
facility than a production feature, while the other would be a really
useful feature, and not only for debugging/benchmarking purposes.

