Re: [PATCH 1/2] Make the TSC safe to be used by gettimeofday().

From: Suleiman Souhlal
Date: Mon Nov 13 2006 - 21:26:19 EST


Andi Kleen wrote:
On Tuesday 14 November 2006 02:08, Suleiman Souhlal wrote:

This is done by a per-cpu vxtime structure that stores the last TSC and HPET
values.

Whenever we switch to a userland process after a HLT instruction has been
executed or after the CPU frequency has changed, we force a new read of the
TSC, HPET and xtime so that we know the correct frequency we have to deal
with.


Hmm, interesting approach.


With this, we can safely use RDTSC in gettimeofday() in CPUs where the
TSCs are not synchronized, such as Opterons, instead of doing a very expensive
HPET read.


+ cpu = hard_smp_processor_id();


Why not smp_processor_id ?

Because CPUID returns the APIC ID, and I was under the impression that the CPU numbers
returned by smp_processor_id() are dynamically allocated and don't necessarily match the
APIC ID.
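
Put differently, the per-cpu array has to be indexed by something userland can recover
cheaply, and that is the APIC ID. Indexing by the logical CPU number instead would need a
translation step on the kernel side, something like this (just an illustration, not in the
patch):

	int cpu    = smp_processor_id();	/* logical number, dynamically assigned */
	int apicid = x86_cpu_to_apicid[cpu];	/* what CPUID leaf 1 reports in EBX[31:24] */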

+
+ apicid = hard_smp_processor_id();


The apicid <-> linux cpuid mapping should be 1:1 so i don't know
why you do this separately. All the uses of hard_smp_processor_id
are bogus.


+
+ /*
+ * We will need to also do this when switching to kernel tasks if we
+ * want to use the per-cpu monotonic_clock in the kernel
+ */
+ if (vxtime.pcpu[apicid].need_update == 1 && next_p->mm != NULL)
+ vxtime_update_pcpu();


Why? It should be really only needed on HLT/cpufreq change, or?
Anyways, I'm not very fond of adding code to the context switch critical
path.

Yes, it's only needed on HLT and cpufreq change.
The code here is to force a "resynch" with the HPET if we've done a HLT. It has to be done before we switch to any userland thread that might use the per-cpu vxtime. switch_to() seemed like the most natural place to put this.
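
The resync itself is conceptually just a re-basing of the per-cpu record; roughly (a
simplified sketch, not the literal patch code; locking and the HPET offset accumulated
since the last tick are left out):

	/* simplified sketch of vxtime_update_pcpu(), not the literal patch code */
	static void vxtime_update_pcpu(void)
	{
		int apicid = hard_smp_processor_id();

		/* pair a fresh wall-clock value with a fresh TSC reading, so the
		 * vsyscall extrapolates from a consistent base again */
		vxtime.pcpu[apicid].tv_sec   = xtime.tv_sec;
		vxtime.pcpu[apicid].tv_usec  = xtime.tv_nsec / NSEC_PER_USEC;
		vxtime.pcpu[apicid].last_tsc = get_cycles_sync();
		vxtime.pcpu[apicid].need_update = 0;
	}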



seq = read_seqbegin(&xtime_lock);
+ preempt_disable();
+ cpu = hard_smp_processor_id();


Again, shouldn't use hard



- sec = xtime.tv_sec;
- usec = xtime.tv_nsec / NSEC_PER_USEC;
+ sec = vxtime.pcpu[cpu].tv_sec;
+ usec = vxtime.pcpu[cpu].tv_usec;

/* i386 does some correction here to keep the clock monotonous even when ntpd is fixing drift.
@@ -135,9 +138,13 @@ void do_gettimeofday(struct timeval *tv)
be found. Note when you fix it here you need to do the same
in arch/x86_64/kernel/vsyscall.c and export all needed
variables in vmlinux.lds. -AK */
- usec += do_gettimeoffset();
+ t = get_cycles_sync();
+ x = (((t - vxtime.pcpu[cpu].last_tsc) *
+ vxtime.pcpu[cpu].tsc_nsquot) >> NS_SCALE) / NSEC_PER_USEC;
+ usec += x;

} while (read_seqretry(&xtime_lock, seq));
+ preempt_enable();


If it can be implemented without races in user space, why does
it need preempt disable in kernel space?

Because there is indeed a small race in user space (as you noted below) that I still haven't figured out how to address.

The faster way to access all the vxtime code would be to stick
a pointer to the per CPU vxtime data into the PDA
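
Right, that would save the APIC read on the kernel side entirely. Something along these
lines, I guess (untested sketch; the struct name and the new 'vxtime' PDA field are made
up here):

	/* untested sketch: cache a pointer in the per-CPU PDA at boot so the
	 * kernel side only does a %gs-relative load */
	struct pcpu_vxtime *px = read_pda(vxtime);

	sec  = px->tv_sec;
	usec = px->tv_usec;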



+#define NSEC_PER_TICK (NSEC_PER_SEC / HZ)
+extern unsigned long hpet_tick;
+
+extern unsigned long vxtime_hz;


No externs in .c files. Happened earlier too I think


+
static __always_inline void timeval_normalize(struct timeval * tv)
{
time_t __sec;
@@ -57,35 +67,107 @@ static __always_inline void timeval_norm
}
}

+inline int apicid(void)
+{
+ int cpu;
+
+ __asm __volatile("cpuid" : "=b" (cpu) : "a" (1) : "cx", "dx");
+ return (cpu >> 24);


The faster way to do this is to use LSL from a magic GDT entry.

A cow-orker suggested that we use SIDT and encode the CPU number in the limit of the IDT, which should be even faster than LSL.
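
For reference, the LSL variant as I understand it would be roughly this (untested, and the
selector number is made up): the kernel stores the CPU number in the limit field of a
dedicated GDT entry, and userland recovers it with a single unprivileged instruction:

	/* untested sketch of the LSL idea; 0x7b is a placeholder selector for
	 * the magic GDT entry whose limit encodes the CPU number */
	static __always_inline int cpu_from_gdt_limit(void)
	{
		unsigned int limit;

		asm("lsl %1,%0" : "=r" (limit) : "r" (0x7bU));
		return limit & 0xfff;	/* low bits of the limit hold the CPU */
	}

The SIDT version would be the same trick with the number encoded in the IDT limit, read
back with sidt into a small buffer on the stack.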


+}
+
static __always_inline void do_vgettimeofday(struct timeval * tv)
{
long sequence, t;
unsigned long sec, usec;
+ int cpu;

do {
sequence = read_seqbegin(&__xtime_lock);
-
- sec = __xtime.tv_sec;
- usec = __xtime.tv_nsec / 1000;
-
- if (__vxtime.mode != VXTIME_HPET) {
- t = get_cycles_sync();
- if (t < __vxtime.last_tsc)
- t = __vxtime.last_tsc;
- usec += ((t - __vxtime.last_tsc) *
- __vxtime.tsc_quot) >> 32;
- /* See comment in x86_64 do_gettimeofday. */
- } else {
- usec += ((readl((void __iomem *)
- fix_to_virt(VSYSCALL_HPET) + 0xf0) -
- __vxtime.last) * __vxtime.quot) >> 32;
- }
- } while (read_seqretry(&__xtime_lock, sequence));
+ cpu = apicid();
+
+ sec = __vxtime.pcpu[cpu].tv_sec;
+ usec = __vxtime.pcpu[cpu].tv_usec;
+ rdtscll(t);
+
+ usec += (((t - __vxtime.pcpu[cpu].last_tsc) *
+ __vxtime.pcpu[cpu].tsc_nsquot) >> NS_SCALE) / NSEC_PER_USEC;
+ } while (read_seqretry(&__xtime_lock, sequence) || apicid() != cpu);


Hmm, isn't there still a (unlikely, but possible) race:

CPU #0                                  CPU #1
read cpu number
                                        switch to CPU #1
                                        read TSC
switch back to CPU #0
check cpu number
check succeeds, but you got wrong TSC data

How do you prevent that? I suppose you could force the seqlock to retry
in this case in the context switch, but I don't think you do that?

I couldn't figure out how to tell from userland whether a context switch has happened. I tried adding a per-cpu context switch count, but I couldn't figure out how to read it atomically along with the CPU number.
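
Roughly what I had been trying, just to show where it stops working for me (this is not in
the posted patch, and the switch_count field name is made up):

	/* not in the posted patch: a per-cpu switch count bumped by the
	 * scheduler, rechecked together with the APIC ID after the reads */
	do {
		sequence = read_seqbegin(&__xtime_lock);
		cpu = apicid();
		gen = __vxtime.pcpu[cpu].switch_count;
		rdtscll(t);
		/* ... compute sec/usec as before ... */
	} while (read_seqretry(&__xtime_lock, sequence) ||
		 cpu != apicid() ||
		 gen != __vxtime.pcpu[cpu].switch_count);

The CPU number and the counter still come from two separate loads in those rechecks, which
is the part I don't see how to make atomic.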

Also I'm not sure what your context switch code is good for.

-Andi

Thanks for the feedback.

-- Suleiman