Interesting. The Andi has been bugging me for a similarly designed
per-cpu TSC clocksource, but just for generic use. I'm a little
skeptical that it will be 100% without error, since anything dealing w/
the TSCs have been nothing but trouble in my mind, but this looks like a
good proving ground for the concept.
It was mentioned to me that the clocksource approach helped cleanup some
of the xen time changes (is that really true? :), but there were still
some outstanding issues (time inconsistencies, perhaps?). I'm just
curious if there are any details about the issues there, or if I
misunderstood?
This does not implement setting the hypervisor clock; nor does it
implement non-independent time, so hypervisor wallclock changes will
not affect the guest.
Hmmm. I'm not sure if I understood that last line or not. I guess I need
to think a bit about CLOCK_REALTIME vs CLOCK_MONOTONIC wrt
hypervisiors.
I guess the question is "who owns time?" the guest OS (does it have its
own CLOCK_REALTIME, independent of other guests?) or does the
hypervisor? What does NTPd running on a guest actually adjust?
+/* Permitted clock jitter, in nsecs, beyond which a warning will be printed. */
+static unsigned long permitted_clock_jitter = 10000000UL; /* 10ms */
+static int __init __permitted_clock_jitter(char *str)
+{
+ permitted_clock_jitter = simple_strtoul(str, NULL, 0);
+ return 1;
+}
+__setup("permitted_clock_jitter=", __permitted_clock_jitter);
permitted_clock_jitter is a little vague and might get confused w/ the
NTP notion of jitter. Is there a better name, or could we get a xen_
prefix there?
+/* These are perodically updated in shared_info, and then copied here. */
+struct shadow_time_info {
+ u64 tsc_timestamp; /* TSC at last update of time vals. */
+ u64 system_timestamp; /* Time, in nanosecs, since boot. */
+ u32 tsc_to_nsec_mul;
+ u32 tsc_to_usec_mul;
Hmmm. Keeping separate cycle->usec and cycle->nsec multipliers is an
interesting optimization. I'd even consider it for the generic
clocksource code, but I suspect recalculating the independent adjustment
factors for both kills the performance benefit. Have you actually
compaired against the cost of the /1000 going from nsec to usec?
This structure is derived from the time structure Xen maintains in shared memory. If the shared memory version matches the local shadow version, we don't need to sync the two (I'm not sure if this is actually used in this patch).+ int tsc_shift;
+ u32 version;
Errr.. Why is a version value necessary?
+
+static DEFINE_PER_CPU(struct shadow_time_info, shadow_time);
+
+/* Keep track of last time we did processing/updating of jiffies and xtime. */
+static u64 processed_system_time; /* System time (ns) at last processing. */
+static DEFINE_PER_CPU(u64, processed_system_time);
Errr. That would confuse me right off. Global and per-cpu values having
the same name?
+/* How much CPU time was spent blocked and how much was 'stolen'? */
+static DEFINE_PER_CPU(u64, processed_stolen_time);
+static DEFINE_PER_CPU(u64, processed_blocked_time);
These seem like more generic accounting structures. Surely other
virtualized arches have something similar? Something that should be
looked into.
+ if (shift < 0)
+ delta >>= -shift;
+ else
+ delta <<= shift;
I think there is a shift_right() macro that can avoid this.
Also I'm not sure I follow why you shift before multiply instead of
multiply before shift? Does that not hurt your precision?
+#ifdef __i386__
+ __asm__ (
+ "mul %5 ; "
+ "mov %4,%%eax ; "
+ "mov %%edx,%4 ; "
+ "mul %5 ; "
+ "xor %5,%5 ; "
+ "add %4,%%eax ; "
+ "adc %5,%%edx ; "
+ : "=A" (product), "=r" (tmp1), "=r" (tmp2)
+ : "a" ((u32)delta), "1" ((u32)(delta >> 32)), "2" (mul_frac) );
+#elif __x86_64__
+ __asm__ (
+ "mul %%rdx ; shrd $32,%%rdx,%%rax"
+ : "=a" (product) : "0" (delta), "d" ((u64)mul_frac) );
+#else
+#error implement me!
+#endif
+
+ return product;
+}
I think we need some generic mul_llxl_ll() wrappers here.
+get_nsec_offset is a little generic for a name. I know xen_ prefixes
+static u64 get_nsec_offset(struct shadow_time_info *shadow)
everywhere are irritating, but maybe something a little more specific
would be a good idea.
+static cycle_t xen_clocksource_read(void)
+{
+ struct shadow_time_info *shadow = &per_cpu(shadow_time, smp_processor_id());
+
+ get_time_values_from_xen();
+
+ return shadow->system_timestamp + get_nsec_offset(shadow);
+}
Does get_time_values_from_xen() really need to be called on every
clocksource_read call?
+static void init_cpu_khz(void)
+{
+ u64 __cpu_khz = 1000000ULL << 32;
+ struct vcpu_time_info *info;
+ info = &HYPERVISOR_shared_info->vcpu_info[0].time;
+ do_div(__cpu_khz, info->tsc_to_system_mul);
+ if (info->tsc_shift < 0)
+ cpu_khz = __cpu_khz << -info->tsc_shift;
+ else
+ cpu_khz = __cpu_khz >> info->tsc_shift;
+}
Err.. That could use some comments.
+static struct clocksource xen_clocksource = {
+ .name = "xen",
+ .rating = 400,
+ .read = xen_clocksource_read,
+ .mask = ~0,
+ .mult = 1, /* time directly in nanoseconds */
+ .shift = 0,
+ .is_continuous = 1
+};
Hmmm. The 1/0 mul/shift pair is interesting. Is it expected that NTP
does not ever adjust this clocksource? If not the clocksource_adjust()
function won't do well with this at all, so you might consider something
like:
#define XEN_SHIFT 22
.mult = 1<<XEN_SHIFT
.shift = XEN_SHIFT
+static void init_missing_ticks_accounting(int cpu)
+{
+ struct vcpu_register_runstate_memory_area area;
+ struct vcpu_runstate_info *runstate = &per_cpu(runstate, cpu);
+
+ memset(runstate, 0, sizeof(*runstate));
+
+ area.addr.v = runstate;
+ HYPERVISOR_vcpu_op(VCPUOP_register_runstate_memory_area, cpu, &area);
+
+ per_cpu(processed_blocked_time, cpu) =
+ runstate->time[RUNSTATE_blocked];
+ per_cpu(processed_stolen_time, cpu) =
+ runstate->time[RUNSTATE_runnable] +
+ runstate->time[RUNSTATE_offline];
+}
Again, this accounting seems like it could be generically useful.
+__init void time_init_hook(void)
+{
+ get_time_values_from_xen();
+
+ processed_system_time = per_cpu(shadow_time, 0).system_timestamp;
+ per_cpu(processed_system_time, 0) = processed_system_time;
+
+ init_cpu_khz();
+ printk(KERN_INFO "Xen reported: %u.%03u MHz processor.\n",
+ cpu_khz / 1000, cpu_khz % 1000);
+
+ /* Cannot request_irq() until kmem is initialised. */
+ late_time_init = setup_cpu0_timer_irq;
+
+ init_missing_ticks_accounting(0);
+
+ clocksource_register(&xen_clocksource);
+
+ /* Set initial system time with full resolution */
+ xen_get_wallclock(&xtime);
+ set_normalized_timespec(&wall_to_monotonic,
+ -xtime.tv_sec, -xtime.tv_nsec);
+}
Some mention of which functions require to hold what on xtime_lock would
be useful as well (applies to this function as well as the previous ones
already commented on).
My only thoughts after looking at it: Using nanoseconds as a primary
unit is often easier to work with, but less efficient. So rather then
keeping a tsc_timestamp + system_timestamp in two different units, why
not keep a calculated TSC base that includes the "cycles since boot"
which is adjusted in the same manner internally to Xen as the
system_timestamp is. Then let the timekeeping code do the conversion for
you.
I haven't fully thought about what else it would affect in the above (I
realize stolen_time, etc is in nsecs), but it might be something to
consider.
Am I making any sense or just babbling?