Re: [PATCH 2/2] x86, kvm: use kvmclock to compute TSC deadline value

From: Paolo Bonzini
Date: Thu Sep 15 2016 - 12:00:32 EST




On 15/09/2016 17:09, Radim KrÄmÃÅ wrote:
> 2016-09-07 00:29+0200, Paolo Bonzini:
>> Bad things happen if a guest using the TSC deadline timer is migrated.
>> The guest doesn't re-calibrate the TSC after migration, and the
>> TSC frequency can and will change unless your processor supports TSC
>> scaling (on Intel this is only Skylake) or your data center is perfectly
>> homogeneous.
>
> KVM has software scaling thanks to tsc_catchup and virtual_tsc_khz to
> allow the guest to use hardware TSC directly. The software scaling
> should recalibrate TSC on vmexits and is therefore losing precision in
> between -- are you working around the imprecision?
>
> Host should translate the requested tsc deadline into nanoseconds based
> on vcpu->arch.virtual_tsc_khz, which doesn't change on migration, so the
> solution shouldn't work, because we'd be scaling twice.

tsc_catchup is pretty much broken and not really fixable; IIRC it causes
a huge performance loss because it updates kvmclock on every vmexit.

So virtual_tsc_khz does change on migration, at least if your host
doesn't have TSC scaling (which is pretty much all Intel hosts in
existence). Safety against TSC rate changes is pretty much the reason
why kvmclock exists: it used to be that TSC rate changed on pCPU
migration, now it's only host migration but the idea is the same. :)

> I don't the general idea. TSC and TSC_DEADLINE_TIMER is a standard
> feature; If we don't emulate it, then we should not expose it in CPUID.
> rdtsc and wrmsr would still be available for kvmclock, but naive
> operating systems would not be misled.

We do emulate it TSC_DEADLINE_TIMER correctly. The problem is that the
kernel sets a deadline in nanoseconds and the nanoseconds->TSC relation
can change.

It's exactly the same issue that kvmclock is handling, except that
kvmclock handles TSC->nanoseconds and this one is the other way round.

(Another small benefit of this patch vs. the native clockevent is that
it doesn't waste time calibrating the LAPIC timer when we know the
frequency from kvmclock).

> When we are already going the paravirtual route, we could add an
> interface that accepts the deadline in kvmclock nanoseconds.
> It would be much more maintanable than adding a fragile paravirtual
> layer on top of random interfaces.

Good idea. This still wouldn't handle old hosts of course.

>> diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
>> @@ -245,6 +246,155 @@ static void kvm_shutdown(void)
>> +
>> +static int kvmclock_lapic_timer_stop(struct clock_event_device *evt)
>> +{
>> + kvmclock_lapic_timer_setup(APIC_LVT_MASKED);
>> + wrmsrl(MSR_IA32_TSC_DEADLINE, -1);
>
> This wrmsrl() can't inject an interrupt, because later switch to
> TSCDEADLINE mode disarms the timer, but it would be better to remove it.

You mean leave only kvmclock_lapic_timer_setup(APIC_LVT_MASKED)?

>> + /* Cap the wait to avoid overflow. */
>> + if (unlikely(delta_ns > KVMCLOCK_TSC_DEADLINE_MAX))
>> + delta_ns = KVMCLOCK_TSC_DEADLINE_MAX;
>> +
>> + /*
>> + * delta_tsc = delta_ns << (32-tsc_shift) / tsc_to_system_mul.
>> + * The shift is split in two steps so that a 38 bits (275 s)
>> + * deadline fits into the 64-bit dividend.
>> + */
>> + shift = 32 - src->tsc_shift;
>> +
>> + /* First shift step... */
>> + delta_ns <<= 64 - KVMCLOCK_TSC_DEADLINE_MAX_BITS;
>> + shift -= 64 - KVMCLOCK_TSC_DEADLINE_MAX_BITS;
>
> Considering that the usual shift seems to be -1, we'd be losing 7 bits
> of precision. The nice thing is that the precision is bounded up to the
> cap ... I dislike the cap and LOC count, though, so the following,
> although not as tightly bounded seems a bit better:
>
> u64 mult = div_u64(1ULL << 63, tsc_to_system_mul);
> int shift = 63 - (32 - tsc_shift));
>
> tsc = src->tsc_timestamp + mul_u64_u64_shr(delta_ns, mult, shift);
>
> mult should be quite stable, so we could cache it.
> (And if we didn't care about performance loss from 4(?) divisions, we
> could have much nicer shl_div().)

What are the four divisions? (And yes, I agree this is quite nice).

>> +/*
>> + * The local apic timer can be used for any function which is CPU local.
>> + */
>> +static struct clock_event_device kvm_clockevent = {
>> + .name = "lapic",
>
> "lapic" is already used -- what about "kvm-lapic" or "kvm-tsc-deadline"?

Of course.

>> + /* Under KVM the LAPIC timer always runs in deep C-states. */
>> + .features = CLOCK_EVT_FEAT_ONESHOT | CLOCK_EVT_FEAT_KTIME,
>> + .set_state_shutdown = kvmclock_lapic_timer_stop,
>> + .set_state_oneshot = kvmclock_lapic_timer_set_oneshot,
>> + .set_next_ktime = kvmclock_lapic_next_ktime,
>> + .mult = 1,
>> + /* Make LAPIC timer preferrable over percpu HPET */
>> + .rating = 150,
>> + .irq = -1,
>> +};
>> +static DEFINE_PER_CPU(struct clock_event_device, kvm_events);
>> +
>> +static void kvmclock_local_apic_timer_interrupt(void)
>> +{
>> + int cpu = smp_processor_id();
>> + struct clock_event_device *evt = &per_cpu(kvm_events, cpu);
>> +
>> + /*
>> + * Defer to the native clockevent if ours hasn't been setup yet.
>> + */
>> + if (!evt->event_handler) {
>> + native_local_apic_timer_interrupt();
>
> Don't we completely replace the native apic timer, so it can't even be
> registered? The comment doesn't explain what we expect from the call,
> so it's a bit confusing.
>
> I think the call expects that per_cpu(lapic_events, cpu).event_handler
> == NULL, so we do it to print the warning and disable the LAPIC timer.

This pvop is always called instead of native_local_apic_timer_interrupt.
We need to defer to the native implementation if the kvmclock
clocksource is not in use, for example if there's no TSC deadline timer
in the guest.

So I should do s/hasn't been setup yet/isn't in use/.

>> + return;
>> + }
>> +
>> + inc_irq_stat(apic_timer_irqs);
>> + evt->event_handler(evt);
>> +}
>> +
>> +/*
>> + * Setup the local APIC timer for this CPU. Copy the initialized values
>> + * of the boot CPU and register the clock event in the framework.
>> + */
>> +static void setup_kvmclock_timer(void)
>> +{
>> + struct clock_event_device *evt = this_cpu_ptr(&kvm_events);
>> +
>> + kvmclock_lapic_timer_stop(evt);
>
> Why stop the timer here? We don't even know if the clockevent device
> will be used, so it seems out of order.

Yeah, you're right. And as you noticed we never switch from native to
pv clockevent, because setup_percpu_clockev is replaced completely.

Paolo

>> + memcpy(evt, &kvm_clockevent, sizeof(*evt));
>> + evt->cpumask = cpumask_of(smp_processor_id());
>> + clockevents_register_device(evt);
>> +}
>> +#endif
>> +
>> void __init kvmclock_init(void)
>> {
>> struct pvclock_vcpu_time_info *vcpu_time;