Re: [PATCH 2/2] x86, kvm: use kvmclock to compute TSC deadline value

From: Radim Krčmář
Date: Thu Sep 15 2016 - 16:00:06 EST


2016-09-15 18:00+0200, Paolo Bonzini:
> On 15/09/2016 17:09, Radim Krčmář wrote:
>> 2016-09-07 00:29+0200, Paolo Bonzini:
>>> Bad things happen if a guest using the TSC deadline timer is migrated.
>>> The guest doesn't re-calibrate the TSC after migration, and the
>>> TSC frequency can and will change unless your processor supports TSC
>>> scaling (on Intel this is only Skylake) or your data center is perfectly
>>> homogeneous.
>>
>> KVM has software scaling thanks to tsc_catchup and virtual_tsc_khz to
>> allow the guest to use hardware TSC directly. The software scaling
>> should recalibrate TSC on vmexits and is therefore losing precision in
>> between -- are you working around the imprecision?
>>
>> Host should translate the requested tsc deadline into nanoseconds based
>> on vcpu->arch.virtual_tsc_khz, which doesn't change on migration, so the
>> solution shouldn't work, because we'd be scaling twice.
>
> tsc_catchup is pretty much broken and not really fixable; IIRC it causes
> a huge performance loss because it updates kvmclock on every vmexit.

Yeah ... catchup needs to update TSC_OFFSET, which can't be done without
updating kvmclock.
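
The guest-side read shows the dependency -- kvmclock time is computed
from the guest-visible TSC, so any TSC_OFFSET change has to come with
freshly written pvclock parameters or the clock jumps. A rough sketch,
assuming pvti is the vcpu's struct pvclock_vcpu_time_info, not the exact
pvclock_clocksource_read():

        /* simplified guest-side kvmclock read */
        u64 delta = rdtsc() - pvti->tsc_timestamp;  /* guest TSC includes TSC_OFFSET */
        u64 ns = pvti->system_time +
                 pvclock_scale_delta(delta, pvti->tsc_to_system_mul,
                                     pvti->tsc_shift);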

> So virtual_tsc_khz does change on migration, at least if your host
> doesn't have TSC scaling (which is pretty much all Intel hosts in
> existence).

Ugh, I'd consider it a userspace bug to expose constant TSC to the guest
(which depends solely on CPUID -- modern models have it), allow
migration, and then not preserve the TSC rate.

Invariant TSC seems to prevent migration, and the only difference between
constant and invariant TSC is that invariant TSC doesn't stop in
guest-controlled deep C-states.

Hm, and Linux uses TSC_DEADLINE_TIMER without invariant TSC: setting a
timer and halting could mean that the timer never fires ...
maybe real hardware always has both, so it is an implicit condition?
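
For reference, invariant TSC is CPUID.80000007H:EDX[8]; constant TSC is
only inferred from family/model, so the bit below is all a guest could
actually check -- a sketch:

        /* sketch: the invariant TSC bit a guest can test for */
        bool invariant_tsc = cpuid_edx(0x80000007) & (1 << 8);
        /* roughly equivalent to boot_cpu_has(X86_FEATURE_NONSTOP_TSC) */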

> Safety against TSC rate changes is pretty much the reason
> why kvmclock exists: it used to be that TSC rate changed on pCPU
> migration, now it's only host migration but the idea is the same. :)

Yep. I think that TSC shouldn't have been allowed outside of kvmclock.

>> I don't like the general idea. TSC and TSC_DEADLINE_TIMER are standard
>> features; if we don't emulate them, then we should not expose them in
>> CPUID. rdtsc and wrmsr would still be available for kvmclock, but naive
>> operating systems would not be misled.
>
> We do emulate TSC_DEADLINE_TIMER correctly. The problem is that the
> kernel sets a deadline in nanoseconds and the nanoseconds->TSC relation
> can change.
>
> It's exactly the same issue that kvmclock is handling, except that
> kvmclock handles TSC->nanoseconds and this one is the other way round.

True, it is the TSC that is not emulated correctly.
TSC_DEADLINE_TIMER is a victim of past decisions.

> (Another small benefit of this patch vs. the native clockevent is that
> it doesn't waste time calibrating the LAPIC timer when we know the
> frequency from kvmclock).

calibrate_APIC_clock() just returns 0 early when using the TSC deadline
timer, and the setup re-uses the TSC calibration that was done earlier, so
it's an even smaller benefit.
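
Roughly the relevant early return, paraphrased from calibrate_APIC_clock()
rather than quoted verbatim:

        if (boot_cpu_has(X86_FEATURE_TSC_DEADLINE_TIMER)) {
                /*
                 * The deadline timer is programmed with absolute TSC
                 * values, so there is no LVTT frequency to calibrate.
                 */
                return 0;
        }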

>> When we are already going the paravirtual route, we could add an
>> interface that accepts the deadline in kvmclock nanoseconds.
>> It would be much more maintainable than adding a fragile paravirtual
>> layer on top of random interfaces.
>
> Good idea.

I'll prepare a prototype.
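
To make the idea concrete, the guest half could be as small as the sketch
below; MSR_KVM_PV_TIMER_NS is a made-up name, and a real interface would
need a CPUID feature bit and a documented time domain:

        /* hypothetical PV clockevent callback: deadline handed over in ns */
        static int kvm_pv_timer_next_ktime(ktime_t expires,
                                           struct clock_event_device *evt)
        {
                /* made-up MSR; the host would arm the timer in kvmclock time */
                wrmsrl(MSR_KVM_PV_TIMER_NS, ktime_to_ns(expires));
                return 0;
        }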

> This still wouldn't handle old hosts of course.

The question is whether we want to carry around 150 LOC because of old
hosts. I'd just fix Linux to avoid deadline TSC without invariant TSC.
:)
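
Roughly what I mean -- a sketch only, assuming we'd simply clear the
feature during early CPU setup when invariant TSC is missing:

        /* hypothetical fixup: no deadline timer without invariant TSC */
        if (!boot_cpu_has(X86_FEATURE_NONSTOP_TSC))
                setup_clear_cpu_cap(X86_FEATURE_TSC_DEADLINE_TIMER);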

>>> diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
>>> @@ -245,6 +246,155 @@ static void kvm_shutdown(void)
>>> +
>>> +static int kvmclock_lapic_timer_stop(struct clock_event_device *evt)
>>> +{
>>> +        kvmclock_lapic_timer_setup(APIC_LVT_MASKED);
>>> +        wrmsrl(MSR_IA32_TSC_DEADLINE, -1);
>>
>> This wrmsrl() can't inject an interrupt, because the later switch to
>> TSC-deadline mode disarms the timer, but it would be better to remove it.
>
> You mean leave only kvmclock_lapic_timer_setup(APIC_LVT_MASKED)?

Yes. I didn't understand the -1 write at all, when 0 is the value that
disarms the timer, and even that write isn't needed.
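
The stop callback would then reduce to masking the LVT entry, something
like the sketch below (and if an MSR write were kept at all, 0 is the
architected value that disarms the deadline timer):

        static int kvmclock_lapic_timer_stop(struct clock_event_device *evt)
        {
                kvmclock_lapic_timer_setup(APIC_LVT_MASKED);
                return 0;
        }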

>>> +        /* Cap the wait to avoid overflow. */
>>> +        if (unlikely(delta_ns > KVMCLOCK_TSC_DEADLINE_MAX))
>>> +                delta_ns = KVMCLOCK_TSC_DEADLINE_MAX;
>>> +
>>> +        /*
>>> +         * delta_tsc = delta_ns << (32-tsc_shift) / tsc_to_system_mul.
>>> +         * The shift is split in two steps so that a 38-bit (275 s)
>>> +         * deadline fits into the 64-bit dividend.
>>> +         */
>>> +        shift = 32 - src->tsc_shift;
>>> +
>>> +        /* First shift step... */
>>> +        delta_ns <<= 64 - KVMCLOCK_TSC_DEADLINE_MAX_BITS;
>>> +        shift -= 64 - KVMCLOCK_TSC_DEADLINE_MAX_BITS;
>>
>> Considering that the usual shift seems to be -1, we'd be losing 7 bits
>> of precision. The nice thing is that the precision is bounded up to the
>> cap ... I dislike the cap and LOC count, though, so the following,
>> although not as tightly bounded, seems a bit better:
>>
>>   u64 mult = div_u64(1ULL << 63, tsc_to_system_mul);
>>   int shift = 63 - (32 - tsc_shift);
>>
>>   tsc = src->tsc_timestamp + mul_u64_u64_shr(delta_ns, mult, shift);
>>
>> mult should be quite stable, so we could cache it.
>> (And if we didn't care about the performance loss from 4(?) divisions,
>> we could have a much nicer shl_div().)
>
> What are the four divisions?

Ah, sorry, just a guess at how many divisions it would take to do
"u128 / u32", or rather "(u64 << u6) / u32" (the shl_div() in question),
with full precision. I don't know big-number arithmetic well enough.
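
Putting the suggestion together, a sketch of the whole conversion, with a
helper name that isn't in the patch:

        /* delta_ns -> guest TSC ticks: delta_ns * 2^(32 - tsc_shift) / mul */
        static u64 kvmclock_ns_to_tsc_deadline(struct pvclock_vcpu_time_info *src,
                                               u64 delta_ns)
        {
                u64 mult = div_u64(1ULL << 63, src->tsc_to_system_mul);
                int shift = 63 - (32 - src->tsc_shift);

                return src->tsc_timestamp + mul_u64_u64_shr(delta_ns, mult, shift);
        }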

>>> +        /* Under KVM the LAPIC timer keeps running in deep C-states. */
>>> +        .features = CLOCK_EVT_FEAT_ONESHOT | CLOCK_EVT_FEAT_KTIME,
>>> +        .set_state_shutdown = kvmclock_lapic_timer_stop,
>>> +        .set_state_oneshot = kvmclock_lapic_timer_set_oneshot,
>>> +        .set_next_ktime = kvmclock_lapic_next_ktime,
>>> +        .mult = 1,
>>> +        /* Make the LAPIC timer preferable over the percpu HPET */
>>> +        .rating = 150,
>>> +        .irq = -1,
>>> +};
>>> +static DEFINE_PER_CPU(struct clock_event_device, kvm_events);
>>> +
>>> +static void kvmclock_local_apic_timer_interrupt(void)
>>> +{
>>> +        int cpu = smp_processor_id();
>>> +        struct clock_event_device *evt = &per_cpu(kvm_events, cpu);
>>> +
>>> +        /*
>>> +         * Defer to the native clockevent if ours hasn't been setup yet.
>>> +         */
>>> +        if (!evt->event_handler) {
>>> +                native_local_apic_timer_interrupt();
>>
>> Don't we completely replace the native apic timer, so it can't even be
>> registered? The comment doesn't explain what we expect from the call,
>> so it's a bit confusing.
>>
>> I think the call expects that per_cpu(lapic_events, cpu).event_handler
>> == NULL, so we do it to print the warning and disable the LAPIC timer.
>
> This pvop is always called instead of native_local_apic_timer_interrupt.
> We need to defer to the native implementation if the kvmclock
> clocksource is not in use, for example if there's no TSC deadline timer
> in the guest.

No, the pvop is for that. If there is no TSC deadline timer, then the
pvop will call native_local_apic_timer_interrupt(), because we don't
overwrite it:

>>> +        if (boot_cpu_has(X86_FEATURE_TSC_DEADLINE_TIMER) &&
>>> +            !disable_apic && !disable_apic_timer) {
>>> +                pv_time_ops.local_apic_timer_interrupt = kvmclock_local_apic_timer_interrupt;
>>> +                x86_init.timers.setup_percpu_clockev = setup_kvmclock_timer;
>>> +                x86_cpuinit.setup_percpu_clockev = setup_kvmclock_timer;
>>> +        }
>
> So I should do s/hasn't been setup yet/isn't in use/.

Worse than no comment, IMO. ;)

I still think the only purpose of the call is to reach this block in
native_local_apic_timer_interrupt():

        if (!evt->event_handler) {
                pr_warning("Spurious LAPIC timer interrupt on cpu %d\n", cpu);
                /* Switch it off */
                lapic_timer_shutdown(evt);
                return;
        }

Those dependencies make it confusing.
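
For reference, my understanding of the dispatch added in the other patch
of the series, paraphrased rather than quoted:

        /* paraphrased: the timer interrupt always goes through the pvop,
         * whose default is native_local_apic_timer_interrupt() */
        void local_apic_timer_interrupt(void)
        {
                pv_time_ops.local_apic_timer_interrupt();
        }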