Re: [PATCH 2/2] x86, kvm: use kvmclock to compute TSC deadline value

From: Paolo Bonzini
Date: Thu Sep 15 2016 - 17:02:59 EST




On 15/09/2016 21:59, Radim KrÄmÃÅ wrote:
> 2016-09-15 18:00+0200, Paolo Bonzini:
>> So virtual_tsc_khz does change on migration, at least if your host
>> doesn't have TSC scaling (which is pretty much all Intel hosts in
>> existence).
>
> Ugh, I'd consider exposing constant TSC to the guest (which depends
> solely on CPUID -- modern models have it), allowing migration, and not
> preserving the TSC rate as a userspace bug.

Technically it is, but without hardware help I guess it's justified...

> Invariant TSC seems to prevent migration and the only thing that changed
> between constant and invariant TSC is that invariant TSC doesn't stop in
> guest-controlled deep C states.
>
> Hm, and Linux uses TSC_DEADLINE_TIMER without invariant TSC: setting a
> timer and halting could mean that the timer never fires ...
> maybe real hardware always has both, so it is an implicit condition?

Yeah, I guess so. And the kvmclock-based clockevent would not be able
to fix this. So we also need to blacklist the TSC deadline timer if
kvmclock doesn't set PVCLOCK_TSC_STABLE_BIT.

>> Safety against TSC rate changes is pretty much the reason
>> why kvmclock exists: it used to be that TSC rate changed on pCPU
>> migration, now it's only host migration but the idea is the same. :)
>
> Yep. I think that TSC shouldn't have been allowed outside of kvmclock.

Luckily TSC is only used by kvmclock (which is okay) and... the TSC
deadline timer. Also by nested KVM's kvmclock, but that's the last of
our worries I guess. So we're close.

>>> When we are already going the paravirtual route, we could add an
>>> interface that accepts the deadline in kvmclock nanoseconds.
>>> It would be much more maintanable than adding a fragile paravirtual
>>> layer on top of random interfaces.
>>
>> Good idea.
>
> I'll prepare a prototype.

So how would this work? A single MSR, used after setting TSC deadline
mode in LVTT? Could you write it and read TSC deadline or vice versa?

My idea would be "yes" for writing nsec deadline and reading TSC
deadline, but "no" for writing TSC deadline and reading nsec deadline.
In the latter case, reading nsec deadline might return an impossible
value such as -1; this lets userspace decide whether to set a nsec-based
deadline or a TSC-based deadline after migration. If you have other
ideas, I'm all ears!

>> This still wouldn't handle old hosts of course.
>
> The question is whether we want to carry around 150 LOC because of old
> hosts. I'd just fix Linux to avoid deadline TSC without invariant TSC.
> :)

Yes, that would automatically blacklist it on KVM. You'd also need to
update the recent optimization to the TSC deadline timer, to also work
on other APIC timer modes or at least in your new PV mode.

>>> Don't we completely replace the native apic timer, so it can't even be
>>> registered? The comment doesn't explain what we expect from the call,
>>> so it's a bit confusing.
>>>
>>> I think the call expects that per_cpu(lapic_events, cpu).event_handler
>>> == NULL, so we do it to print the warning and disable the LAPIC timer.
>>
>> This pvop is always called instead of native_local_apic_timer_interrupt.
>> We need to defer to the native implementation if the kvmclock
>> clocksource is not in use, for example if there's no TSC deadline timer
>> in the guest.
>
> No, the pvop is for that. If there is no TSC deadline timer, then the
> pvop will call native_local_apic_timer_interrupt(), because we don't
> overwrite it:
>
>>>> + if (boot_cpu_has(X86_FEATURE_TSC_DEADLINE_TIMER) &&
>>>> + !disable_apic && !disable_apic_timer) {
>>>> + pv_time_ops.local_apic_timer_interrupt = kvmclock_local_apic_timer_interrupt;
>>>> + x86_init.timers.setup_percpu_clockev = setup_kvmclock_timer;
>>>> + x86_cpuinit.setup_percpu_clockev = setup_kvmclock_timer;
>>>> + }
>>
>> So I should do s/hasn't been setup yet/isn't in use/.
>
> Worse that no comment, IMO. ;)
>
> I still think it is only to call this block in
> native_local_apic_timer_interrupt():
>
> if (!evt->event_handler) {
> pr_warning("Spurious LAPIC timer interrupt on cpu %d\n", cpu);
> /* Switch it off */
> lapic_timer_shutdown(evt);
> return;
> }

No, it was needed for something else. :) I just don't recall for what
anymore, since your observation on
pv_time_ops.local_apic_timer_interrupt is obviously correct.

Paolo