Re: [PATCH 1/5] x86/kvm: On KVM re-enable (e.g. after suspend), update clocks
From: Radim Krcmar
Date: Thu Mar 17 2016 - 11:10:21 EST
2016-03-16 16:07-0700, Andy Lutomirski:
> On Wed, Mar 16, 2016 at 3:59 PM, Radim Krcmar <rkrcmar@xxxxxxxxxx> wrote:
>> 2016-03-16 15:15-0700, Andy Lutomirski:
>>> FWIW, if you ever intend to support ART ("always running timer")
>>> passthrough, this is going to be a giant clusterfsck. Good luck. I
>>> haven't gotten a straight answer as to what hardware actually supports
that thing, so even testing isn't so easy.
>>
>> Hm, AR TSC would be best handled by doing nothing ... dropping the
>> faking logic just became tempting.
ART is different from what I initially thought: it's the underlying
mechanism for invariant TSC and nothing more ... we already forbid
migrations when the guest knows about invariant TSC, so we could do the
same and let ART be virtualized. (Suspend has to be forbidden too.)
> As it stands, ART is screwed if you adjust the VMCS's tsc offset. But
Luckily, assigning real hardware can prevent migration or suspend, so we
won't need to adjust the offset during runtime. TSC is a generally
unmigratable device that just happens to live on the CPU.
(It would have been better to hide TSC capability from the guest and only
use rdtsc for kvmclock if the guest wanted fancy features.)
> I think it's also screwed if you migrate to a machine with a different
> ratio of guest TSC ticks to host ART ticks or a different offset,
> because the host isn't going to do the rdmsr every time it tries to
> access the ART, so passing it through might require a paravirt
> mechanism no matter what.
It's almost certain that the other host will have a different offset,
which makes TSC unmigratable in software without even considering ART
or frequencies. Well, KVM already emulates different TSC frequency, so
we could emulate ART without sinking much lower. :)
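
To make the offset problem concrete, here is a minimal sketch of plain TSC
offsetting (no frequency scaling). It is not KVM's code; the function names
are hypothetical and the elapsed-cycle count is assumed to be an estimate
supplied by the caller.

```c
#include <stdint.h>

/*
 * With plain offsetting, the value a guest reads from RDTSC is the host
 * TSC plus the offset programmed by the hypervisor. The arithmetic is
 * modulo 2^64, so a "negative" offset is just a large unsigned value.
 */
uint64_t guest_tsc_from_host(uint64_t host_tsc, uint64_t tsc_offset)
{
    return host_tsc + tsc_offset;
}

/*
 * Hypothetical helper: choose the offset to program after a suspend or a
 * migration so that the guest TSC continues from its last known value
 * plus an estimate of the guest cycles that passed in the meantime.
 * new_host_tsc is the TSC of the host the guest now runs on, which in
 * general has nothing to do with the old host's TSC.
 */
uint64_t resume_tsc_offset(uint64_t last_guest_tsc,
                           uint64_t elapsed_guest_cycles,
                           uint64_t new_host_tsc)
{
    return last_guest_tsc + elapsed_guest_cycles - new_host_tsc;
}
```

Passing zero elapsed cycles corresponds to "pausing" the guest TSC. Any
change of the offset breaks a relation the guest derived earlier between
its TSC and a passed-through ART, since the ART itself is not offset.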
> ISTM that, if KVM tries to keep the guest TSC monotonic across
> migration, it should probably also keep it monotonic across host
> suspend/resume.
Yes, "pausing" TSC during suspend or migration is one way of improving
the TSC estimate. If we want to emulate ART, then the estimate is
noticeably lacking, because TSC and ART are defined by a simple
equation (SDM 2015-12, 17.14.4 Invariant Time-Keeping):
  TSC_Value = (ART_Value * CPUID.15H:EBX[31:0]) / CPUID.15H:EAX[31:0] + K
where the guest thinks that CPUID and K are constant (between events
that the guest knows of), so we should give the best estimate of how
many TSC cycles have passed. (The best estimate is still lacking.)
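
For illustration, the SDM equation can be evaluated as in the sketch below.
The function name and signature are hypothetical; the numerator and
denominator are assumed to have been read from CPUID.15H:EBX and
CPUID.15H:EAX, and K is assumed to be known to the caller. Splitting the
ART value into quotient and remainder avoids overflowing 64 bits in the
intermediate product.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: compute
 *   TSC_Value = (ART_Value * numerator) / denominator + K
 * where numerator is CPUID.15H:EBX[31:0] and denominator is
 * CPUID.15H:EAX[31:0]. Returns false if either value is zero, in which
 * case the ratio is not usable and *tsc is left untouched.
 */
bool art_to_tsc(uint64_t art, uint32_t numerator, uint32_t denominator,
                uint64_t k, uint64_t *tsc)
{
    uint64_t quot, rem;

    if (numerator == 0 || denominator == 0)
        return false;

    /*
     * art = quot * denominator + rem, so
     * art * numerator / denominator
     *   = quot * numerator + (rem * numerator) / denominator.
     * rem and numerator are both below 2^32, so their product fits in
     * 64 bits. quot * numerator can still wrap if the true result does
     * not fit in 64 bits.
     */
    quot = art / denominator;
    rem = art % denominator;

    *tsc = quot * numerator + (rem * numerator) / denominator + k;
    return true;
}
```

The guest evaluates this with values it believes are constant, which is why
a changed offset or ratio on the host side shows up as an error in K.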
> After all, host suspend/resume is kind of like
> migrating from the pre-suspend host to the post-resume host. Maybe it
> could even share code.
Hopefully ... host suspend/resume is driven by the kernel and migration
is driven by userspace, which might complicate sharing.