Re: [PATCH 1/5] x86/kvm: On KVM re-enable (e.g. after suspend), update clocks

From: Andy Lutomirski
Date: Thu Mar 17 2016 - 14:23:08 EST


On Mar 17, 2016 8:10 AM, "Radim Krcmar" <rkrcmar@xxxxxxxxxx> wrote:
>
> 2016-03-16 16:07-0700, Andy Lutomirski:
> > On Wed, Mar 16, 2016 at 3:59 PM, Radim Krcmar <rkrcmar@xxxxxxxxxx> wrote:
> >> 2016-03-16 15:15-0700, Andy Lutomirski:
> >>> FWIW, if you ever intend to support ART ("always running timer")
> >>> passthrough, this is going to be a giant clusterfsck. Good luck. I
> >>> haven't gotten a straight answer as to what hardware actually supports
> >>> that thing, so even testing isn't so easy.
> >>
> >> Hm, AR TSC would be best handled by doing nothing ... dropping the
> >> faking logic just became tempting.
>
> ART is different from what I initially thought, it's the underlying
> mechanism for invariant TSC and nothing more ... we already forbid
> migrations when the guest knows about invariant TSC, so we could do the
> same and let ART be virtualized. (Suspend has to be forbidden too.)

It's more than that -- it's a TSC-like clock that can be read by PCIe devices.

>
> > As it stands, ART is screwed if you adjust the VMCS's tsc offset. But
>
> Luckily, assigning real hardware can prevent migration or suspend, so we
> won't need to adjust the offset during runtime. TSC is a generally
> unmigratable device that just happens to live on the CPU.
>
> (It would have been better to hide TSC capability from the guest and only
> use rdtsc for kvmclock if the guest wanted fancy features.)
>

I think that, if KVM passes through an ART-supporting NIC, it might be
rather messy to try to avoid passing through TSC as well. But maybe a
pvclock-like structure could expose the ART-kvmclock offset and scale.
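
A minimal sketch of what such a structure could look like, modeled on the
mul/shift scaling that pvclock_vcpu_time_info uses for TSC. The ArtClockInfo
layout, its field names, and art_to_kvmclock_ns() are all hypothetical; no
such ABI exists today.

```python
from dataclasses import dataclass


@dataclass
class ArtClockInfo:
    """Hypothetical per-guest record, modeled on pvclock_vcpu_time_info."""
    art_timestamp: int      # ART reading when the host filled this in
    system_time_ns: int     # kvmclock nanoseconds at art_timestamp
    art_to_system_mul: int  # 32-bit fixed-point fraction (value / 2**32)
    art_shift: int          # power-of-two pre-shift applied to the ART delta


def art_to_kvmclock_ns(info: ArtClockInfo, art: int) -> int:
    """Convert a raw ART reading to kvmclock nanoseconds (assumed scheme)."""
    delta = art - info.art_timestamp
    # Same shape as pvclock's TSC scaling: shift first, then multiply by a
    # 32-bit fraction.
    if info.art_shift >= 0:
        delta <<= info.art_shift
    else:
        delta >>= -info.art_shift
    return info.system_time_ns + ((delta * info.art_to_system_mul) >> 32)
```

The host would have to rewrite the record whenever the offset or scale
changes (resume, migration), which is why a real version would also need
something like pvclock's version counter so the guest can detect a torn read.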

> > I think it's also screwed if you migrate to a machine with a different
> > ratio of guest TSC ticks to host ART ticks or a different offset,
> > because the host isn't going to do the rdmsr every time it tries to
> > access the ART, so passing it through might require a paravirt
> > mechanism no matter what.
>
> It's almost certain that the other host will have a different offset,
> which makes TSC unmigratable in software without even considering ART
> or frequencies. Well, KVM already emulates a different TSC frequency, so
> we could emulate ART without sinking much lower. :)
>
> > ISTM that, if KVM tries to keep the guest TSC monotonic across
> > migration, it should probably also keep it monotonic across host
> > suspend/resume.
>
> Yes, "pausing" TSC during suspend or migration is one way of improving
> the TSC estimate. If we want to emulate ART, then the estimate is
> noticeably lacking, because TSC and ART are defined by a simple
> equation (SDM 2015-12, 17.14.4 Invariant Time-Keeping):
>   TSC_Value = (ART_Value * CPUID.15H:EBX[31:0]) / CPUID.15H:EAX[31:0] + K
>
> where the guest thinks that CPUID and K are constant (between events
> that the guest knows of), so we should give the best estimate of how
> many TSC cycles have passed. (The best estimate is still lacking.)
>
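
To make the problem with the equation above concrete, here is a small
sketch. The function names, the EBX/EAX values, and the use of floor division
are illustrative assumptions, not taken from the SDM or from KVM. It relies
only on guest TSC = host TSC + VMCS TSC offset, with ART passed through
unchanged.

```python
def art_to_tsc(art: int, ebx: int, eax: int, k: int) -> int:
    """TSC_Value = (ART_Value * CPUID.15H:EBX) / CPUID.15H:EAX + K."""
    # Floor division is an assumption; the equation does not fix rounding.
    return art * ebx // eax + k


def solve_k(tsc: int, art: int, ebx: int, eax: int) -> int:
    """Recover K from one simultaneous (TSC, ART) sample."""
    return tsc - art * ebx // eax


def guest_k(host_k: int, tsc_offset: int) -> int:
    """K as the guest sees it when ART is passed through unscaled."""
    return host_k + tsc_offset


# Illustrative numbers only: ratio 216/2, host K = 500.
EBX, EAX, HOST_K = 216, 2, 500

host_tsc = art_to_tsc(1000, EBX, EAX, HOST_K)
k_before = guest_k(HOST_K, -100000)   # offset chosen before suspend
k_after = guest_k(HOST_K, -90000)     # offset rewritten on resume
```

With these numbers the guest's K moves from -99500 to -89500 when the offset
is rewritten, with no event the guest knows about, which is exactly the case
the equation does not allow for.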
> > After all, host suspend/resume is kind of like
> > migrating from the pre-suspend host to the post-resume host. Maybe it
> > could even share code.
>
> Hopefully ... host suspend/resume is driven by the kernel and migration is
> driven by userspace, which might complicate sharing.

Good point.

--Andy