Re: [PATCH v3 1/3] kvm: Introduce kvm_total_suspend_ns().

From: Sean Christopherson
Date: Wed Feb 05 2025 - 20:30:05 EST


On Wed, Feb 05, 2025, Suleiman Souhlal wrote:
> On Tue, Feb 4, 2025 at 4:58 PM Suleiman Souhlal <suleiman@xxxxxxxxxx> wrote:
> > > Given that s2idle and standby don't reset host TSC, I think the right way to
> > > handle this conundrum is to address the flaw that's noted in the "backwards TSC"
> > > logic, and adjust guest TSC to be fully up-to-date in the S3 (or lower) case.
> > >
> > > * ...................................... Unfortunately, we can't
> > > * bring the TSCs fully up to date with real time, as we aren't yet far
> > > * enough into CPU bringup that we know how much real time has actually
> > > * elapsed; our helper function, ktime_get_boottime_ns() will be using boot
> > > * variables that haven't been updated yet.
> > >
> > > I have no idea why commit 0dd6a6edb012 ("KVM: Dont mark TSC unstable due to S4
> > > suspend") hooked kvm_arch_enable_virtualization_cpu() instead of implementing a
> > > PM notifier, but I don't see anything that suggests it was deliberate, i.e. that
> > > KVm *needs* to effectively snapshot guest TSC when onlining CPUs.
> > >
> > > If that wart is fixed, then both kvmclock and TSC will account host suspend time,
> > > and KVM can safely account the suspend time into steal time regardless of which
> > > clock(s) the guest is using.
> >
> > I tried your suggestion of moving this to a PM notifier and I found
> > that it's possible for VCPUs to run after resume but before the PM
> > notifier has been called, because the resume notifiers get called
> > after tasks are unfrozen. Unfortunately that means that if we were to
> > do that, guest TSCs could go backwards.

Ugh. That explains why KVM hooks the CPU online path.

> > However, I think it should be possible to keep the existing backwards
> > guest TSC prevention code but also use a notifier that further adjusts
> > the guest TSCs to advance time on suspends where the TSC did go
> > backwards. This would make both s2idle and deep suspends behave the
> > same way.
>
> An alternative might be to block VCPUs from newly entering the guest
> between the pre and post suspend notifiers.
> Otherwise, some of the steal time accounting would have to be done in
> kvm_arch_enable_virtualization_cpu(), to make sure it gets applied on
> the first VCPU run, in case that happens before the resume notifier
> would have fired. But the comment there says we can't call
> ktime_get_boottime_ns() there, so maybe that's not possible.

I don't think the PM notifier approach is viable. It's simply too late. Without
a hook in CPU online, KVM can't even tell which VMs/vCPUs were running before the
suspend, i.e. which VMs need to be updated.

One idea would be to simply fast forward guest TSC to current time when the vCPU
is loaded after suspend+resume. E.g. this hack appears to work.

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index dcd0c12c308e..27b25f8ca413 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -4824,7 +4824,16 @@ void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)

/* Apply any externally detected TSC adjustments (due to suspend) */
if (unlikely(vcpu->arch.tsc_offset_adjustment)) {
- adjust_tsc_offset_host(vcpu, vcpu->arch.tsc_offset_adjustment);
+ u64 kernel_ns, tsc_now;
+
+ if (kvm_get_time_and_clockread(&kernel_ns, &tsc_now)) {
+ u64 l1_tsc = nsec_to_cycles(vcpu, vcpu->kvm->arch.kvmclock_offset + kernel_ns);
+ u64 offset = kvm_compute_l1_tsc_offset(vcpu, l1_tsc);
+
+ kvm_vcpu_write_tsc_offset(vcpu, offset);
+ } else {
+ adjust_tsc_offset_host(vcpu, vcpu->arch.tsc_offset_adjustment);
+ }
vcpu->arch.tsc_offset_adjustment = 0;
kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);
}