Re: [PATCH RFC 1/1] KVM: x86: add param to update master clock periodically

From: David Woodhouse
Date: Fri Oct 13 2023 - 14:21:58 EST


On Fri, 2023-10-13 at 11:07 -0700, Sean Christopherson wrote:
> On Wed, Oct 11, 2023, David Woodhouse wrote:
> > On Tue, 2023-10-10 at 17:20 -0700, Sean Christopherson wrote:
> > > On Wed, Oct 04, 2023, Dongli Zhang wrote:
> > > > > -static void kvm_gen_kvmclock_update(struct kvm_vcpu *v)
> > > > > -{
> > > > > -       struct kvm *kvm = v->kvm;
> > > > > -
> > > > > -       kvm_make_request(KVM_REQ_CLOCK_UPDATE, v);
> > > > > -       schedule_delayed_work(&kvm->arch.kvmclock_update_work,
> > > > > -                                       KVMCLOCK_UPDATE_DELAY);
> > > > > -}
> > > > > -
> > > > >  #define KVMCLOCK_SYNC_PERIOD (300 * HZ)
> > > >
> > > > While David mentioned "maximum delta", how about to turn above into a module
> > > > param with the default 300HZ.
> > > >
> > > > BTW, 300HZ should be enough for vCPU hotplug case, unless people prefer 1-hour
> > > > or 1-day.
> > >
> > > Hmm, I think I agree with David that it would be better if KVM can take care of
> > > the gory details and promise a certain level of accuracy.  I'm usually a fan of
> > > punting complexity to userspace, but requiring every userspace to figure out the
> > > ideal sync frequency on every platform is more than a bit unfriendly.  And it
> > > might not even be realistic unless userspace makes assumptions about how the kernel
> > > computes CLOCK_MONOTONIC_RAW from TSC cycles.
> > >
> >
> > I think perhaps I would rather save up my persuasiveness on the topic
> > of "let's not make things too awful for userspace to cope with" for the
> > live update/migration mess. I think I need to dust off that attempt at
> > fixing our 'how to migrate with clocks intact' documentation from
> > https://lore.kernel.org/kvm/13f256ad95de186e3b6bcfcc1f88da5d0ad0cb71.camel@xxxxxxxxxxxxx/
> > The changes we're discussing here obviously have an effect on migration
> > too.
> >
> > Where the host TSC is actually reliable, I would really prefer for the
> > kvmclock to just be a fixed function of the guest TSC and *not* to be
> > arbitrarily yanked back[1] to the host's CLOCK_MONOTONIC periodically.
>
> CLOCK_MONOTONIC_RAW!  Just wanted to clarify because if kvmclock were tied to the
> non-raw clock, then we'd have to somehow reconcile host NTP updates.

Sorry, yes. CLOCK_MONOTONIC_RAW. That was just a typo in the email.

Of course we'd never try to use CLOCK_MONOTONIC; that would be daft.
We'd never do that. Just like we'd never do something daft like using
CLOCK_REALTIME instead of CLOCK_TAI for the KVM_CLOCK_REALTIME pairing
in __get_kvmclock()... oh.

Shit.

> I generally support the idea, but I think it needs to be an opt-in from userspace.
> Essentially a "I pinky swear to give all vCPUs the same TSC frequency, to not
> suspend the host, and to not run software/firmware that writes IA32_TSC_ADJUST".
> AFAICT, there are too many edge cases and assumptions about userspace for KVM to
> safely couple kvmclock to guest TSC by default.

I think IA32_TSC_ADJUST is OK, isn't it? There is a "real" TSC value
and if vCPUs adjust themselves forward and backwards from that, it's
just handled as a delta.

And we solved 'give all vCPUs the same TSC frequency' by making that
KVM-wide.

Maybe suspending and resuming the host can be treated like live
migration, where you know the host TSC is different so you have to make
do with a delta based on CLOCK_TAI.

But while I'm picking on the edge cases and suggesting that we *can*
cope with some of them, I do agree with your suggestion that "let
kvmclock run by itself without being clamped back to
CLOCK_MONOTONIC_RAW" should be an opt *in* feature.

> > [1] Yes, I believe "back" does happen. I have test failures in my queue
> > to look at, where guests see the "Xen" clock going backwards.
>
> Yeah, I assume "back" can happen based purely on the weirdness of the pvclock math.
>
> What if we add a module param to disable KVM's TSC synchronization craziness
> entirely?  If we first clean up the periodic sync mess, then it seems like it'd
> be relatively straightforward to kill off all of the synchronization, including
> the synchronization of kvmclock to the host's TSC-based CLOCK_MONOTONIC_RAW.
>
> Not intended to be a functional patch...

Will stare harder at the actual patch when it isn't Friday night.

In the meantime, I do think a KVM cap that the VMM opts into is better
than a module param?

> ---
>  arch/x86/kvm/x86.c | 35 ++++++++++++++++++++++++++++++++---
>  1 file changed, 32 insertions(+), 3 deletions(-)
>
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 5b2104bdd99f..75fc6cbaef0d 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -157,6 +157,9 @@ module_param(min_timer_period_us, uint, S_IRUGO | S_IWUSR);
>  static bool __read_mostly kvmclock_periodic_sync = true;
>  module_param(kvmclock_periodic_sync, bool, S_IRUGO);
>  
> +static bool __read_mostly enable_tsc_sync = true;
> +module_param_named(tsc_synchronization, enable_tsc_sync, bool, 0444);
> +
>  /* tsc tolerance in parts per million - default to 1/2 of the NTP threshold */
>  static u32 __read_mostly tsc_tolerance_ppm = 250;
>  module_param(tsc_tolerance_ppm, uint, S_IRUGO | S_IWUSR);
> @@ -2722,6 +2725,12 @@ static void kvm_synchronize_tsc(struct kvm_vcpu *vcpu, u64 data)
>         bool matched = false;
>         bool synchronizing = false;
>  
> +       if (!enable_tsc_sync) {
> +               offset = kvm_compute_l1_tsc_offset(vcpu, data);
> +               kvm_vcpu_write_tsc_offset(vcpu, offset);
> +               return;
> +       }
> +
>         raw_spin_lock_irqsave(&kvm->arch.tsc_write_lock, flags);
>         offset = kvm_compute_l1_tsc_offset(vcpu, data);
>         ns = get_kvmclock_base_ns();
> @@ -2967,9 +2976,12 @@ static void pvclock_update_vm_gtod_copy(struct kvm *kvm)
>                                         &ka->master_kernel_ns,
>                                         &ka->master_cycle_now);
>  
> -       ka->use_master_clock = host_tsc_clocksource && vcpus_matched
> -                               && !ka->backwards_tsc_observed
> -                               && !ka->boot_vcpu_runs_old_kvmclock;
> +       WARN_ON_ONCE(!host_tsc_clocksource && !enable_tsc_sync);
> +
> +       ka->use_master_clock = host_tsc_clocksource &&
> +                              (vcpus_matched || !enable_tsc_sync) &&
> +                              !ka->backwards_tsc_observed &&
> +                              !ka->boot_vcpu_runs_old_kvmclock;
>  
>         if (ka->use_master_clock)
>                 atomic_set(&kvm_guest_has_master_clock, 1);
> @@ -3278,6 +3290,9 @@ static void kvmclock_sync_fn(struct work_struct *work)
>  
>  void kvm_adjust_pv_clock_users(struct kvm *kvm, bool add_user)
>  {
> +       if (!enable_tsc_sync)
> +               return;
> +
>         /*
>          * Doesn't need to be a spinlock, but can't be kvm->lock as this is
>          * called while holding a vCPU's mutex.
> @@ -5528,6 +5543,11 @@ static int kvm_arch_tsc_set_attr(struct kvm_vcpu *vcpu,
>                 if (get_user(offset, uaddr))
>                         break;
>  
> +               if (!enable_tsc_sync) {
> +                       kvm_vcpu_write_tsc_offset(vcpu, offset);
> +                       break;
> +               }
> +
>                 raw_spin_lock_irqsave(&kvm->arch.tsc_write_lock, flags);
>  
>                 matched = (vcpu->arch.virtual_tsc_khz &&
> @@ -12188,6 +12208,9 @@ int kvm_arch_hardware_enable(void)
>         if (ret != 0)
>                 return ret;
>  
> +       if (!enable_tsc_sync)
> +               return 0;
> +
>         local_tsc = rdtsc();
>         stable = !kvm_check_tsc_unstable();
>         list_for_each_entry(kvm, &vm_list, vm_list) {
> @@ -13670,6 +13693,12 @@ EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_vmgexit_msr_protocol_exit);
>  
>  static int __init kvm_x86_init(void)
>  {
> +       if (!boot_cpu_has(X86_FEATURE_CONSTANT_TSC))
> +               enable_tsc_sync = true;
> +
> +       if (!enable_tsc_sync)
> +               kvmclock_periodic_sync = false;
> +
>         kvm_mmu_x86_module_init();
>         mitigate_smt_rsb &= boot_cpu_has_bug(X86_BUG_SMT_RSB) && cpu_smt_possible();
>         return 0;
>
> base-commit: 7d2edad0beb2a6f07f6e6c2d477d5874f5417d6c
