Re: [PATCH v2] KVM: SVM: Delay restoration of host MSR_TSC_AUX until return to userspace
From: Sean Christopherson
Date: Thu Apr 22 2021 - 16:12:43 EST
+Tom
On Thu, Apr 22, 2021, Reiji Watanabe wrote:
> @@ -2893,12 +2882,15 @@ static int svm_set_msr(struct kvm_vcpu *vcpu,
> struct msr_data *msr)
> return 1;
>
> /*
> - * This is rare, so we update the MSR here instead of using
> - * direct_access_msrs. Doing that would require a rdmsr in
> - * svm_vcpu_put.
> + * TSC_AUX is usually changed only during boot and never read
> + * directly. Intercept TSC_AUX instead of exposing it to the
> + * guest via direct_acess_msrs, and switch it via user return.
> */
>
> 'direct_acess_msrs' should be 'direct_access_msrs'.
>
>
> svm->tsc_aux = data;
> - wrmsrl(MSR_TSC_AUX, svm->tsc_aux);
> +
> + preempt_disable();
> + kvm_set_user_return_msr(TSC_AUX_URET_SLOT, data, -1ull);
> + preempt_enable();
> break;
>
> One of the callers of svm_set_msr() is kvm_arch_vcpu_ioctl(KVM_SET_MSRS).
> Since calling kvm_set_user_return_msr() looks unnecessary for the ioctl
> case and makes extra things for the CPU to do when the CPU returns to
> userspace for the case, I'm wondering if it might be better to check
> svm->guest_state_loaded before calling kvm_set_user_return_msr() here.
Ugh. AMD support for MSR_TSC_AUX is a train wreck.
Looking at VMX again, the reason it calls kvm_set_user_return_msr()
unconditionally is so that it can check the result of the WRMSR and inject #GP
into the guest if the WRMSR failed.
At first blush, that would appear to be unnecessary since host support for
MSR_TSC_AUX was already confirmed. But on Intel, bits 63:32 of MSR_TSC_AUX are
actually "reserved", so in theory this code should not check guest_state_loaded,
but instead check the result and inject #GP as appropriate.
However, I put "reserved" in quotes because AMD CPUs apparently don't actually
do a reserved check. Sadly, the APM doesn't say _anything_ about those bits in
the context of MSR access, the RDTSCP entry simply states that RCX contains bits
31:0 of the MSR, zero extended. But even worse, the RDPID description implies
that it can consume all 64 bits of the MSR:
RDPID reads the value of TSC_AUX MSR used by the RDTSCP instruction into the
specified destination register. Normal operand size prefixes do not apply and
the update is either 32 bit or 64 bit based on the current mode.
Intel has identical behavior for RDTSCP and RDPID, but because Intel CPUs
actually prevent software from writing bits 63:32, they haven't shot themselves
in the proverbial foot.
All that said, the AMD behavior of dropping the bits across the board actually
plays into KVM's favor because we can leverage it to support cross-vendor
migration. I.e. explicitly drop svm->tsc_aux[63:32] on read (or clear on write),
that way userspace doesn't have to fudge the value itself when migrating an
adventurous guest from AMD to Intel.
Lastly, this flow is also missing a check on guest_cpuid_has().
All in all, I think we want this:
case MSR_TSC_AUX:
if (!boot_cpu_has(X86_FEATURE_RDTSCP))
return 1;
if (!msr_info->host_initiated &&
!guest_cpuid_has(vcpu, X86_FEATURE_RDTSCP))
return 1;
/*
* TSC_AUX is usually changed only during boot and never read
* directly. Intercept TSC_AUX instead of exposing it to the
* guest via direct_access_msrs, and switch it via user return.
*/
preempt_disable();
r = kvm_set_user_return_msr(TSC_AUX_URET_SLOT, data, -1ull);
preempt_enable();
if (r)
return 1;
/*
* Bits 63:32 are dropped by AMD CPUs, but are reserved on
* Intel CPUs. AMD's APM has incomplete and conflicting info
* on the architectural behavior; emulate current hardware as
* doing so ensures migrating from AMD to Intel won't explode.
*/
svm->tsc_aux = (u32)data;
break;