Re: VMX Preemption Timer appears to be buggy on SKX, CLX, and ICX
From: Yuan Yao
Date: Mon Jul 08 2024 - 01:56:17 EST
On Wed, Jul 03, 2024 at 01:14:09PM -0700, Reinette Chatre wrote:
> Hi Sean,
>
> On 6/28/24 5:39 PM, Sean Christopherson wrote:
> > Forking this off to try and avoid confusion...
> >
> > On Wed, Jun 12, 2024, Reinette Chatre wrote:
>
> ...
>
> > > +
> > > + freq = (tmict - tmcct) * tdcrs[i].divide_count * tsc_hz / (tsc1 - tsc0);
> > > + /* Check if measured frequency is within 1% of configured frequency. */
> > > + GUEST_ASSERT(freq < apic_hz * 101 / 100);
> > > + GUEST_ASSERT(freq > apic_hz * 99 / 100);
> > > + }
> >
> > This test fails on our SKX, CLX, and ICX systems due to what appears to be a CPU
> > bug. It looks like something APICv related is clobbering internal VMX timer state?
> > Or maybe there's a tearing or truncation issue?
>
> It has been a few days. Just a note to let you know that we are investigating this.
> On my side I have not yet been able to reproduce this issue. I tested
> kvm-x86-next-2024.06.28 on an ICX and an CLX system by running 100 iterations of
> apic_bus_clock_test and they all passed. Since I have lack of experience here there are
> some Intel virtualization experts helping out with this investigation and I hope that
> they will be some insights from the analysis and testing that you already provided.
>
> Reinette
>
I can reproduce this on my side, even with apicv disabled by
'insmod $kernel_path/arch/x86/kvm/kvm-intel.ko enable_apicv=N'.
@Sean I think we're observing same issue, please see the details
below:
apic_bus_clock_test can't reproduce this may because the
preemption timer value calculation relies on apic bus
frequency/TMICT/Divide count/host TSC frequency and ratio
between preemption timer and host TSC frequency, too many
factors to generate the 'magic' value there. So I changed
KVM and added a small KVM kselftest tool to set the
preemption timer value directly from guest, this makes the
reproducing easily. The changes are attached at end of this
comment.
The trace I captured below came form host with 1.7GHz TSC,
the VM_EXIT_SAVE_VMX_PREEMPTION_TIMER is enabled to get the
cpu saved vmcs.VMX_PREEMPTION_TIMER_VALUE after VMEXIT. I
set the vmcs.VMX_PREEMPTION_TIMER_VALUE to 0x880042ad which
is the 'magic' number on this 1.7Ghz TSC machine:
preempt_test 20677.199521: kvm:kvm_vmx_debug: kvm_vmx_debug: 3, a0:0x77fd5554 a1:0x880042ad a2:0x880042ad a3:0x20462e98d9b9
a0: The previous vmcs.VMX_PREEMPTION_TIMER_VALUE value
saved by CPU when VMEXIT.
a1: The new preemption timer value wrote to
vmcs.VMX_PREEMPTION_TIMER_VALUE.
a2: The value read back from
vmcs.VMX_PREEMPTION_TIMER_VALUE, for double confirmation.
a3: The host tsc at the time point, debug only.
preempt_test 20677.199579: kvm:kvm_exit: reason PREEMPTION_TIMER rip 0x40274d info 0 0 intr 0
preempt_test 20677.199579: kvm:kvm_vmx_debug: kvm_vmx_debug: 2, a0:0x34 a1:0x0 a2:0x87fea9b0 a3:0x20462e9a5749
a0: The VMEXIT reason, 0x34 is preemption timer VMEXIT.
a1: The read back vmcs.VMX_PREEMPTION_TIMER_VALUE value, here 0.
a2: The next preemption timer value should be written to
vmcs, calculates from the (target tsc - current tsc) >>
7. Now the preemption timer vmexit happend only after
~58 microseconds elapsed, it should happen after
~171.79 seconds but not such soon, the issue is
reproduced.
Another more easy way to observe this symptom w/o care the
'magic' preemption timer vlaue is use the maximum preemption
timer value 0xffffffff, below log w/ 0xffffffff is captured
from same machine:
preempt_test 20530.456589: kvm:kvm_vmx_debug: kvm_vmx_debug: 3, a0:0x77fd5551 a1:0xffffffff a2:0xffffffff a3:0x200c1971ca5d
a0: The previous vmcs.VMX_PREEMPTION_TIMER_VALUE value
saved by CPU when VMEXIT.
a1: The new preemption timer value wrote to
vmcs.VMX_PREEMPTION_TIMER_VALUE.
a2: The read back value from
vmcs.VMX_PREEMPTION_TIMER_VALUE, double confirmation.
a3: the host tsc at the time point, debug only.
preempt_test 20530.456690: kvm:kvm_exit: reason VMCALL rip 0x4131a0 info 0 0 intr 0
The preempt_test checks preemption timer state every
100us, this VMEXIT is expected behavior.
preempt_test 20530.456691: kvm:kvm_entry: vcpu 0, rip 0x4131a3
preempt_test 20530.456691: kvm:kvm_vmx_debug: kvm_vmx_debug: 3, a0:0x77ff82cc a1:0xfffe900b a2:0xfffe900b a3:0x200c19746d45
a0: The previous vmcs.VMX_PREEMPTION_TIMER_VALUE value
saved by CPU when VMEXIT. The difference value
shouldn't be such huge number 0xffffffff - 0x77ff82cc
= 0x88007D33 when just ~100us elapsed from previous
VMENTRY, the issue is reproduced.
Use 0x88000000 as preemption timer value to verify this
preempt_test tool, the preemption timer VMEXIT happend after
~171.45 seconds which is expected behavior on the host
1.7Ghz TSC system:
The preemption timer VMEXIT should happen after:
2281701376 × 128 / 1700000000 = 171.79 seconds.
Attached my changes in KVM and kselftest tool for
reproducing here, based on:
https://github.com/kvm-x86/linux.git
tag:kvm-x86-next-2024.06.28
Patch 01: