Re: [PATCH v5 untested] kvm: better MWAIT emulation for guests

From: Radim KrÄmÃÅ
Date: Thu Mar 16 2017 - 13:39:10 EST


2017-03-16 19:14+0200, Michael S. Tsirkin:
> On Thu, Mar 16, 2017 at 12:54:50PM -0400, Gabriel L. Somlo wrote:
> > On Thu, Mar 16, 2017 at 12:52:32PM -0400, Gabriel L. Somlo wrote:
> > > On Thu, Mar 16, 2017 at 06:45:02PM +0200, Michael S. Tsirkin wrote:
> > > > On Thu, Mar 16, 2017 at 12:16:13PM -0400, Gabriel L. Somlo wrote:
> > > > > On Thu, Mar 16, 2017 at 04:35:18PM +0100, Radim KrÄmÃÅ wrote:
> > > > > > 2017-03-16 10:58-0400, Gabriel L. Somlo:
> > > > > > > On Thu, Mar 16, 2017 at 04:04:12PM +0200, Michael S. Tsirkin wrote:
> > > > > > > > On Thu, Mar 16, 2017 at 09:24:27AM -0400, Gabriel L. Somlo wrote:
> > > > > > > > > After studying your patch a bit more carefully (sorry, it's crazy
> > > > > > > > > around here right now :) ) I realized you're simply trying to
> > > > > > > > > (selectively) decide when to exit L1 and emulate as NOP vs. when to
> > > > > > > > > just allow L1 to execute MONITOR & MWAIT natively.
> > > > > > > > >
> > > > > > > > > Is that right ? Because if so, the issues I saw on my MacPro1,1 are
> > > > > > > > > weird and inexplicable, given that allowing L>=1 to run MONITOR/MWAIT
> > > > > > > > > natively was one of the options Alex Graf and Rene Rebe used back in
> > > > > > > > > the very early days of OS X on QEMU, at the time I got involved with
> > > > > > > > > that project. Here's part of an out of tree patch against 3.4 which did
> > > > > > > > > just that, and worked as far as I remember on *any* MWAIT capable
> > > > > > > > > intel chip I had access to back in 2010:
> > > > > > > > >
> > > > > > > > > ##############################################################################
> > > > > > > > > # 99-mwait.patch.kvm-kmod (Rene Rebe <rene@xxxxxxxxxxxx>) 2010-04-27
> > > > > > > > > ##############################################################################
> > > > > > > > > diff -pNarU5 linux-3.4/arch/x86/kvm/cpuid.c linux-3.4-mac/arch/x86/kvm/cpuid.c
> > > > > > > > > --- linux-3.4/arch/x86/kvm/cpuid.c 2012-05-20 18:29:13.000000000 -0400
> > > > > > > > > +++ linux-3.4-mac/arch/x86/kvm/cpuid.c 2012-10-09 11:42:59.921215750 -0400
> > > > > > > > > @@ -222,11 +222,11 @@ static int do_cpuid_ent(struct kvm_cpuid
> > > > > > > > > f_nx | 0 /* Reserved */ | F(MMXEXT) | F(MMX) |
> > > > > > > > > F(FXSR) | F(FXSR_OPT) | f_gbpages | f_rdtscp |
> > > > > > > > > 0 /* Reserved */ | f_lm | F(3DNOWEXT) | F(3DNOW);
> > > > > > > > > /* cpuid 1.ecx */
> > > > > > > > > const u32 kvm_supported_word4_x86_features =
> > > > > > > > > - F(XMM3) | F(PCLMULQDQ) | 0 /* DTES64, MONITOR */ |
> > > > > > > > > + F(XMM3) | F(PCLMULQDQ) | F(MWAIT) /* DTES64, MONITOR */ |
> > > > > > > > > 0 /* DS-CPL, VMX, SMX, EST */ |
> > > > > > > > > 0 /* TM2 */ | F(SSSE3) | 0 /* CNXT-ID */ | 0 /* Reserved */ |
> > > > > > > > > F(FMA) | F(CX16) | 0 /* xTPR Update, PDCM */ |
> > > > > > > > > 0 /* Reserved, DCA */ | F(XMM4_1) |
> > > > > > > > > F(XMM4_2) | F(X2APIC) | F(MOVBE) | F(POPCNT) |
> > > > > > > > > diff -pNarU5 linux-3.4/arch/x86/kvm/svm.c linux-3.4-mac/arch/x86/kvm/svm.c
> > > > > > > > > --- linux-3.4/arch/x86/kvm/svm.c 2012-05-20 18:29:13.000000000 -0400
> > > > > > > > > +++ linux-3.4-mac/arch/x86/kvm/svm.c 2012-10-09 11:44:41.598997481 -0400
> > > > > > > > > @@ -1102,12 +1102,10 @@ static void init_vmcb(struct vcpu_svm *s
> > > > > > > > > set_intercept(svm, INTERCEPT_VMSAVE);
> > > > > > > > > set_intercept(svm, INTERCEPT_STGI);
> > > > > > > > > set_intercept(svm, INTERCEPT_CLGI);
> > > > > > > > > set_intercept(svm, INTERCEPT_SKINIT);
> > > > > > > > > set_intercept(svm, INTERCEPT_WBINVD);
> > > > > > > > > - set_intercept(svm, INTERCEPT_MONITOR);
> > > > > > > > > - set_intercept(svm, INTERCEPT_MWAIT);
> > > > > > > > > set_intercept(svm, INTERCEPT_XSETBV);
> > > > > > > > >
> > > > > > > > > control->iopm_base_pa = iopm_base;
> > > > > > > > > control->msrpm_base_pa = __pa(svm->msrpm);
> > > > > > > > > control->int_ctl = V_INTR_MASKING_MASK;
> > > > > > > > > diff -pNarU5 linux-3.4/arch/x86/kvm/vmx.c linux-3.4-mac/arch/x86/kvm/vmx.c
> > > > > > > > > --- linux-3.4/arch/x86/kvm/vmx.c 2012-05-20 18:29:13.000000000 -0400
> > > > > > > > > +++ linux-3.4-mac/arch/x86/kvm/vmx.c 2012-10-09 11:42:59.925215977 -0400
> > > > > > > > > @@ -1938,11 +1938,11 @@ static __init void nested_vmx_setup_ctls
> > > > > > > > > nested_vmx_procbased_ctls_low, nested_vmx_procbased_ctls_high);
> > > > > > > > > nested_vmx_procbased_ctls_low = 0;
> > > > > > > > > nested_vmx_procbased_ctls_high &=
> > > > > > > > > CPU_BASED_VIRTUAL_INTR_PENDING | CPU_BASED_USE_TSC_OFFSETING |
> > > > > > > > > CPU_BASED_HLT_EXITING | CPU_BASED_INVLPG_EXITING |
> > > > > > > > > - CPU_BASED_MWAIT_EXITING | CPU_BASED_CR3_LOAD_EXITING |
> > > > > > > > > + CPU_BASED_CR3_LOAD_EXITING |
> > > > > > > > > CPU_BASED_CR3_STORE_EXITING |
> > > > > > > > > #ifdef CONFIG_X86_64
> > > > > > > > > CPU_BASED_CR8_LOAD_EXITING | CPU_BASED_CR8_STORE_EXITING |
> > > > > > > > > #endif
> > > > > > > > > CPU_BASED_MOV_DR_EXITING | CPU_BASED_UNCOND_IO_EXITING |
> > > > > > > > > @@ -2404,12 +2404,10 @@ static __init int setup_vmcs_config(stru
> > > > > > > > > CPU_BASED_CR3_LOAD_EXITING |
> > > > > > > > > CPU_BASED_CR3_STORE_EXITING |
> > > > > > > > > CPU_BASED_USE_IO_BITMAPS |
> > > > > > > > > CPU_BASED_MOV_DR_EXITING |
> > > > > > > > > CPU_BASED_USE_TSC_OFFSETING |
> > > > > > > > > - CPU_BASED_MWAIT_EXITING |
> > > > > > > > > - CPU_BASED_MONITOR_EXITING |
> > > > > > > > > CPU_BASED_INVLPG_EXITING |
> > > > > > > > > CPU_BASED_RDPMC_EXITING;
> > > > > > > > >
> > > > > > > > > opt = CPU_BASED_TPR_SHADOW |
> > > > > > > > > CPU_BASED_USE_MSR_BITMAPS |
> > > > > > > > >
> > > > > > > > > If all you're trying to do is (selectively) revert to this behavior,
> > > > > > > > > that "shouldn't" mess it up for the MacPro either, so I'm thoroughly
> > > > > > > > > confused at this point :)
> > > > > > > >
> > > > > > > > Yes. Me too. Want to try that other patch and see what happens?
> > > > > > >
> > > > > > > You mean the old 3.4 patch against current KVM ? I'll try to do that,
> > > > > > > might take me a while :)
> > > > > >
> > > > > > Michael's patch already did most of that, you just need to add
> > > > > >
> > > > > > diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
> > > > > > index efde6cc50875..b12f07d4ce17 100644
> > > > > > --- a/arch/x86/kvm/cpuid.c
> > > > > > +++ b/arch/x86/kvm/cpuid.c
> > > > > > @@ -348,7 +348,7 @@ static inline int __do_cpuid_ent(struct kvm_cpuid_entry2 *entry, u32 function,
> > > > > > const u32 kvm_cpuid_1_ecx_x86_features =
> > > > > > /* NOTE: MONITOR (and MWAIT) are emulated as NOP,
> > > > > > * but *not* advertised to guests via CPUID ! */
> > > > > > - F(XMM3) | F(PCLMULQDQ) | 0 /* DTES64, MONITOR */ |
> > > > > > + F(XMM3) | F(PCLMULQDQ) | F(MWAIT) /* DTES64, MONITOR */ |
> > > > > > 0 /* DS-CPL, VMX, SMX, EST */ |
> > > > > > 0 /* TM2 */ | F(SSSE3) | 0 /* CNXT-ID */ | 0 /* Reserved */ |
> > > > > > F(FMA) | F(CX16) | 0 /* xTPR Update, PDCM */ |
> > > > > >
> > > > > > Note: this will never be upstream, because mwait isn't what we want by
> > > > > > default. :)
> > > > >
> > > > > But since OS X doesn't check CPUID and simply runs MONITOR & MWAIT
> > > > > assuming they're present, the above one-liner would make no
> > > > > difference. If everything else in the old patch I quoted is identical
> > > > > to what Michael does, then I don't know -- maybe the MacPro1,1 has
> > > > > really broken L>=1 MWAIT, and it only ever worked with vmexit and
> > > > > emulation on the host side.
> > > >
> > > >
> > > > I think I have an idea. It is probably one of the monitor bugs
> > > > on this host.
> > > >
> > > > X86_BUG_CLFLUSH_MONITOR or X86_BUG_MONITOR.
> > > >
> > > > If you tell guest you have a CPU that does not need it
> > > > but host does need it, then mwait will not work.
> > > >
> > > > if (c->x86 == 6 && boot_cpu_has(X86_FEATURE_CLFLUSH) &&
> > > > (c->x86_model == 29 || c->x86_model == 46 || c->x86_model == 47))
> > > > set_cpu_bug(c, X86_BUG_CLFLUSH_MONITOR);
> > > >
> > > >
> > > > if (c->x86 == 6 && boot_cpu_has(X86_FEATURE_MWAIT) &&
> > > > ((c->x86_model == INTEL_FAM6_ATOM_GOLDMONT)))
> > > > set_cpu_bug(c, X86_BUG_MONITOR);
> > > >
> > > > what did you say your host model is?
> > >
> > > # dmidecode -t1
> > > # dmidecode 2.12
> > > SMBIOS 2.4 present.
> > >
> > > Handle 0x0021, DMI type 1, 27 bytes
> > > System Information
> > > Manufacturer: Apple Computer, Inc.
> > > Product Name: MacPro1,1
> > > Version: 1.0
> > > Serial Number: G87030UEUPZ
> > > UUID: 9CFE245E-D0C8-BD45-A79F-54EA5FBD3D97
> > > Wake-up Type: Power Switch
> > > SKU Number: System SKU#
> > > Family: MacPro
> >
> > And, probably more usefully:
> >
> > [... ommitting 0,1,2 ...]
> >
> > processor : 3
> > vendor_id : GenuineIntel
> > cpu family : 6
> > model : 15
> > model name : Intel(R) Xeon(R) CPU 5150 @ 2.66GHz
> > stepping : 6
> > microcode : 0xd2
> > cpu MHz : 2659.998
> > cache size : 4096 KB
> > physical id : 3
> > siblings : 2
> > core id : 0
> > cpu cores : 2
> > apicid : 6
> > initial apicid : 6
> > fpu : yes
> > fpu_exception : yes
> > cpuid level : 10
> > wp : yes
> > flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm constant_tsc arch_perfmon pebs bts rep_good nopl cpuid aperfmperf pni dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm dca lahf_lm tpr_shadow dtherm
> > bugs :
> > bogomips : 5320.05
> > clflush size : 64
> > cache_alignment : 64
> > address sizes : 36 bits physical, 48 bits virtual
> > power management:
>
>
> Hmm nope not one of these.
> Need to poke at errata some more.

Intel lists two bugs with MWAIT on that Model:
http://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/xeon-5100-spec-update.pdf

AG36. Split Locked Stores May not Trigger the Monitoring Hardware
AG106. A REP STOS/MOVS to a MONITOR/MWAIT Address Range May Prevent Triggering of the Monitoring Hardware

The latter can be dimissed as it should have been hit on bare metal as
well. The former looks pretty unlikely as well, but maybe the guest
maps w/b what bare metal would map differently?