Re: [PATCH v5 untested] kvm: better MWAIT emulation for guests

From: Gabriel L. Somlo
Date: Thu Mar 16 2017 - 10:59:37 EST


On Thu, Mar 16, 2017 at 04:04:12PM +0200, Michael S. Tsirkin wrote:
> On Thu, Mar 16, 2017 at 09:24:27AM -0400, Gabriel L. Somlo wrote:
> > After studying your patch a bit more carefully (sorry, it's crazy
> > around here right now :) ) I realized you're simply trying to
> > (selectively) decide when to exit L1 and emulate as NOP vs. when to
> > just allow L1 to execute MONITOR & MWAIT natively.
> >
> > Is that right ? Because if so, the issues I saw on my MacPro1,1 are
> > weird and inexplicable, given that allowing L>=1 to run MONITOR/MWAIT
> > natively was one of the options Alex Graf and Rene Rebe used back in
> > the very early days of OS X on QEMU, at the time I got involved with
> > that project. Here's part of an out of tree patch against 3.4 which did
> > just that, and worked as far as I remember on *any* MWAIT capable
> > intel chip I had access to back in 2010:
> >
> > ##############################################################################
> > # 99-mwait.patch.kvm-kmod (Rene Rebe <rene@xxxxxxxxxxxx>) 2010-04-27
> > ##############################################################################
> > diff -pNarU5 linux-3.4/arch/x86/kvm/cpuid.c linux-3.4-mac/arch/x86/kvm/cpuid.c
> > --- linux-3.4/arch/x86/kvm/cpuid.c 2012-05-20 18:29:13.000000000 -0400
> > +++ linux-3.4-mac/arch/x86/kvm/cpuid.c 2012-10-09 11:42:59.921215750 -0400
> > @@ -222,11 +222,11 @@ static int do_cpuid_ent(struct kvm_cpuid
> > f_nx | 0 /* Reserved */ | F(MMXEXT) | F(MMX) |
> > F(FXSR) | F(FXSR_OPT) | f_gbpages | f_rdtscp |
> > 0 /* Reserved */ | f_lm | F(3DNOWEXT) | F(3DNOW);
> > /* cpuid 1.ecx */
> > const u32 kvm_supported_word4_x86_features =
> > - F(XMM3) | F(PCLMULQDQ) | 0 /* DTES64, MONITOR */ |
> > + F(XMM3) | F(PCLMULQDQ) | F(MWAIT) /* DTES64, MONITOR */ |
> > 0 /* DS-CPL, VMX, SMX, EST */ |
> > 0 /* TM2 */ | F(SSSE3) | 0 /* CNXT-ID */ | 0 /* Reserved */ |
> > F(FMA) | F(CX16) | 0 /* xTPR Update, PDCM */ |
> > 0 /* Reserved, DCA */ | F(XMM4_1) |
> > F(XMM4_2) | F(X2APIC) | F(MOVBE) | F(POPCNT) |
> > diff -pNarU5 linux-3.4/arch/x86/kvm/svm.c linux-3.4-mac/arch/x86/kvm/svm.c
> > --- linux-3.4/arch/x86/kvm/svm.c 2012-05-20 18:29:13.000000000 -0400
> > +++ linux-3.4-mac/arch/x86/kvm/svm.c 2012-10-09 11:44:41.598997481 -0400
> > @@ -1102,12 +1102,10 @@ static void init_vmcb(struct vcpu_svm *s
> > set_intercept(svm, INTERCEPT_VMSAVE);
> > set_intercept(svm, INTERCEPT_STGI);
> > set_intercept(svm, INTERCEPT_CLGI);
> > set_intercept(svm, INTERCEPT_SKINIT);
> > set_intercept(svm, INTERCEPT_WBINVD);
> > - set_intercept(svm, INTERCEPT_MONITOR);
> > - set_intercept(svm, INTERCEPT_MWAIT);
> > set_intercept(svm, INTERCEPT_XSETBV);
> >
> > control->iopm_base_pa = iopm_base;
> > control->msrpm_base_pa = __pa(svm->msrpm);
> > control->int_ctl = V_INTR_MASKING_MASK;
> > diff -pNarU5 linux-3.4/arch/x86/kvm/vmx.c linux-3.4-mac/arch/x86/kvm/vmx.c
> > --- linux-3.4/arch/x86/kvm/vmx.c 2012-05-20 18:29:13.000000000 -0400
> > +++ linux-3.4-mac/arch/x86/kvm/vmx.c 2012-10-09 11:42:59.925215977 -0400
> > @@ -1938,11 +1938,11 @@ static __init void nested_vmx_setup_ctls
> > nested_vmx_procbased_ctls_low, nested_vmx_procbased_ctls_high);
> > nested_vmx_procbased_ctls_low = 0;
> > nested_vmx_procbased_ctls_high &=
> > CPU_BASED_VIRTUAL_INTR_PENDING | CPU_BASED_USE_TSC_OFFSETING |
> > CPU_BASED_HLT_EXITING | CPU_BASED_INVLPG_EXITING |
> > - CPU_BASED_MWAIT_EXITING | CPU_BASED_CR3_LOAD_EXITING |
> > + CPU_BASED_CR3_LOAD_EXITING |
> > CPU_BASED_CR3_STORE_EXITING |
> > #ifdef CONFIG_X86_64
> > CPU_BASED_CR8_LOAD_EXITING | CPU_BASED_CR8_STORE_EXITING |
> > #endif
> > CPU_BASED_MOV_DR_EXITING | CPU_BASED_UNCOND_IO_EXITING |
> > @@ -2404,12 +2404,10 @@ static __init int setup_vmcs_config(stru
> > CPU_BASED_CR3_LOAD_EXITING |
> > CPU_BASED_CR3_STORE_EXITING |
> > CPU_BASED_USE_IO_BITMAPS |
> > CPU_BASED_MOV_DR_EXITING |
> > CPU_BASED_USE_TSC_OFFSETING |
> > - CPU_BASED_MWAIT_EXITING |
> > - CPU_BASED_MONITOR_EXITING |
> > CPU_BASED_INVLPG_EXITING |
> > CPU_BASED_RDPMC_EXITING;
> >
> > opt = CPU_BASED_TPR_SHADOW |
> > CPU_BASED_USE_MSR_BITMAPS |
> >
> > If all you're trying to do is (selectively) revert to this behavior,
> > that "shouldn't" mess it up for the MacPro either, so I'm thoroughly
> > confused at this point :)
>
> Yes. Me too. Want to try that other patch and see what happens?

You mean the old 3.4 patch against current KVM ? I'll try to do that,
might take me a while :)

> > Back in 2010, running MWAIT in L>=1 behaved 100% exactly like a NOP,
> > didn't power down the physical CPU, just immediately moved on to the
> > next instruction. As such, there was no power saving and no
> > opportunity to yield to another L0 thread either, unlike with NOP
> > emulation at L0.
> >
> > Did that change on newer Intel chips (i.e., is guest-mode MWAIT now
> > doing something smarter than just acting as a guest-mode NOP) ?
> >
> > Thanks,
> > --Gabriel
>
> Interesting. What it seems to say is this:
>
> MWAIT. Behavior of the MWAIT instruction (which always causes an invalid-
> opcode exceptionâ#UDâif CPL > 0) is determined by the setting of the âMWAIT
> exitingâ VM-execution control:
> â If the âMWAIT exitingâ VM-execution control is 1, MWAIT causes a VM exit
> (see Section 22.1.3).
> â If the âMWAIT exitingâ VM-execution control is 0, MWAIT operates normally if
> any of the following is true: (1) the âinterrupt-window exitingâ VM-execution
> control is 0; (2) ECX[0] is 0; or (3) RFLAGS.IF = 1.
> â If the âMWAIT exitingâ VM-execution control is 0, the âinterrupt-window
> exitingâ VM-execution control is 1, ECX[0] = 1, and RFLAGS.IF = 0, MWAIT
> does not cause the processor to enter an implementation-dependent
> optimized state; instead, control passes to the instruction following the
> MWAIT instruction.
>
>
> And since interrupt-window exiting is 0 most of the time for KVM,
> I would expect MWAIT to behave normally.

The intel manual said the same thing back in 2010 as well. However,
regardless of how any flags were set, interrupt-window exiting or not,
"normal" L1 MWAIT behavior was that it woke up immediately regardless.
Remember, never going to sleep is still correct ("normal" ?) behavior
per the ISA definition of MWAIT :)

Also, when I tested your patch on the macbook air (where it worked),
not only was the host reporting 400% CPU for qemu (which is to be
expected), but the thermal fan/cooling thing also shifted up into high
gear, which means the physical CPU got hot, which it shouldn't have if
the guest-mode MWAIT actually did put the host CPU into low power.

So at least on this 4-year-old core-I7 chip, the story Intel tells in
its manual still doesn't check out. I could never get any
clarification on what they mean by "operates normally" :)