Re: [PATCH v4] kvm: better MWAIT emulation for guests

From: Gabriel L. Somlo
Date: Wed Mar 15 2017 - 19:30:53 EST


On Wed, Mar 15, 2017 at 09:46:18PM +0100, Radim KrÄmÃÅ wrote:
> 2017-03-15 16:21-0400, Gabriel L. Somlo:
> > On Wed, Mar 15, 2017 at 09:13:49PM +0100, Radim KrÄmÃÅ wrote:
> >> 2017-03-15 21:28+0200, Michael S. Tsirkin:
> >> > Guests running Mac OS 5, 6, and 7 (Leopard through Lion) have a problem:
> >> > unless explicitly provided with kernel command line argument
> >> > "idlehalt=0" they'd implicitly assume MONITOR and MWAIT availability,
> >> > without checking CPUID.
> >> >
> >> > We currently emulate that as a NOP but on VMX we can do better: let
> >> > guest stop the CPU until timer, IPI or memory change. CPU will be busy
> >> > but that isn't any worse than a NOP emulation.
> >> >
> >> > Note that mwait within guests is not the same as on real hardware
> >> > because halt causes an exit while mwait doesn't. For this reason it
> >> > might not be a good idea to use the regular MWAIT flag in CPUID to
> >> > signal this capability. Add a flag in the hypervisor leaf instead.
> >> >
> >> > Additionally, we add a capability for QEMU - e.g. if it knows there's an
> >> > isolated CPU dedicated for the VCPU it can set the standard MWAIT flag
> >> > to improve guest behaviour.
> >> >
> >> > Reported-by: "Gabriel L. Somlo" <gsomlo@xxxxxxxxx>
> >> > Signed-off-by: Michael S. Tsirkin <mst@xxxxxxxxxx>
> >> > ---
> >> >
> >> > Note: SVM bits are untested at this point. Seems pretty
> >> > obvious though.
> >> >
> >> > changes from v3:
> >> > - don't enable capability if cli+mwait blocks interrupts
> >> > - doc typo fixes (drop drop ppc doc)
> >> >
> >> > changes from v2:
> >> > - add a capability to allow host userspace to detect new kernels
> >> > - more documentation to clarify the semantics of the feature flag
> >> > and why it's useful
> >> > - svm support as suggested by Radim
> >> >
> >> > changes from v1:
> >> > - typo fix resulting in rest of leaf flags being overwritten
> >> > Reported by: Wanpeng Li <kernellwp@xxxxxxxxx>
> >> > - updated commit log with data about guests helped by this feature
> >> > - better document differences between mwait and halt for guests
> >> >
> >> > diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
> >> > @@ -212,4 +213,28 @@ static inline u64 nsec_to_cycles(struct kvm_vcpu *vcpu, u64 nsec)
> >> > __rem; \
> >> > })
> >> >
> >> > +static bool kvm_mwait_in_guest(void)
> >> > +{
> >> > + unsigned int eax, ebx, ecx;
> >> > +
> >> > + if (!cpu_has(&boot_cpu_data, X86_FEATURE_MWAIT))
> >> > + return -ENODEV;
> >> > +
> >> > + if (boot_cpu_data.x86_vendor != X86_VENDOR_INTEL)
> >> > + return -ENODEV;
> >> > +
> >> > + /*
> >> > + * Intel CPUs without CPUID5_ECX_INTERRUPT_BREAK are problematic as
> >> > + * they would allow guest to stop the CPU completely by disabling
> >> > + * interrupts then invoking MWAIT.
> >> > + */
> >> > + if (boot_cpu_data.cpuid_level < CPUID_MWAIT_LEAF)
> >> > + return -ENODEV;
> >> > +
> >> > + cpuid(CPUID_MWAIT_LEAF, &eax, &ebx, &ecx, &mwait_substates);
> >> > +
> >> > + if (!(ecx & CPUID5_ECX_INTERRUPT_BREAK))
> >> > + return -ENODEV;
> >>
> >> The guest is still able to set ecx=0 with MWAIT, which should be the
> >> same as not having the CPUID flag, so I'm wondering how this check
> >> prevents anything harmful ... is it really a cpu "feature"?
> >>
> >> If we somehow report ecx bit 1 in CPUID[5], then the guest might try to
> >> set ecx bit 0 for MWAIT, which will cause #GP(0) and could explain the
> >> hang that Gabriel is hitting.
> >>
> >> Gabriel,
> >>
> >> - do you see VM exits on the "hung" VCPU?
> >
> > how would I go about looking ?
>
> Probably the simplest would be to try install trace-cmd and do
>
> trace-cmd record -e kvm:kvm_entry
>
> for a while, then killing it with ^C and calling
>
> trace-cmd report
>
> it will have lines like:
>
> CPU-3729 [021] 5046.222480: kvm_entry: vcpu 7
>
> the first column is task id, the last one is VCPU id, so you'd verify that all
> VCPUs eventually enter.
> You could either look at the taks id of VCPUs that are running 100% of the time
> or just pipe through `grep -o 'vcpu.*' | sort -u` and hope that all of them are
> running, so you'd see nice list of them all. :)

so, with Michael's patch v5 (same behavior as earlier btw), I get:

# trace-cmd report | head -10000 | grep -o 'vcpu.*' | sort -u
vcpu 0
vcpu 1
vcpu 2
vcpu 3

Which means they're all there.

Also, 'top' shows 400% CPU for qemu-system-x86_64, which means I have
4 threads idle-looping at 100% each, which is consistent with a guest
using an MWAIT-based idle loop.

> If you don't see some VCPUs there, they might have been halted, which would
> show on
>
> virsh qemu-monitor-command rhel7 --hmp info cpus
>
> If there are running, but never entering VCPUs, then running
>
> trace-cmd record -e kvm:\*
>
> to capture all kvm events could tell us why it is stuck.
>
> >> - what is your CPU model?
> >
> > $ cat /proc/cpuinfo
> > model name : Intel(R) Xeon(R) CPU 5150 @ 2.66GHz
> >
> > (this is 2x dual-core Xeon on a Mac Pro 1,1 -- all I had to spare for
> > testing, to avoid having to reboot my primary desktop :)
>
> Oh, ancient when it comes to VMX. :)

I also have a somewhat more recent (only 4 year old) Macbook Air, I'll
try to redo the whole set of tests on that one as well.

>
> >> - what do you get after running this C program on host and guest?
> >
> > eax=0x000040 ebx=0x000040 ecx=0x000003 edx=0x000020
>
> Your CPU has the CPUID5_ECX_INTERRUPT_BREAK, thanks.
> The guest should see QEMU's default, which is
>
> eax=0x000000 ebx=0x000000 ecx=0x000003 edx=0x000000

That's right, I finally managed to transfer a linux guest image over
to the box, and indeed that's what it produces.

More tomorrow once I have a chance to test on the Macbook Air (need to
compile kvm git master on that one first, then transfer the os x qcow
image, etc...

Thanks,
--Gabriel