Re: [PATCH v2 0/4] KVM: optimize the kvm_vcpu_on_spin
From: David Hildenbrand
Date: Tue Aug 08 2017 - 07:51:07 EST
On 08.08.2017 13:49, Longpeng (Mike) wrote:
>
>
> On 2017/8/8 19:25, David Hildenbrand wrote:
>
>> On 08.08.2017 06:05, Longpeng(Mike) wrote:
>>> This is a simple optimization for kvm_vcpu_on_spin, the
>>> main idea is described in patch-1's commit msg.
>>>
>>> I did some tests based on the RFC version; the results show
>>> that it improves the performance slightly.
>>>
>>> == Geekbench-3.4.1 ==
>>> VM1: 8U,4G, vcpu(0...7) is 1:1 pinned to pcpu(6...11,18,19)
>>> running Geekbench-3.4.1 *10 runs*
>>> VM2/VM3/VM4: configuration is the same as VM1
>>> stress each vcpu usage (seen by top in guest) to 40%
>>>
>>> The comparison of each testcase's score:
>>> (higher is better)
>>>                   before     after    improve
>>> Integer
>>>   single          1176.7    1179.0       0.2%
>>>   multi           3459.5    3426.5      -0.9%
>>> Float
>>>   single          1150.5    1150.9       0.0%
>>>   multi           3364.5    3391.9       0.8%
>>> Memory(stream)
>>>   single          1768.7    1773.1       0.2%
>>>   multi           2511.6    2557.2       1.8%
>>> Overall
>>>   single          1284.2    1286.2       0.2%
>>>   multi           3231.4    3238.4       0.2%
>>>
>>>
>>> == kernbench-0.42 ==
>>> VM1: 8U,12G, vcpu(0...7) is 1:1 pinned to pcpu(6...11,18,19)
>>> running "kernbench -n 10"
>>> VM2/VM3/VM4: configuration is the same as VM1
>>> stress each vcpu usage (seen by top in guest) to 40%
>>>
>>> The comparison of 'Elapsed Time':
>>> (lower is better)
>>>                before     after    improve
>>> load -j4       12.762    12.751       0.1%
>>> load -j32       9.743     8.955       8.1%
>>> load -j         9.688     9.229       4.7%
>>>
>>>
>>> Physical Machine:
>>> Architecture: x86_64
>>> CPU op-mode(s): 32-bit, 64-bit
>>> Byte Order: Little Endian
>>> CPU(s): 24
>>> On-line CPU(s) list: 0-23
>>> Thread(s) per core: 2
>>> Core(s) per socket: 6
>>> Socket(s): 2
>>> NUMA node(s): 2
>>> Vendor ID: GenuineIntel
>>> CPU family: 6
>>> Model: 45
>>> Model name: Intel(R) Xeon(R) CPU E5-2640 0 @ 2.50GHz
>>> Stepping: 7
>>> CPU MHz: 2799.902
>>> BogoMIPS: 5004.67
>>> Virtualization: VT-x
>>> L1d cache: 32K
>>> L1i cache: 32K
>>> L2 cache: 256K
>>> L3 cache: 15360K
>>> NUMA node0 CPU(s): 0-5,12-17
>>> NUMA node1 CPU(s): 6-11,18-23
>>>
>>> ---
>>> Changes since V1:
>>> - split the implementation of s390 & arm. [David]
>>> - refactor the impls according to the suggestion. [Paolo]
>>>
>>> Changes since RFC:
>>> - only cache the result for X86. [David & Cornelia & Paolo]
>>> - add performance numbers. [David]
>>> - impls arm/s390. [Christoffer & David]
>>> - refactor the impls. [me]
>>>
>>> ---
>>> Longpeng(Mike) (4):
>>> KVM: add spinlock optimization framework
>>> KVM: X86: implement the logic for spinlock optimization
>>> KVM: s390: implements the kvm_arch_vcpu_in_kernel()
>>> KVM: arm: implements the kvm_arch_vcpu_in_kernel()
>>>
>>> arch/arm/kvm/handle_exit.c | 2 +-
>>> arch/arm64/kvm/handle_exit.c | 2 +-
>>> arch/mips/kvm/mips.c | 6 ++++++
>>> arch/powerpc/kvm/powerpc.c | 6 ++++++
>>> arch/s390/kvm/diag.c | 2 +-
>>> arch/s390/kvm/kvm-s390.c | 6 ++++++
>>> arch/x86/include/asm/kvm_host.h | 5 +++++
>>> arch/x86/kvm/hyperv.c | 2 +-
>>> arch/x86/kvm/svm.c | 10 +++++++++-
>>> arch/x86/kvm/vmx.c | 16 +++++++++++++++-
>>> arch/x86/kvm/x86.c | 11 +++++++++++
>>> include/linux/kvm_host.h | 3 ++-
>>> virt/kvm/arm/arm.c | 5 +++++
>>> virt/kvm/kvm_main.c | 4 +++-
>>> 14 files changed, 72 insertions(+), 8 deletions(-)
>>>
>>
>> I am curious, is there any architecture that allows triggering
>> kvm_vcpu_on_spin(vcpu) while _not_ in kernel mode?
>
>
> IIUC, X86/SVM will trap to the host on the PAUSE insn regardless of
> whether the vcpu is in kernel mode or user mode.
>
>>
>> I would have guessed that user space should never be allowed to make
>> CPU-wide decisions (giving up the CPU to the hypervisor).
>>
>> E.g. s390x diag can only be executed from kernel space. VMX PAUSE is
>> only valid from kernel space.
>
>
> X86/VMX has "PAUSE exiting" and "PAUSE-loop exiting" (PLE). KVM only uses
> PLE, which is, as you said, "only valid from kernel space".
>
> However, "PAUSE exiting" can cause a user-mode vcpu to exit too.
Thanks Longpeng and Christoffer!
--
Thanks,
David