Re: [PATCH v2 0/4] KVM: optimize the kvm_vcpu_on_spin

From: David Hildenbrand
Date: Tue Aug 08 2017 - 07:25:44 EST


On 08.08.2017 06:05, Longpeng(Mike) wrote:
> This is a simple optimization for kvm_vcpu_on_spin, the
> main idea is described in patch-1's commit msg.
>
> I ran some tests based on the RFC version; the results show
> that it improves performance slightly.
>
> == Geekbench-3.4.1 ==
> VM1: 8U,4G, vcpu(0...7) is 1:1 pinned to pcpu(6...11,18,19)
>      running Geekbench-3.4.1 *10 runs*
> VM2/VM3/VM4: configuration is the same as VM1
>      stress each vcpu usage (as seen by top in guest) to 40%
>
> Comparison of each testcase's score:
> (higher is better)
>                     before     after      improve
> Integer
>     single          1176.7     1179.0      0.2%
>     multi           3459.5     3426.5     -0.9%
> Float
>     single          1150.5     1150.9      0.0%
>     multi           3364.5     3391.9      0.8%
> Memory(stream)
>     single          1768.7     1773.1      0.2%
>     multi           2511.6     2557.2      1.8%
> Overall
>     single          1284.2     1286.2      0.2%
>     multi           3231.4     3238.4      0.2%
>
>
> == kernbench-0.42 ==
> VM1: 8U,12G, vcpu(0...7) is 1:1 pinned to pcpu(6...11,18,19)
>      running "kernbench -n 10"
> VM2/VM3/VM4: configuration is the same as VM1
>      stress each vcpu usage (as seen by top in guest) to 40%
>
> Comparison of 'Elapsed Time':
> (lower is better)
>                     before     after      improve
> load -j4            12.762     12.751      0.1%
> load -j32            9.743      8.955      8.1%
> load -j              9.688      9.229      4.7%
>
>
> Physical Machine:
> Architecture: x86_64
> CPU op-mode(s): 32-bit, 64-bit
> Byte Order: Little Endian
> CPU(s): 24
> On-line CPU(s) list: 0-23
> Thread(s) per core: 2
> Core(s) per socket: 6
> Socket(s): 2
> NUMA node(s): 2
> Vendor ID: GenuineIntel
> CPU family: 6
> Model: 45
> Model name: Intel(R) Xeon(R) CPU E5-2640 0 @ 2.50GHz
> Stepping: 7
> CPU MHz: 2799.902
> BogoMIPS: 5004.67
> Virtualization: VT-x
> L1d cache: 32K
> L1i cache: 32K
> L2 cache: 256K
> L3 cache: 15360K
> NUMA node0 CPU(s): 0-5,12-17
> NUMA node1 CPU(s): 6-11,18-23
>
> ---
> Changes since V1:
> - split the implementation of s390 & arm. [David]
> - refactor the impls according to the suggestion. [Paolo]
>
> Changes since RFC:
> - only cache result for X86. [David & Cornelia & Paolo]
> - add performance numbers. [David]
> - impls arm/s390. [Christoffer & David]
> - refactor the impls. [me]
>
> ---
> Longpeng(Mike) (4):
> KVM: add spinlock optimization framework
> KVM: X86: implement the logic for spinlock optimization
> KVM: s390: implements the kvm_arch_vcpu_in_kernel()
> KVM: arm: implements the kvm_arch_vcpu_in_kernel()
>
> arch/arm/kvm/handle_exit.c | 2 +-
> arch/arm64/kvm/handle_exit.c | 2 +-
> arch/mips/kvm/mips.c | 6 ++++++
> arch/powerpc/kvm/powerpc.c | 6 ++++++
> arch/s390/kvm/diag.c | 2 +-
> arch/s390/kvm/kvm-s390.c | 6 ++++++
> arch/x86/include/asm/kvm_host.h | 5 +++++
> arch/x86/kvm/hyperv.c | 2 +-
> arch/x86/kvm/svm.c | 10 +++++++++-
> arch/x86/kvm/vmx.c | 16 +++++++++++++++-
> arch/x86/kvm/x86.c | 11 +++++++++++
> include/linux/kvm_host.h | 3 ++-
> virt/kvm/arm/arm.c | 5 +++++
> virt/kvm/kvm_main.c | 4 +++-
> 14 files changed, 72 insertions(+), 8 deletions(-)
>

I am curious: is there any architecture that can trigger
kvm_vcpu_on_spin(vcpu) while _not_ in kernel mode?

I would have guessed that guest user space should never be allowed to
make CPU-wide decisions (giving up the CPU to the hypervisor).

E.g. s390x diag can only be executed from kernel space. VMX PAUSE is
only valid from kernel space.

In other words, do we need a parameter to kvm_vcpu_on_spin(vcpu) at all,
or is "me_in_kernel" basically always true?
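
To make the question concrete, here is a toy C model of how I read the
heuristic. It is not the kernel code: struct toy_vcpu,
toy_pick_yield_target() and both flags are made-up names, and I am
assuming the rule is "a spinner that was in guest kernel mode skips
preempted vCPUs that were in guest user mode".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for a vCPU; NOT the kernel's struct kvm_vcpu. */
struct toy_vcpu {
    bool preempted;  /* descheduled by the host while runnable */
    bool in_kernel;  /* was in guest kernel mode when descheduled */
};

/*
 * Return the index of the first vCPU worth yielding to, or -1 if none.
 * 'me' is the spinning vCPU; 'me_in_kernel' says whether it was
 * spinning in guest kernel mode (assumed semantics of the parameter).
 */
int toy_pick_yield_target(const struct toy_vcpu *vcpus, size_t n,
                          size_t me, bool me_in_kernel)
{
    assert(vcpus != NULL);

    for (size_t i = 0; i < n; i++) {
        /* Never yield to ourselves or to a vCPU that is still running. */
        if (i == me || !vcpus[i].preempted)
            continue;
        /*
         * Assumed heuristic: a kernel-mode spinner is waiting on a
         * kernel lock, so a vCPU preempted in user mode cannot be
         * the lock holder and is skipped.
         */
        if (me_in_kernel && !vcpus[i].in_kernel)
            continue;
        return (int)i;
    }
    return -1;
}
```

If every caller can only ever pass me_in_kernel == true, the second
check reduces to "skip user-mode vCPUs" and the parameter carries no
information.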

--

Thanks,

David