On Mon, Sep 24, 2012 at 02:36:05PM +0200, Peter Zijlstra wrote:
On Mon, 2012-09-24 at 17:22 +0530, Raghavendra K T wrote:
On 09/24/2012 05:04 PM, Peter Zijlstra wrote:
On Fri, 2012-09-21 at 17:29 +0530, Raghavendra K T wrote:

In some special scenarios like #vcpu <= #pcpu, the PLE handler may
prove very costly, because there is no need to iterate over vcpus
and do unsuccessful yield_to() calls, burning CPU.

What's the costly thing? The vm-exit, the yield (which should be a no-op
if it's the only task there), or something else entirely?

Both vmexit and yield_to() actually, because an unsuccessful yield_to()
is overall costly in the PLE handler.
This is because when we have large guests, say 32/16 vcpus, with one
vcpu holding a lock and the rest of the vcpus waiting for it, then when
they PLE-exit, each vcpu tries to iterate over the rest of the vcpu list
in the VM and do a directed yield, unsuccessfully (O(n^2) tries).
This results in a fairly high amount of CPU burning and double runqueue
lock contention.
(If they had kept spinning instead, lock progress would probably have
been faster.)
As Avi and Chegu Vinod felt, it would be better to avoid the vmexit
itself, but that seems a little complex to achieve currently.
OK, so the vmexit stays and we need to improve yield_to().

Can't we do this check sooner as well, since it only requires per-cpu data?
If we do it way back in kvm_vcpu_on_spin(), then we avoid get_pid_task()
and a bunch of read barriers from kvm_for_each_vcpu(). Also, moving the
test into KVM code would allow us to do other KVM things as a result of
the check, in order to avoid some vmexits. It looks like we should be
able to avoid some without much complexity by just adding a per-VM
ple_window variable, and then, when we hit the nr_running == 1
condition, also doing:

    vmcs_write32(PLE_WINDOW, (kvm->ple_window += PLE_WINDOW_BUMP));

Reset the window to the default value when we successfully yield (and
maybe we should limit the number of bumps).
Base = 3.6.0-rc5 + ple handler optimization patches
A = Base + checking rq_running in vcpu_on_spin() patch
B = Base + checking rq->nr_running in sched/core
C = Base - PLE
% improvements w.r.t. Base
---+------------+------------+------------+
   |     A      |     B      |     C      |
---+------------+------------+------------+
1x |  206.37603 |  139.70410 |  210.19323 |
---+------------+------------+------------+