Re: [PATCH RFC 1/2] kvm: Handle undercommitted guest case in PLEhandler

From: Andrew Theurer
Date: Fri Oct 19 2012 - 09:31:33 EST

Next message: Ulf Hansson: "[PATCH 2/2] ASoC: Ux500: Fixup compile error"
Previous message: Ulf Hansson: "[PATCH 1/2] ASoC: Ux500: Fixup complile errors due to merge"
In reply to: Raghavendra K T: "Re: [PATCH RFC 1/2] kvm: Handle undercommitted guest case in PLEhandler"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On Fri, 2012-10-19 at 14:00 +0530, Raghavendra K T wrote:
> On 10/15/2012 08:04 PM, Andrew Theurer wrote:
> > On Mon, 2012-10-15 at 17:40 +0530, Raghavendra K T wrote:
> >> On 10/11/2012 01:06 AM, Andrew Theurer wrote:
> >>> On Wed, 2012-10-10 at 23:24 +0530, Raghavendra K T wrote:
> >>>> On 10/10/2012 08:29 AM, Andrew Theurer wrote:
> >>>>> On Wed, 2012-10-10 at 00:21 +0530, Raghavendra K T wrote:
> >>>>>> * Avi Kivity <avi@xxxxxxxxxx> [2012-10-04 17:00:28]:
> >>>>>>
> >>>>>>> On 10/04/2012 03:07 PM, Peter Zijlstra wrote:
> >>>>>>>> On Thu, 2012-10-04 at 14:41 +0200, Avi Kivity wrote:
> >>>>>>>>>
> >> [...]
> >>>>> A big concern I have (if this is 1x overcommit) for ebizzy is that it
> >>>>> has just terrible scalability to begin with. I do not think we should
> >>>>> try to optimize such a bad workload.
> >>>>>
> >>>>
> >>>> I think my way of running dbench has some flaw, so I went to ebizzy.
> >>>> Could you let me know how you generally run dbench?
> >>>
> >>> I mount a tmpfs and then specify that mount for dbench to run on. This
> >>> eliminates all IO. I use a 300 second run time and number of threads is
> >>> equal to number of vcpus. All of the VMs of course need to have a
> >>> synchronized start.
> >>>
> >>> I would also make sure you are using a recent kernel for dbench, where
> >>> the dcache scalability is much improved. Without any lock-holder
> >>> preemption, the time in spin_lock should be very low:
> >>>
> >>>
> >>>> 21.54% 78016 dbench [kernel.kallsyms] [k] copy_user_generic_unrolled
> >>>> 3.51% 12723 dbench libc-2.12.so [.] __strchr_sse42
> >>>> 2.81% 10176 dbench dbench [.] child_run
> >>>> 2.54% 9203 dbench [kernel.kallsyms] [k] _raw_spin_lock
> >>>> 2.33% 8423 dbench dbench [.] next_token
> >>>> 2.02% 7335 dbench [kernel.kallsyms] [k] __d_lookup_rcu
> >>>> 1.89% 6850 dbench libc-2.12.so [.] __strstr_sse42
> >>>> 1.53% 5537 dbench libc-2.12.so [.] __memset_sse2
> >>>> 1.47% 5337 dbench [kernel.kallsyms] [k] link_path_walk
> >>>> 1.40% 5084 dbench [kernel.kallsyms] [k] kmem_cache_alloc
> >>>> 1.38% 5009 dbench libc-2.12.so [.] memmove
> >>>> 1.24% 4496 dbench libc-2.12.so [.] vfprintf
> >>>> 1.15% 4169 dbench [kernel.kallsyms] [k] __audit_syscall_exit
> >>>
> >>
> >> Hi Andrew,
> >> I ran the test with dbench with tmpfs. I do not see any improvements in
> >> dbench for 16k ple window.
> >>
> >> So it seems apart from ebizzy no workload benefited by that. and I
> >> agree that, it may not be good to optimize for ebizzy.
> >> I shall drop changing to 16k default window and continue with other
> >> original patch series. Need to experiment with latest kernel.
> >
> > Thanks for running this again. I do believe there are some workloads,
> > when run at 1x overcommit, would benefit from a larger ple_window [with
> > he current ple handling code], but I do not also want to potentially
> > degrade >1x with a larger window. I do, however, think there may be a
> > another option. I have not fully worked this out, but I think I am on
> > to something.
> >
> > I decided to revert back to just a yield() instead of a yield_to(). My
> > motivation was that yield_to() [for large VMs] is like a dog chasing its
> > tail, round and round we go.... Just yield(), in particular a yield()
> > which results in yielding to something -other- than the current VM's
> > vcpus, helps synchronize the execution of sibling vcpus by deferring
> > them until the lock holder vcpu is running again. The more we can do to
> > get all vcpus running at the same time, the far less we deal with the
> > preemption problem. The other benefit is that yield() is far, far lower
> > overhead than yield_to()
> >
> > This does assume that vcpus from same VM do not share same runqueues.
> > Yielding to a sibling vcpu with yield() is not productive for larger VMs
> > in the same way that yield_to() is not. My recent results include
> > restricting vcpu placement so that sibling vcpus do not get to run on
> > the same runqueue. I do believe we could implement a initial placement
> > and load balance policy to strive for this restriction (making it purely
> > optional, but I bet could also help user apps which use spin locks).
> >
> > For 1x VMs which still vm_exit due to PLE, I believe we could probably
> > just leave the ple_window alone, as long as we mostly use yield()
> > instead of yield_to(). The problem with the unneeded exits in this case
> > has been the overhead in routines leading up to yield_to() and the
> > yield_to() itself. If we use yield() most of the time, this overhead
> > will go away.
> >
> > Here is a comparison of yield_to() and yield():
> >
> > dbench with 20-way VMs, 8 of them on 80-way host:
> >
> > no PLE 426 +/- 11.03%
> > no PLE w/ gangsched 32001 +/- .37%
> > PLE with yield() 29207 +/- .28%
> > PLE with yield_to() 8175 +/- 1.37%
> >
> > Yield() is far and way better than yield_to() here and almost approaches
> > gang sched result. Here is a link for the perf sched map bitmap:
> >
> > https://docs.google.com/open?id=0B6tfUNlZ-14weXBfVnFFZGw1akU
> >
> > The thrashing is way down and sibling vcpus tend to run together,
> > approximating the behavior of the gang scheduling without needing to
> > actually implement gang scheduling.
> >
> > I did test a smaller VM:
> >
> > dbench with 10-way VMs, 16 of them on 80-way host:
> >
> > no PLE 6248 +/- 7.69%
> > no PLE w/ gangsched 28379 +/- .07%
> > PLE with yield() 29196 +/- 1.62%
> > PLE with yield_to() 32217 +/- 1.76%
>
> Hi Andrew, Results are encouraging.
>
> >
> > There is some degrade from yield() to yield_to() here, but nearly as
> > large as the uplift we see on the larger VMs. Regardless, I have an
> > idea to fix that: Instead of using yield() all the time, we could use
> > yield_to(), but limit the rate per vcpu to something like 1 per jiffie.
> > All other exits use yield(). That rate of yield_to() should be more
> > than enough for the smaller VMs, and the result should be hopefully just
> > the same as the current code. I have not coded this up yet, but it's my
> > next step.
>
> I personally feel rate limiting yield_to may be a good idea.
>
> >
> > I am also hopeful the limitation of yield_to() will also make the 1x
> > issue just go away as well (even with 4096 ple_window). The vast
> > majority of exits will result in yield() which should be harmless.
> >
> > Keep in mind this did require ensuring sibling vcpus do not share host
> > runqueues -I do think that can be possible given some optional scheduler
> > tweaks.
>
> I think this is a concern (placing). Having rate limit alone may
> suffice.May be tuning that taking into overcommitted/non-overcommitted
> scenario also into account would be better.
>
> Okay below is my V2 implementation I am experimenting
>
> 1) check source -and- target runq to decide on exiting the ple handler
> 2)
>
> vcpu_on_spin()
> {
>
> .....
> if yield_to_same_vm did not succeed and we are overcommitted
> yield()
>
> }
>
> I think combining your thoughts and (2) complicates scenario a bit.
> anyways let me see how my experiment goes. I will also check how yield
> performs without any pinning.

FWIW, below is the latest with throttling yield_to(). Results were
slightly higher than the above with just yield(). Although I can see an
improvement when not forcing non-shared runqueues among same-VM vcpus
(via binding), it's not as effective. I am more concerned this problem
requires a multi-part solution, and reducing lock-holder preemption is
the other part (by not allowing sequential execution of same-VM vcpus by
virtue of sharing runqueues).

signed-off-by: Andrew Theurer <habanero@xxxxxxxxxxxxxxxxxx>

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index b70b48b..595ef3e 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -153,6 +153,7 @@ struct kvm_vcpu {
int mode;
unsigned long requests;
unsigned long guest_debug;
+ unsigned long last_yield_to;

struct mutex mutex;
struct kvm_run *run;
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index d617f69..1f0ec36 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -49,6 +49,7 @@
#include <linux/slab.h>
#include <linux/sort.h>
#include <linux/bsearch.h>
+#include <linux/jiffies.h>

#include <asm/processor.h>
#include <asm/io.h>
@@ -228,6 +229,7 @@ int kvm_vcpu_init(struct kvm_vcpu *vcpu, struct kvm *kvm, unsigned id)
vcpu->pid = NULL;
init_waitqueue_head(&vcpu->wq);
kvm_async_pf_vcpu_init(vcpu);
+ vcpu->last_yield_to = 0;

page = alloc_page(GFP_KERNEL | __GFP_ZERO);
if (!page) {
@@ -1590,27 +1592,39 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me)
int i;

/*
+ * A yield_to() can be quite expensive, so we try to limit its use.
+ * to one per jiffie. Subsequent exits just yield the current vcpu
+ * in hopes of having it run again when the lock holding vcpu
+ * gets to run again. This is most effective when vcpus from
+ * the same VM do not share a runqueue
+ */
+ if (me->last_yield_to == jiffies) {
+ yield();
+ } else {
+ /*
* We boost the priority of a VCPU that is runnable but not
* currently running, because it got preempted by something
* else and called schedule in __vcpu_run. Hopefully that
* VCPU is holding the lock that we need and will release it.
* We approximate round-robin by starting at the last boosted VCPU.
*/
- for (pass = 0; pass < 2 && !yielded; pass++) {
- kvm_for_each_vcpu(i, vcpu, kvm) {
- if (!pass && i <= last_boosted_vcpu) {
- i = last_boosted_vcpu;
- continue;
- } else if (pass && i > last_boosted_vcpu)
- break;
- if (vcpu == me)
- continue;
- if (waitqueue_active(&vcpu->wq))
- continue;
- if (kvm_vcpu_yield_to(vcpu)) {
- kvm->last_boosted_vcpu = i;
- yielded = 1;
- break;
+ for (pass = 0; pass < 2 && !yielded; pass++) {
+ kvm_for_each_vcpu(i, vcpu, kvm) {
+ if (!pass && i <= last_boosted_vcpu) {
+ i = last_boosted_vcpu;
+ continue;
+ } else if (pass && i > last_boosted_vcpu)
+ break;
+ if (vcpu == me)
+ continue;
+ if (waitqueue_active(&vcpu->wq))
+ continue;
+ if (kvm_vcpu_yield_to(vcpu)) {
+ kvm->last_boosted_vcpu = i;
+ me->last_yield_to = jiffies;
+ yielded = 1;
+ break;
+ }
}
}
}

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Next message: Ulf Hansson: "[PATCH 2/2] ASoC: Ux500: Fixup compile error"
Previous message: Ulf Hansson: "[PATCH 1/2] ASoC: Ux500: Fixup complile errors due to merge"
In reply to: Raghavendra K T: "Re: [PATCH RFC 1/2] kvm: Handle undercommitted guest case in PLEhandler"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]