On Mon, 2012-10-15 at 17:40 +0530, Raghavendra K T wrote:
On 10/11/2012 01:06 AM, Andrew Theurer wrote:
On Wed, 2012-10-10 at 23:24 +0530, Raghavendra K T wrote:
[...]
On 10/10/2012 08:29 AM, Andrew Theurer wrote:
On Wed, 2012-10-10 at 00:21 +0530, Raghavendra K T wrote:
* Avi Kivity <avi@xxxxxxxxxx> [2012-10-04 17:00:28]:
On 10/04/2012 03:07 PM, Peter Zijlstra wrote:
On Thu, 2012-10-04 at 14:41 +0200, Avi Kivity wrote:
A big concern I have (if this is 1x overcommit) for ebizzy is that it
has just terrible scalability to begin with. I do not think we should
try to optimize such a bad workload.
I think my way of running dbench has some flaw, so I went to ebizzy.
Could you let me know how you generally run dbench?
I mount a tmpfs and then specify that mount for dbench to run on. This
eliminates all IO. I use a 300 second run time and number of threads is
equal to number of vcpus. All of the VMs of course need to have a
I would also make sure you are using a recent kernel for dbench, where
the dcache scalability is much improved. Without any lock-holder
preemption, the time in spin_lock should be very low:
21.54% 78016 dbench [kernel.kallsyms] [k] copy_user_generic_unrolled
3.51% 12723 dbench libc-2.12.so [.] __strchr_sse42
2.81% 10176 dbench dbench [.] child_run
2.54% 9203 dbench [kernel.kallsyms] [k] _raw_spin_lock
2.33% 8423 dbench dbench [.] next_token
2.02% 7335 dbench [kernel.kallsyms] [k] __d_lookup_rcu
1.89% 6850 dbench libc-2.12.so [.] __strstr_sse42
1.53% 5537 dbench libc-2.12.so [.] __memset_sse2
1.47% 5337 dbench [kernel.kallsyms] [k] link_path_walk
1.40% 5084 dbench [kernel.kallsyms] [k] kmem_cache_alloc
1.38% 5009 dbench libc-2.12.so [.] memmove
1.24% 4496 dbench libc-2.12.so [.] vfprintf
1.15% 4169 dbench [kernel.kallsyms] [k] __audit_syscall_exit
I ran dbench on tmpfs. I do not see any improvement in dbench with a
16k ple_window.
So it seems that, apart from ebizzy, no workload benefited from it, and I
agree that it may not be good to optimize for ebizzy.
I shall drop the change to a 16k default window and continue with the other
original patch series. I need to experiment with the latest kernel.
Thanks for running this again. I do believe there are some workloads
that, when run at 1x overcommit, would benefit from a larger ple_window [with
the current ple handling code], but I also do not want to potentially
degrade >1x with a larger window. I do, however, think there may be
another option. I have not fully worked this out, but I think I am on
I decided to revert back to just a yield() instead of a yield_to(). My
motivation was that yield_to() [for large VMs] is like a dog chasing its
tail, round and round we go.... Just yield(), in particular a yield()
which results in yielding to something -other- than the current VM's
vcpus, helps synchronize the execution of sibling vcpus by deferring
them until the lock holder vcpu is running again. The more we can do to
get all vcpus running at the same time, the less we have to deal with the
preemption problem. The other benefit is that yield() is far, far lower
overhead than yield_to().
This does assume that vcpus from the same VM do not share runqueues.
Yielding to a sibling vcpu with yield() is not productive for larger VMs
in the same way that yield_to() is not. My recent results include
restricting vcpu placement so that sibling vcpus do not get to run on
the same runqueue. I do believe we could implement an initial placement
and load balance policy to strive for this restriction (making it purely
optional, but I bet it could also help user apps which use spin locks).
For 1x VMs which still vm_exit due to PLE, I believe we could probably
just leave the ple_window alone, as long as we mostly use yield()
instead of yield_to(). The problem with the unneeded exits in this case
has been the overhead in routines leading up to yield_to() and the
yield_to() itself. If we use yield() most of the time, this overhead
will go away.
Here is a comparison of yield_to() and yield():
dbench with 20-way VMs, 8 of them on 80-way host:

no PLE                   426 +/- 11.03%
no PLE w/ gangsched    32001 +/-   .37%
PLE with yield()       29207 +/-   .28%
PLE with yield_to()     8175 +/-  1.37%
Yield() is far and away better than yield_to() here and almost approaches
the gang sched result. Here is a link for the perf sched map bitmap:
The thrashing is way down and sibling vcpus tend to run together,
approximating the behavior of the gang scheduling without needing to
actually implement gang scheduling.
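
To put numbers on "far and away better" and "almost approaches", here is a minimal C sketch that computes the two ratios from the 20-way figures quoted above. The function names are made up for illustration, and no throughput unit is assumed since the ratios are unitless.

```c
#include <stdio.h>

/* Ratio of two throughput figures; units cancel, so none are assumed. */
double throughput_ratio(double numerator, double denominator)
{
    return numerator / denominator;
}

/* Print the two comparisons for the 20-way VM run quoted above. */
void print_20way_comparison(void)
{
    const double gangsched     = 32001.0; /* no PLE w/ gangsched */
    const double with_yield    = 29207.0; /* PLE with yield()    */
    const double with_yield_to =  8175.0; /* PLE with yield_to() */

    printf("yield() vs yield_to(): %.2fx\n",
           throughput_ratio(with_yield, with_yield_to));
    printf("yield() vs gangsched:  %.1f%%\n",
           100.0 * throughput_ratio(with_yield, gangsched));
}
```

Calling print_20way_comparison() from a main() shows yield() at roughly 3.6 times the yield_to() result and at about 91% of the gang scheduling result.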
I did test a smaller VM:
dbench with 10-way VMs, 16 of them on 80-way host:

no PLE                  6248 +/-  7.69%
no PLE w/ gangsched    28379 +/-   .07%
PLE with yield()       29196 +/-  1.62%
PLE with yield_to()    32217 +/-  1.76%
There is some degradation going from yield_to() to yield() here, but it is
not nearly as large as the uplift we see on the larger VMs. Regardless, I
have an idea to fix that: instead of using yield() all the time, we could use
yield_to(), but limit the rate per vcpu to something like 1 per jiffy.
All other exits use yield(). That rate of yield_to() should be more
than enough for the smaller VMs, and the result should hopefully be just
the same as the current code. I have not coded this up yet, but it's my
I am also hopeful that limiting yield_to() will make the 1x
issue just go away as well (even with a 4096 ple_window). The vast
majority of exits will result in yield() which should be harmless.
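
The rate-limiting idea above (at most one yield_to() per vcpu per jiffy, with yield() for every other PLE exit) could look roughly like the following user-space C sketch. This is not kernel code and not the patch itself: struct vcpu_ple_state, enum ple_action and ple_exit_action() are hypothetical names invented for illustration, and the tick argument stands in for the jiffies counter.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-vcpu bookkeeping; not a field of the real struct kvm_vcpu. */
struct vcpu_ple_state {
    uint64_t last_yield_to_tick; /* tick of the most recent yield_to() */
    bool     has_yielded_to;     /* false until the first yield_to()   */
};

/* What the PLE exit handler would do for this exit. */
enum ple_action {
    PLE_YIELD,    /* plain yield(): cheap, not directed at a sibling */
    PLE_YIELD_TO  /* directed yield_to(): costly, so rate limited    */
};

/*
 * Decide how to handle one PLE exit at time now_tick (a jiffy-like tick).
 * The first exit seen in a given tick gets the directed yield; every
 * further exit in the same tick falls back to plain yield().
 */
enum ple_action ple_exit_action(struct vcpu_ple_state *s, uint64_t now_tick)
{
    if (!s->has_yielded_to || s->last_yield_to_tick != now_tick) {
        s->has_yielded_to = true;
        s->last_yield_to_tick = now_tick;
        return PLE_YIELD_TO;
    }
    return PLE_YIELD;
}
```

With a zero-initialized state, the first exit in a tick returns PLE_YIELD_TO and any further exits in that same tick return PLE_YIELD, so a vcpu that exits thousands of times per jiffy pays for the directed yield only once.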
Keep in mind this did require ensuring sibling vcpus do not share host
runqueues -I do think that can be possible given some optional scheduler