Re: [RFC][PATCH] Improving directed yield scalability for PLE handler
From: Avi Kivity
Date: Wed Sep 19 2012 - 09:40:24 EST
On 09/18/2012 06:03 AM, Andrew Theurer wrote:
> On Sun, 2012-09-16 at 11:55 +0300, Avi Kivity wrote:
>> On 09/14/2012 12:30 AM, Andrew Theurer wrote:
>>
>> > The concern I have is that even though we have gone through changes to
>> > help reduce the candidate vcpus we yield to, we still have a very poor
>> > idea of which vcpu really needs to run. The result is high cpu usage in
>> > the get_pid_task and still some contention in the double runqueue lock.
>> > To make this scalable, we either need to significantly reduce the
>> > occurrence of the lock-holder preemption, or do a much better job of
>> > knowing which vcpu needs to run (and not unnecessarily yielding to vcpus
>> > which do not need to run).
>> >
>> > On reducing the occurrence: The worst case for lock-holder preemption
>> > is having vcpus of same VM on the same runqueue. This guarantees the
>> > situation of 1 vcpu running while another [of the same VM] is not. To
>> > prove the point, I ran the same test, but with vcpus restricted to a
>> > range of host cpus, such that any single VM's vcpus can never be on the
>> > same runqueue. In this case, all 10 VMs' vcpu-0's are on host cpus 0-4,
>> > vcpu-1's are on host cpus 5-9, and so on. Here is the result:
>> >
>> > kvm_cpu_spin, and all
>> > yield_to changes, plus
>> > restricted vcpu placement: 8823 +/- 3.20% much, much better
>> >
>> > On picking a better vcpu to yield to: I really hesitate to rely on
>> > paravirt hint [telling us which vcpu is holding a lock], but I am not
>> > sure how else to reduce the candidate vcpus to yield to. I suspect we
>> > are yielding to way more vcpus than are preempted lock-holders, and that
>> > IMO is just work accomplishing nothing. Trying to think of a way to
>> > further reduce candidate vcpus....
>>
>> I wouldn't say that yielding to the "wrong" vcpu accomplishes nothing.
>> That other vcpu gets work done (unless it is in pause loop itself) and
>> the yielding vcpu gets put to sleep for a while, so it doesn't spend
>> cycles spinning. While we haven't fixed the problem at least the guest
>> is accomplishing work, and meanwhile the real lock holder may get
>> naturally scheduled and clear the lock.
>
> OK, yes, if the other thread gets useful work done, then it is not
> wasteful. I was thinking of the worst case scenario, where any other
> vcpu would likely spin as well, and the host side cpu-time for switching
> vcpu threads was not all that productive. Well, I suppose it does help
> eliminate potential lock holding vcpus; it just seems to be not that
> efficient or fast enough.
If we have N-1 vcpus spinwaiting on 1 vcpu, with N:1 overcommit, then
yes, we must iterate over N-1 vcpus until we find Mr. Right. Eventually
its not-a-timeslice will expire and we go through this again. If
N*t_yield is comparable to the timeslice, we start losing efficiency.
Because of lock contention, t_yield can scale with the number of host
cpus. So in this worst case, we get quadratic behaviour.
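
As a rough illustration of that worst case (a back-of-the-envelope model,
not measured data): assume t_yield grows linearly with the number of host
cpus, and the spinner has to try all N-1 other vcpus before it hits the
lock holder. The function names and the per_cpu_cost_us constant are made
up for this sketch.

```c
#include <assert.h>

/*
 * Assumed model: one yield_to attempt costs time proportional to the
 * number of host cpus (lock contention). per_cpu_cost_us is a
 * hypothetical constant, not a measured value.
 */
static double t_yield_us(int host_cpus, double per_cpu_cost_us)
{
	assert(host_cpus >= 1);
	return per_cpu_cost_us * host_cpus;
}

/*
 * Worst case: the spinner tries every one of the other N-1 vcpus
 * before it reaches the lock holder.
 */
static double scan_cost_us(int nr_vcpus, int host_cpus, double per_cpu_cost_us)
{
	assert(nr_vcpus >= 1);
	return (nr_vcpus - 1) * t_yield_us(host_cpus, per_cpu_cost_us);
}
```

With per_cpu_cost_us held fixed, doubling both the vcpu count and the host
cpu count roughly quadruples scan_cost_us(). Once that cost gets close to
the timeslice, the scan eats the time it was supposed to save.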
One way out is to increase the not-a-timeslice. Can we get spinning
vcpus to do that for running vcpus, if they cannot find a
runnable-but-not-running vcpu?
That's not guaranteed to help: if we boost a running vcpu too much, it
will skew how vcpu runtime is distributed even after the lock is released.
>
>> The main problem with this theory is that the experiments don't seem to
>> bear it out.
>
> Granted, my test case is quite brutal. It's nothing but over-committed
> VMs which always have some spin lock activity. However, we really
> should try to fix the worst case scenario.
Yes. And other guests may not scale as well as Linux, so they may show
this behaviour more often.
>
>> So maybe one of the assumptions is wrong - the yielding
>> vcpu gets scheduled early. That could be the case if the two vcpus are
>> on different runqueues - you could be changing the relative priority of
>> vcpus on the target runqueue, but still remain on top yourself. Is this
>> possible with the current code?
>>
>> Maybe we should prefer vcpus on the same runqueue as yield_to targets,
>> and only fall back to remote vcpus when we see it didn't help.
>>
>> Let's examine a few cases:
>>
>> 1. spinner on cpu 0, lock holder on cpu 0
>>
>> win!
>>
>> 2. spinner on cpu 0, random vcpu(s) (or normal processes) on cpu 0
>>
>> Spinner gets put to sleep, random vcpus get to work, low lock contention
>> (no double_rq_lock), by the time spinner gets scheduled we might have won
>>
>> 3. spinner on cpu 0, another spinner on cpu 0
>>
>> Worst case, we'll just spin some more. Need to detect this case and
>> migrate something in.
>
> Well, we can certainly experiment and see what we get.
>
> IMO, the key to getting this working really well on the large VMs is
> finding the lock-holding cpu -quickly-. What I think is happening is
> that we go through a relatively long process to get to that one right
> vcpu. I guess I need to find a faster way to get there.
pvspinlocks will find the right one, every time. Otherwise I see no way
to do this.
--
error compiling committee.c: too many arguments to function