Re: [PATCH v7 04/11] KVM: MMU: zap pages in batch
From: Xiao Guangrong
Date: Wed May 29 2013 - 09:09:24 EST
On 05/29/2013 07:11 PM, Marcelo Tosatti wrote:
> On Tue, May 28, 2013 at 11:02:09PM +0800, Xiao Guangrong wrote:
>> On 05/28/2013 08:18 AM, Marcelo Tosatti wrote:
>>> On Mon, May 27, 2013 at 10:20:12AM +0800, Xiao Guangrong wrote:
>>>> On 05/25/2013 04:34 AM, Marcelo Tosatti wrote:
>>>>> On Thu, May 23, 2013 at 03:55:53AM +0800, Xiao Guangrong wrote:
>>>>>> Zap at least 10 pages before releasing mmu-lock to reduce the overhead
>>>>>> caused by reacquiring the lock.
>>>>>>
>>>>>> After the patch, kvm_zap_obsolete_pages can make forward progress anyway,
>>>>>> so the comments are updated accordingly.
>>>>>>
>>>>>> [ It improves kernel building 0.6% ~ 1% ]
>>>>>
>>>>> Can you please describe the overload in more detail? Under what scenario
>>>>> is kernel building improved?
>>>>
>>>> Yes.
>>>>
>>>> The scenario is: we do a kernel build and, meanwhile, repeatedly read the
>>>> PCI ROM once every second.
>>>>
>>>> [
>>>> echo 1 > /sys/bus/pci/devices/0000\:00\:03.0/rom
>>>> cat /sys/bus/pci/devices/0000\:00\:03.0/rom > /dev/null
>>>> ]
>>>
>>> Can't see why it reflects real world scenario (or a real world
>>> scenario with same characteristics regarding kvm_mmu_zap_all vs faults)?
>>>
>>> Point is, it would be good to understand why this change
>>> is improving performance? What are these cases where breaking out of
>>> kvm_mmu_zap_all due to either (need_resched || spin_needbreak) on zapped
>>> < 10 ?
>>
>> When the guest reads the ROM, QEMU resets the memory region to map the
>> device's firmware; that is why kvm_mmu_zap_all can be called in this scenario.
>>
>> The reasons why it hurts performance are:
>> 1): QEMU uses a global io-lock to synchronize all vcpus, so the io-lock is
>> held while we do kvm_mmu_zap_all(). If kvm_mmu_zap_all() is not efficient,
>> all the other vcpus have to wait a long time to do I/O.
>>
>> 2): kvm_mmu_zap_all() is triggered in vcpu context, so it can block IPI
>> requests from other vcpus.
>>
>> Is it enough?
>
> That is no problem. The problem is why you chose "10" as the minimum number of
> pages to zap before considering reschedule. I would expect the need to
Well, my description above explains why batch-zapping is needed - we do
not want a vcpu to spend a long time zapping all pages, because that hurts
the other running vcpus.
But why is the batch size "10"... I cannot justify that number precisely; I
just guessed that "10" keeps a vcpu from spending too long in zap_all_pages
while also not starving waiters on mmu-lock. "10" is a speculative value and
I am not sure it is the best one, but at least I think it can work.
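To make that concrete, here is a minimal sketch of the batch-zapping loop,
using the identifiers from the trace diff quoted below (is_obsolete_sp() and
the prepare/commit zap helpers are from this series; the exact code in the
patch may differ slightly):

#define BATCH_ZAP_PAGES	10

static void kvm_zap_obsolete_pages(struct kvm *kvm)
{
	struct kvm_mmu_page *sp, *node;
	LIST_HEAD(invalid_list);
	int batch = 0;

restart:
	list_for_each_entry_safe_reverse(sp, node,
	      &kvm->arch.active_mmu_pages, link) {
		int ret;

		/*
		 * active_mmu_pages is a FIFO list, so once a non-obsolete
		 * page is found, no obsolete page can follow it.
		 */
		if (!is_obsolete_sp(kvm, sp))
			break;

		/*
		 * Only consider yielding mmu-lock once at least
		 * BATCH_ZAP_PAGES pages have been zapped, so that every
		 * lock-hold is guaranteed to make forward progress.
		 */
		if (batch >= BATCH_ZAP_PAGES &&
		      cond_resched_lock(&kvm->mmu_lock)) {
			batch = 0;
			goto restart;
		}

		ret = kvm_mmu_prepare_zap_page(kvm, sp, &invalid_list);
		batch += ret;

		if (ret)
			goto restart;
	}

	/* Free the zapped pages and flush the TLBs in one go. */
	kvm_mmu_commit_zap_page(kvm, &invalid_list);
}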
> reschedule to be rare enough that one kvm_mmu_zap_all instance (between
> schedule in and schedule out) is able to release no less than a
> thousand pages.
Unfortunately, no.
The information below is from my reply to Gleb, in which he raised the
question of why the "collapse TLB flush" is needed:
======
It seems not.
Since we have reloaded the mmu before zapping the obsolete pages, the mmu-lock
is easily contended. I did a simple trace:
+	int num = 0;
 restart:
 	list_for_each_entry_safe_reverse(sp, node,
 	      &kvm->arch.active_mmu_pages, link) {
@@ -4265,6 +4265,7 @@ restart:
 		if (batch >= BATCH_ZAP_PAGES &&
 		      cond_resched_lock(&kvm->mmu_lock)) {
 			batch = 0;
+			num++;
 			goto restart;
 		}
@@ -4277,6 +4278,7 @@ restart:
 	 * may use the pages.
 	 */
 	kvm_mmu_commit_zap_page(kvm, &invalid_list);
+	printk("lock-break: %d.\n", num);
 }
I read the PCI ROM while doing a kernel build in a guest that has 1G memory
and 4 vcpus, with EPT enabled; this is a normal workload and a normal
configuration.
# dmesg
[ 2338.759099] lock-break: 8.
[ 2339.732442] lock-break: 5.
[ 2340.904446] lock-break: 3.
[ 2342.513514] lock-break: 3.
[ 2343.452229] lock-break: 3.
[ 2344.981599] lock-break: 4.
Basically, we need to break many times.
======
You can see that we still need to break at least 3 times to zap all the pages,
even though we zap 10 pages per batch. Obviously, it would need to break many
more times without batch-zapping.