Re: [PATCH v4 4/6] KVM: MMU: fast invalid all shadow pages

From: Xiao Guangrong
Date: Mon May 06 2013 - 01:15:42 EST


On 05/04/2013 08:52 AM, Marcelo Tosatti wrote:
> On Sat, May 04, 2013 at 12:51:06AM +0800, Xiao Guangrong wrote:
>> On 05/03/2013 11:53 PM, Marcelo Tosatti wrote:
>>> On Fri, May 03, 2013 at 01:52:07PM +0800, Xiao Guangrong wrote:
>>>> On 05/03/2013 09:05 AM, Marcelo Tosatti wrote:
>>>>
>>>>>> +
>>>>>> +/*
>>>>>> + * Fast invalid all shadow pages belong to @slot.
>>>>>> + *
>>>>>> + * @slot != NULL means the invalidation is caused the memslot specified
>>>>>> + * by @slot is being deleted, in this case, we should ensure that rmap
>>>>>> + * and lpage-info of the @slot can not be used after calling the function.
>>>>>> + *
>>>>>> + * @slot == NULL means the invalidation due to other reasons, we need
>>>>>> + * not care rmap and lpage-info since they are still valid after calling
>>>>>> + * the function.
>>>>>> + */
>>>>>> +void kvm_mmu_invalid_memslot_pages(struct kvm *kvm,
>>>>>> + struct kvm_memory_slot *slot)
>>>>>> +{
>>>>>> + spin_lock(&kvm->mmu_lock);
>>>>>> + kvm->arch.mmu_valid_gen++;
>>>>>> +
>>>>>> + /*
>>>>>> + * All shadow paes are invalid, reset the large page info,
>>>>>> + * then we can safely desotry the memslot, it is also good
>>>>>> + * for large page used.
>>>>>> + */
>>>>>> + kvm_clear_all_lpage_info(kvm);
>>>>>
>>>>> Xiao,
>>>>>
>>>>> I understood it was agreed that simple mmu_lock lockbreak while
>>>>> avoiding zapping of newly instantiated pages upon a
>>>>>
>>>>> if(spin_needbreak)
>>>>> cond_resched_lock()
>>>>>
>>>>> cycle was enough as a first step? And then later introduce root zapping
>>>>> along with measurements.
>>>>>
>>>>> https://lkml.org/lkml/2013/4/22/544
>>>>
>>>> Yes, it is.
>>>>
>>>> See the changelog in 0/0:
>>>>
>>>> " we use lock-break technique to zap all sptes linked on the
>>>> invalid rmap, it is not very effective but good for the first step."
>>>>
>>>> Thanks!
>>>
>>> Sure, but what is up with zeroing kvm_clear_all_lpage_info(kvm) and
>>> zapping the root? Only lock-break technique along with generation number
>>> was what was agreed.
>>
>> Marcelo,
>>
>> Please Wait... I am completely confused. :(
>>
>> Let's clarify "zeroing kvm_clear_all_lpage_info(kvm) and zapping the root" first.
>> Are these changes you wanted?
>>
>> void kvm_mmu_invalid_memslot_pages(struct kvm *kvm,
>> struct kvm_memory_slot *slot)
>> {
>> spin_lock(&kvm->mmu_lock);
>> kvm->arch.mmu_valid_gen++;
>>
>> /* Zero all root pages.*/
>> restart:
>> list_for_each_entry_safe(sp, node, &kvm->arch.active_mmu_pages, link) {
>> if (!sp->root_count)
>> continue;
>>
>> if (kvm_mmu_prepare_zap_page(kvm, sp, &invalid_list))
>> goto restart;
>> }
>>
>> /*
>> * All shadow paes are invalid, reset the large page info,
>> * then we can safely desotry the memslot, it is also good
>> * for large page used.
>> */
>> kvm_clear_all_lpage_info(kvm);
>>
>> kvm_mmu_commit_zap_page(kvm, &invalid_list);
>> spin_unlock(&kvm->mmu_lock);
>> }
>>
>> static void rmap_remove(struct kvm *kvm, u64 *spte)
>> {
>> struct kvm_mmu_page *sp;
>> gfn_t gfn;
>> unsigned long *rmapp;
>>
>> sp = page_header(__pa(spte));
>> +
>> + /* Let invalid sp do not access its rmap. */
>> + if (!sp_is_valid(sp))
>> + return;
>> +
>> gfn = kvm_mmu_page_get_gfn(sp, spte - sp->spt);
>> rmapp = gfn_to_rmap(kvm, gfn, sp->role.level);
>> pte_list_remove(spte, rmapp);
>> }
>>
>> If yes, there is the reason why we can not do this that i mentioned before:
>>
>> after call kvm_mmu_invalid_memslot_pages(), the memslot->rmap will be destroyed.
>> Later, if host reclaim page, the mmu-notify handlers, ->invalidate_page and
>> ->invalidate_range_start, can not find any spte using the host page, then
>> Accessed/Dirty for host page is missing tracked.
>> (missing call kvm_set_pfn_accessed and kvm_set_pfn_dirty properly.)
>>
>> What's your idea?
>
>
> Step 1) Fix kvm_mmu_zap_all's behaviour: introduce lockbreak via
> spin_needbreak. Use generation numbers so that in case kvm_mmu_zap_all
> releases mmu_lock and reacquires it again, only shadow pages
> from the generation with which kvm_mmu_zap_all started are zapped (this
> guarantees forward progress and eventual termination).
>
> kvm_mmu_zap_generation()
> spin_lock(mmu_lock)
> int generation = kvm->arch.mmu_generation;
>
> for_each_shadow_page(sp) {
> if (sp->generation == kvm->arch.mmu_generation)
> zap_page(sp)
> if (spin_needbreak(mmu_lock)) {
> kvm->arch.mmu_generation++;
> cond_resched_lock(mmu_lock);
> }
> }
>
> kvm_mmu_zap_all()
> spin_lock(mmu_lock)
> for_each_shadow_page(sp) {
> if (spin_needbreak(mmu_lock)) {
> cond_resched_lock(mmu_lock);
> }
> }
>
> Use kvm_mmu_zap_generation for kvm_arch_flush_shadow_memslot.
> Use kvm_mmu_zap_all for kvm_mmu_notifier_release,kvm_destroy_vm.
>
> This addresses the main problem: excessively long hold times
> of kvm_mmu_zap_all with very large guests.
>
> Do you see any problem with this logic? This was what i was thinking
> we agreed.

No. I understand it and it can work.

Actually, it is similar with Gleb's idea that "zapping stale shadow pages
(and uses lock break technique)", after some discussion, we thought "only zap
shadow pages that are reachable from the slot's rmap" is better, that is this
patchset does.
(https://lkml.org/lkml/2013/4/23/73)

>
> Step 2) Show that the optimization to zap only the roots is worthwhile
> via benchmarking, and implement it.

This is what i am confused. I can not understand how "zap only the roots"
works. You mean these change?

kvm_mmu_zap_generation()
spin_lock(mmu_lock)
int generation = kvm->arch.mmu_generation;

for_each_shadow_page(sp) {
/* Change here. */
=> if ((sp->generation == kvm->arch.mmu_generation) &&
=> sp->root_count)
zap_page(sp)

if (spin_needbreak(mmu_lock)) {
kvm->arch.mmu_generation++;
cond_resched_lock(mmu_lock);
}
}

If we do this, there will have shadow pages that are linked to invalid memslot's
rmap. How do we handle these pages and the mmu-notify issue?

Thanks!


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/