Re: [PATCH v6 3/7] KVM: MMU: fast invalidate all pages

From: Gleb Natapov
Date: Tue May 21 2013 - 04:39:16 EST


On Mon, May 20, 2013 at 05:40:47PM -0300, Marcelo Tosatti wrote:
> On Mon, May 20, 2013 at 11:15:45PM +0300, Gleb Natapov wrote:
> > On Mon, May 20, 2013 at 04:46:24PM -0300, Marcelo Tosatti wrote:
> > > On Fri, May 17, 2013 at 05:12:58AM +0800, Xiao Guangrong wrote:
> > > > The current kvm_mmu_zap_all is really slow - it is holding mmu-lock to
> > > > walk and zap all shadow pages one by one, also it need to zap all guest
> > > > page's rmap and all shadow page's parent spte list. Particularly, things
> > > > become worse if guest uses more memory or vcpus. It is not good for
> > > > scalability
> > > >
> > > > In this patch, we introduce a faster way to invalidate all shadow pages.
> > > > KVM maintains a global mmu invalid generation-number which is stored in
> > > > kvm->arch.mmu_valid_gen and every shadow page stores the current global
> > > > generation-number into sp->mmu_valid_gen when it is created
> > > >
> > > > When KVM need zap all shadow pages sptes, it just simply increase the
> > > > global generation-number then reload root shadow pages on all vcpus.
> > > > Vcpu will create a new shadow page table according to current kvm's
> > > > generation-number. It ensures the old pages are not used any more.
> > > > Then the invalid-gen pages (sp->mmu_valid_gen != kvm->arch.mmu_valid_gen)
> > > > are zapped by using lock-break technique
> > > >
> > > > Signed-off-by: Xiao Guangrong <xiaoguangrong@xxxxxxxxxxxxxxxxxx>
> > > > ---
> > > > arch/x86/include/asm/kvm_host.h | 2 +
> > > > arch/x86/kvm/mmu.c | 103 +++++++++++++++++++++++++++++++++++++++
> > > > arch/x86/kvm/mmu.h | 1 +
> > > > 3 files changed, 106 insertions(+), 0 deletions(-)
> > > >
> > > > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> > > > index 3741c65..bff7d46 100644
> > > > --- a/arch/x86/include/asm/kvm_host.h
> > > > +++ b/arch/x86/include/asm/kvm_host.h
> > > > @@ -222,6 +222,7 @@ struct kvm_mmu_page {
> > > > int root_count; /* Currently serving as active root */
> > > > unsigned int unsync_children;
> > > > unsigned long parent_ptes; /* Reverse mapping for parent_pte */
> > > > + unsigned long mmu_valid_gen;
> > > > DECLARE_BITMAP(unsync_child_bitmap, 512);
> > > >
> > > > #ifdef CONFIG_X86_32
> > > > @@ -529,6 +530,7 @@ struct kvm_arch {
> > > > unsigned int n_requested_mmu_pages;
> > > > unsigned int n_max_mmu_pages;
> > > > unsigned int indirect_shadow_pages;
> > > > + unsigned long mmu_valid_gen;
> > > > struct hlist_head mmu_page_hash[KVM_NUM_MMU_PAGES];
> > > > /*
> > > > * Hash table of struct kvm_mmu_page.
> > > > diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> > > > index 682ecb4..891ad2c 100644
> > > > --- a/arch/x86/kvm/mmu.c
> > > > +++ b/arch/x86/kvm/mmu.c
> > > > @@ -1839,6 +1839,11 @@ static void clear_sp_write_flooding_count(u64 *spte)
> > > > __clear_sp_write_flooding_count(sp);
> > > > }
> > > >
> > > > +static bool is_obsolete_sp(struct kvm *kvm, struct kvm_mmu_page *sp)
> > > > +{
> > > > + return unlikely(sp->mmu_valid_gen != kvm->arch.mmu_valid_gen);
> > > > +}
> > > > +
> > > > static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
> > > > gfn_t gfn,
> > > > gva_t gaddr,
> > > > @@ -1865,6 +1870,9 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
> > > > role.quadrant = quadrant;
> > > > }
> > > > for_each_gfn_sp(vcpu->kvm, sp, gfn) {
> > > > + if (is_obsolete_sp(vcpu->kvm, sp))
> > > > + continue;
> > > > +
> > >
> > > Whats the purpose of not using pages which are considered "obsolete" ?
> > >
> > The same as not using page that is invalid, to not reuse stale
> > information. The page may contain ptes that point to invalid slot.
>
> Any pages with stale information will be zapped by kvm_mmu_zap_all().
> When that happens, page faults will take place which will automatically
> use the new generation number.
>
> So still not clear why is this necessary.
>
This is not, strictly speaking, necessary, but it is the sane thing to do.
You cannot update page's generation number to prevent it from been
destroyed since after kvm_mmu_zap_all() completes stale ptes in the
shadow page may point to now deleted memslot. So why build shadow page
table with a page that is in a process of been destroyed?

> > > The only purpose of the generation number should be to guarantee forward
> > > progress of kvm_mmu_zap_all in face of allocators (kvm_mmu_get_page).
> > >
> > > That is, on allocation you store the generation number on the shadow page.
> > > The only purpose of that number in the shadow page is so that
> > > kvm_mmu_zap_all can skip deleting shadow pages which have been created
> > > after kvm_mmu_zap_all began operation.
> > >
> > > > if (!need_sync && sp->unsync)
> > > > need_sync = true;
> > > >
> > > > @@ -1901,6 +1909,7 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
> > > >
> > > > account_shadowed(vcpu->kvm, gfn);
> > > > }
> > > > + sp->mmu_valid_gen = vcpu->kvm->arch.mmu_valid_gen;
> > > > init_shadow_page_table(sp);
> > > > trace_kvm_mmu_get_page(sp, true);
> > > > return sp;
> > > > @@ -2071,8 +2080,10 @@ static int kvm_mmu_prepare_zap_page(struct kvm *kvm, struct kvm_mmu_page *sp,
> > > > ret = mmu_zap_unsync_children(kvm, sp, invalid_list);
> > > > kvm_mmu_page_unlink_children(kvm, sp);
> > > > kvm_mmu_unlink_parents(kvm, sp);
> > > > +
> > > > if (!sp->role.invalid && !sp->role.direct)
> > > > unaccount_shadowed(kvm, sp->gfn);
> > > > +
> > > > if (sp->unsync)
> > > > kvm_unlink_unsync_page(kvm, sp);
> > > >
> > > > @@ -4196,6 +4207,98 @@ restart:
> > > > spin_unlock(&kvm->mmu_lock);
> > > > }
> > > >
> > > > +static void kvm_zap_obsolete_pages(struct kvm *kvm)
> > > > +{
> > > > + struct kvm_mmu_page *sp, *node;
> > > > + LIST_HEAD(invalid_list);
> > > > +
> > > > +restart:
> > > > + list_for_each_entry_safe_reverse(sp, node,
> > > > + &kvm->arch.active_mmu_pages, link) {
> > > > + /*
> > > > + * No obsolete page exists before new created page since
> > > > + * active_mmu_pages is the FIFO list.
> > > > + */
> > > > + if (!is_obsolete_sp(kvm, sp))
> > > > + break;
> > > > +
> > > > + /*
> > > > + * Do not repeatedly zap a root page to avoid unnecessary
> > > > + * KVM_REQ_MMU_RELOAD, otherwise we may not be able to
> > > > + * progress:
> > > > + * vcpu 0 vcpu 1
> > > > + * call vcpu_enter_guest():
> > > > + * 1): handle KVM_REQ_MMU_RELOAD
> > > > + * and require mmu-lock to
> > > > + * load mmu
> > > > + * repeat:
> > > > + * 1): zap root page and
> > > > + * send KVM_REQ_MMU_RELOAD
> > > > + *
> > > > + * 2): if (cond_resched_lock(mmu-lock))
> > > > + *
> > > > + * 2): hold mmu-lock and load mmu
> > > > + *
> > > > + * 3): see KVM_REQ_MMU_RELOAD bit
> > > > + * on vcpu->requests is set
> > > > + * then return 1 to call
> > > > + * vcpu_enter_guest() again.
> > > > + * goto repeat;
> > > > + *
> > > > + */
> > > > + if (sp->role.invalid)
> > > > + continue;
> > > > + /*
> > > > + * Need not flush tlb since we only zap the sp with invalid
> > > > + * generation number.
> > > > + */
> > > > + if (cond_resched_lock(&kvm->mmu_lock))
> > > > + goto restart;
> > >
> > > Please don't do this, just release all pages and flush the TLB when
> > > dropping mmu_lock instead. This means for other mmu_lock users is that
> > > there is a consistent state.
> > You mean call kvm_mmu_commit_zap_page() on current invalid_list before restart?
> > If yes, I agree.
>
> I mean leaving pages which have been "partially deleted" while releasing
> mmu_lock. Say for example upon the need to delete a given page you do:
>
> spin_lock(mmu_lock)
> shadow page = lookup page on hash list
> if (shadow page is not found)
> proceed assuming the page has been properly deleted (*)
> spin_unlock(mmu_lock)
>
> At (*) we assume it has been 1) deleted and 2) TLB flushed.
>
> So its better to just
>
> if (need_resched()) {
> kvm_mmu_complete_zap_page(&list);
> cond_resched_lock(&kvm->mmu_lock);
> }
>
OK, this what I said above.

> If you want to collapse TLB flushes, please do it in a later patch.
>
Not sure what you mean again. We flush TLB once before entering this function.
kvm_reload_remote_mmus() does this for us, no?

> > > > + if (kvm_mmu_prepare_zap_page(kvm, sp, &invalid_list))
> > > > + goto restart;
> > > > + }
> > > > +
> > > > + /*
> > > > + * Should flush tlb before free page tables since lockless-walking
> > > > + * may use the pages.
> > > > + */
> > > > + kvm_mmu_commit_zap_page(kvm, &invalid_list);
> > > > +}
> > > > +
> > > > +/*
> > > > + * Fast invalidate all shadow pages.
> > > > + *
> > > > + * @zap_obsolete_pages indicates whether all the obsolete pages should
> > > > + * be zapped. This is required when memslot is being deleted or VM is
> > > > + * being destroyed, in these cases, we should ensure that KVM MMU does
> > > > + * not use any resource of the being-deleted slot or all slots after
> > > > + * calling the function.
> > > > + *
> > > > + * @zap_obsolete_pages == false means the caller just wants to flush all
> > > > + * shadow page tables.
> > > > + */
> > > > +void kvm_mmu_invalidate_all_pages(struct kvm *kvm, bool zap_obsolete_pages)
> > > > +{
> > > > + spin_lock(&kvm->mmu_lock);
> > > > + kvm->arch.mmu_valid_gen++;
> > > > +
> > > > + /*
> > > > + * Notify all vcpus to reload its shadow page table
> > > > + * and flush TLB. Then all vcpus will switch to new
> > > > + * shadow page table with the new mmu_valid_gen.
> > >
> > > Only if you zap the roots, which we agreed would be a second step, after
> > > being understood its necessary.
> > >
> > I've lost you here. The patch implement what was agreed upon.
>
> "
> + /*
> + * Notify all vcpus to reload its shadow page table
> + * and flush TLB. Then all vcpus will switch to new
> + * shadow page table with the new mmu_valid_gen.
> "
>
> What was suggested was... go to phrase which starts with "The only purpose
> of the generation number should be to".
>
> The comment quoted here does not match that description.
>
The comment describes what code does and in this it is correct.

You propose to not reload roots right away and do it only when root sp
is encountered, right? So my question is what's the point? There are,
obviously, root sps with invalid generation number at this point, so
reload will happen regardless in kvm_mmu_prepare_zap_page(). So why not
do it here right away and avoid it in kvm_mmu_prepare_zap_page() for
invalid and obsolete sps as I proposed in one of my email?

--
Gleb.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/