Re: [RFC PATCH] kvm/x86: Keep root hpa in prev_roots as much as possible

From: Lai Jiangshan
Date: Mon Aug 02 2021 - 21:19:31 EST


On Fri, Jul 30, 2021 at 2:06 AM Sean Christopherson <seanjc@xxxxxxxxxx> wrote:
>
> On Wed, May 26, 2021, Lai Jiangshan wrote:
> > From: Lai Jiangshan <laijs@xxxxxxxxxxxxxxxxx>
> >
> > Pagetable roots in prev_roots[] are likely to be reused soon and
> > there is no much overhead to keep it with a new need_sync field
> > introduced.
> >
> > With the help of the new need_sync field, pagetable roots are
> > kept as much as possible, and they will be re-synced before reused
> > instead of being dropped.
> >
> > Signed-off-by: Lai Jiangshan <laijs@xxxxxxxxxxxxxxxxx>
> > ---
> >
> > This patch is just for RFC.
> > Is the idea Ok?
>
> Yes, the idea is definitely a good one.
>
> > If the idea is Ok, we need to reused one bit from pgd or hpa
> > as need_sync to save memory. Which one is better?
>
> Ha, we can do this without increasing the memory footprint and without co-opting
> a bit from pgd or hpa. Because of compiler alignment/padding, the u8s and bools
> between mmu_role and prev_roots already occupy 8 bytes, even though the actual
> size is 4 bytes. In total, we need room for 4 roots (3 previous + current), i.e.
> 4 bytes. If a separate array is used, no additional memory is consumed and no
> masking is needed when reading/writing e.g. pgd.
>
> The cost is an extra swap() when updating the prev_roots LRU, but that's peanuts
> and would likely be offset by masking anyways.
>
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 99f37781a6fc..13bb3c3a60b4 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -424,10 +424,12 @@ struct kvm_mmu {
> hpa_t root_hpa;
> gpa_t root_pgd;
> union kvm_mmu_role mmu_role;
> + bool root_unsync;
> u8 root_level;
> u8 shadow_root_level;
> u8 ept_ad;
> bool direct_map;
> + bool unsync_roots[KVM_MMU_NUM_PREV_ROOTS];
> struct kvm_mmu_root_info prev_roots[KVM_MMU_NUM_PREV_ROOTS];
>

Hello

I think it is too complicated. And it is hard to accept to put "unsync"
out of struct kvm_mmu_root_info when they should be bound to each other.

How about this:
- KVM_MMU_NUM_PREV_ROOTS
+ KVM_MMU_NUM_CACHED_ROOTS
- mmu->prev_roots[KVM_MMU_NUM_PREV_ROOTS]
+ mmu->cached_roots[KVM_MMU_NUM_CACHED_ROOTS]
- mmu->root_hpa
+ mmu->cached_roots[0].hpa
- mmu->root_pgd
+ mmu->cached_roots[0].pgd

And using the bit63 in @pgd as the information that it is not requested
to sync since the last sync.

Thanks
Lai.

> /*
>
>
> > arch/x86/include/asm/kvm_host.h | 3 ++-
> > arch/x86/kvm/mmu/mmu.c | 6 ++++++
> > arch/x86/kvm/vmx/nested.c | 12 ++++--------
> > arch/x86/kvm/x86.c | 9 +++++----
> > 4 files changed, 17 insertions(+), 13 deletions(-)
> >
> > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> > index 55efbacfc244..19a337cf7aa6 100644
> > --- a/arch/x86/include/asm/kvm_host.h
> > +++ b/arch/x86/include/asm/kvm_host.h
> > @@ -354,10 +354,11 @@ struct rsvd_bits_validate {
> > struct kvm_mmu_root_info {
> > gpa_t pgd;
> > hpa_t hpa;
> > + bool need_sync;
>
> Hmm, use "unsync" instead of "need_sync", purely to match the existing terminology
> in KVM's MMU for this sort of behavior.
>
> > };
> >
> > #define KVM_MMU_ROOT_INFO_INVALID \
> > - ((struct kvm_mmu_root_info) { .pgd = INVALID_PAGE, .hpa = INVALID_PAGE })
> > + ((struct kvm_mmu_root_info) { .pgd = INVALID_PAGE, .hpa = INVALID_PAGE, .need_sync = true})
> >
> > #define KVM_MMU_NUM_PREV_ROOTS 3
> >
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index 5e60b00e8e50..147827135549 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -3878,6 +3878,7 @@ static bool cached_root_available(struct kvm_vcpu *vcpu, gpa_t new_pgd,
> >
> > root.pgd = mmu->root_pgd;
> > root.hpa = mmu->root_hpa;
> > + root.need_sync = false;
> >
> > if (is_root_usable(&root, new_pgd, new_role))
> > return true;
> > @@ -3892,6 +3893,11 @@ static bool cached_root_available(struct kvm_vcpu *vcpu, gpa_t new_pgd,
> > mmu->root_hpa = root.hpa;
> > mmu->root_pgd = root.pgd;
> >
> > + if (i < KVM_MMU_NUM_PREV_ROOTS && root.need_sync) {
>
> Probably makes sense to write this as:
>
> if (i >= KVM_MMU_NUM_PREV_ROOTS)
> return false;
>
> if (root.need_sync) {
> kvm_make_request(KVM_REQ_MMU_SYNC, vcpu);
> kvm_make_request(KVM_REQ_TLB_FLUSH_CURRENT, vcpu);
> }
> return true;
>
> The "i < KVM_MMU_NUM_PREV_ROOTS == success" logic is just confusing enough that
> it'd be nice to write it only once.
>
> And that would also play nicely with deferring a sync for the "current" root
> (see below), e.g.
>
> ...
> unsync = mmu->root_unsync;
>
> if (is_root_usable(&root, new_pgd, new_role))
> goto found_root;
>
> for (i = 0; i < KVM_MMU_NUM_PREV_ROOTS; i++) {
> swap(root, mmu->prev_roots[i]);
> swap(unsync, mmu->unsync_roots[i]);
>
> if (is_root_usable(&root, new_pgd, new_role))
> break;
> }
>
> if (i >= KVM_MMU_NUM_PREV_ROOTS)
> return false;
>
> found_root:
> if (unsync) {
> kvm_make_request(KVM_REQ_MMU_SYNC, vcpu);
> kvm_make_request(KVM_REQ_TLB_FLUSH_CURRENT, vcpu);
> }
> return true;
>
> > + kvm_make_request(KVM_REQ_MMU_SYNC, vcpu);
> > + kvm_make_request(KVM_REQ_TLB_FLUSH_CURRENT, vcpu);
> > + }
> > +
> > return i < KVM_MMU_NUM_PREV_ROOTS;
> > }
> >
> > diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
> > index 6058a65a6ede..ab7069ac6dc5 100644
> > --- a/arch/x86/kvm/vmx/nested.c
> > +++ b/arch/x86/kvm/vmx/nested.c
> > @@ -5312,7 +5312,7 @@ static int handle_invept(struct kvm_vcpu *vcpu)
> > {
> > struct vcpu_vmx *vmx = to_vmx(vcpu);
> > u32 vmx_instruction_info, types;
> > - unsigned long type, roots_to_free;
> > + unsigned long type;
> > struct kvm_mmu *mmu;
> > gva_t gva;
> > struct x86_exception e;
> > @@ -5361,29 +5361,25 @@ static int handle_invept(struct kvm_vcpu *vcpu)
> > return nested_vmx_fail(vcpu,
> > VMXERR_INVALID_OPERAND_TO_INVEPT_INVVPID);
> >
> > - roots_to_free = 0;
> > if (nested_ept_root_matches(mmu->root_hpa, mmu->root_pgd,
> > operand.eptp))
> > - roots_to_free |= KVM_MMU_ROOT_CURRENT;
> > + kvm_mmu_free_roots(vcpu, mmu, KVM_MMU_ROOT_CURRENT);
>
> For a non-RFC series, I think this should do two things:
>
> 1. Separate INVEPT from INVPCID, i.e. do only INVPCID first.
> 2. Enhance INVEPT to SYNC+FLUSH the current root instead of freeing it
>
> As alluded to above, this can be done by deferring the sync+flush (which can't
> be done right away because INVEPT runs in L1 context, whereas KVM needs to sync+flush
> L2 EPT context).
>
> > for (i = 0; i < KVM_MMU_NUM_PREV_ROOTS; i++) {
> > if (nested_ept_root_matches(mmu->prev_roots[i].hpa,
> > mmu->prev_roots[i].pgd,
> > operand.eptp))
> > - roots_to_free |= KVM_MMU_ROOT_PREVIOUS(i);
> > + mmu->prev_roots[i].need_sync = true;
> > }
> > break;
> > case VMX_EPT_EXTENT_GLOBAL:
> > - roots_to_free = KVM_MMU_ROOTS_ALL;
> > + kvm_mmu_free_roots(vcpu, mmu, KVM_MMU_ROOTS_ALL);
> > break;
> > default:
> > BUG();
> > break;
> > }
> >
> > - if (roots_to_free)
> > - kvm_mmu_free_roots(vcpu, mmu, roots_to_free);
> > -
> > return nested_vmx_succeed(vcpu);
> > }