Re: [PATCH 12/12] KVM: x86: do not unload MMU roots on all role changes

From: Sean Christopherson
Date: Mon Feb 14 2022 - 16:18:24 EST


On Mon, Feb 14, 2022, Paolo Bonzini wrote:
> On 2/11/22 19:48, Sean Christopherson wrote:
> > On Wed, Feb 09, 2022, Paolo Bonzini wrote:
> > > @@ -5045,8 +5046,8 @@ void kvm_mmu_after_set_cpuid(struct kvm_vcpu *vcpu)
> > > void kvm_mmu_reset_context(struct kvm_vcpu *vcpu)
> > > {
> > > - kvm_mmu_unload(vcpu);
> > > kvm_init_mmu(vcpu);
> > > + kvm_mmu_new_pgd(vcpu, vcpu->arch.cr3);
> >
> > This is too risky IMO, there are far more flows than just MOV CR0/CR4 that are
> > affected, e.g. SMM transitions, KVM_SET_SREG, etc...
>
> SMM exit does flush the TLB because RSM clears CR0.PG (I did check this :)).
> SMM re-entry then does not need to flush. But I don't think SMM exit should
> flush the TLB *for non-SMM roles*.

I'm not concerned about the TLB flush aspects so much as the addition of
kvm_mmu_new_pgd() in new paths.

> For KVM_SET_SREGS I'm not sure if it should flush the TLB, but I agree it is
> certainly safer to keep it that way.
>
> > Given that kvm_post_set_cr{0,4}() and kvm_vcpu_reset() explicitly handle CR0.PG
> > and CR4.SMEP toggling, I highly doubt the other flows are correct in all instances.
> > The call to kvm_mmu_new_pgd() is also
>
> *white noise*
>
> > To minimize risk, we should leave kvm_mmu_reset_context() as is (rename it if
> > necessary) and instead add a new helper to handle kvm_post_set_cr{0,4}(). In
> > the future we can/should work on avoiding unload in all paths, but again, future
> > problem.
>
> I disagree on this. There aren't many calls to kvm_mmu_reset_context.

All the more reason to do things incrementally. I have no objection to allowing
all flows to reuse a cached (or current) root, I'm objecting to converting them
all in a single patch.

> > > - if ((cr0 ^ old_cr0) & KVM_MMU_CR0_ROLE_BITS)
> > > + if ((cr0 ^ old_cr0) & KVM_MMU_CR0_ROLE_BITS) {
> > > + /* Flush the TLB if CR0 is changed 1 -> 0. */
> > > + if ((old_cr0 & X86_CR0_PG) && !(cr0 & X86_CR0_PG))
> > > + kvm_mmu_unload(vcpu);
> >
> > Calling kvm_mmu_unload() instead of requesting a flush isn't coherent with respect
> > to the comment, or with SMEP handling. And the SMEP handling isn't coherent with
> > respect to the changelog. Please elaborate :-)
>
> Yep, will do (the CR0.PG=0 case is similar to the CR0.PCIDE=0 case below).

Oh, you're freeing all roots to ensure a future MOV CR3 with NO_FLUSH and PCIDE=1
can't reuse a stale root. That's necessary if and only if the MMU is shadowing
the guest, non-nested TDP MMUs just need to flush the guest's TLB. The same is
true for the PCIDE case, i.e. we could optimize that too, though the main motivation
would be to clarify why all roots are unloaded.

> Using kvm_mmu_unload() avoids loading a cached root just to throw it away
> immediately after,

The shadow paging case will throw it away, but not the non-nested TDP MMU case?

> but I can change this to a new KVM_REQ_MMU_UPDATE_ROOT flag that does
>
> kvm_mmu_new_pgd(vcpu, vcpu->arch.cr3);

I don't think that's necessary, I was just confused by the discrepancy.

> By the way, I have a possibly stupid question. In kvm_set_cr3 (called e.g.
> from emulator_set_cr()) there is
>
> if (cr3 != kvm_read_cr3(vcpu))
> kvm_mmu_new_pgd(vcpu, cr3);
>
> What makes this work if mmu_is_nested(vcpu)?

Hmm, nothing. VMX is "broken" anyways because it will kick out to userspace with
X86EMUL_UNHANDLEABLE due to the CR3 intercept check. SVM is also broken in that
it doesn't check INTERCEPT_CR3_WRITE, e.g. will do the wrong thing even if L1 wants
to intercept CR3 accesses.

> Should this also have an "if (... & !tdp_enabled)"?

Yes? That should avoid the nested mess. This patch also needs to handle CR0 and
CR4 modifications if L2 is active, e.g. if L1 choose not to intercept CR0/CR4.
kvm_post_set_cr_reinit_mmu() would be a lovely landing spot for that check :-D