Re: [PATCH 3/4] KVM: nSVM: invalidate cached PDPTRs across nested NPT transitions
From: Jim Mattson
Date: Thu May 28 2026 - 14:39:41 EST
On Thu, May 28, 2026 at 1:44 AM Paolo Bonzini <pbonzini@xxxxxxxxxx> wrote:
>
> On 5/28/26 01:53, Sean Christopherson wrote:
> > I would rather use kvm_clear_available_registers() than add yet another API,
> > which isn't even a good fit here since SVM never expects the PDPTRs to be dirty.
> >
> > Though I think it's a moot point, because nSVM should be clearing *all*
> > lazy-loaded registers. It just so happens that PDPTRs are the only such "register".
> >
> > I haven't checked to see if this would actually be correct, I'm just mimicking
> > the nVMX code. But conceptually, I think we want something like so:
> > diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
> > index e74fcde6155e..0c6ab00766b1 100644
> > --- a/arch/x86/kvm/svm/svm.c
> > +++ b/arch/x86/kvm/svm/svm.c
> > @@ -1303,6 +1303,8 @@ void svm_switch_vmcb(struct vcpu_svm *svm, struct kvm_vmcb_info *target_vmcb)
> > {
> > svm->current_vmcb = target_vmcb;
> > svm->vmcb = target_vmcb->ptr;
> > +
> > + kvm_clear_available_registers(&svm->vcpu, SVM_REGS_LAZY_LOAD_SET);
> > }
> >
> > static int svm_vcpu_precreate(struct kvm *kvm)
>
> I actually started with something like that. nSVM should indeed be
> clearing all other lazy-loaded registers too (except there isn't any).
> I ended up doing it the other way for two reasons, both boiling down to
> PDPTRs being weird.
>
> First, the PDPTRs are cached processor state but not a field of the
> VMCB; changing the VMCB should have no effect on them. For SVM, changes
> to their cache state are purely a result of writes to CR3 or CR4.PAE.
Huh?
The SDM says, "The behavior of PAE mode in a nested-paging guest
differs slightly from the behavior of (host-only) legacy PAE mode, in
that the guest’s four PDPEs are not loaded into the processor at the
time CR3 is written. Instead, the PDPEs are accessed on demand as part
of a table walk. This has the side-effect that illegal bit
combinations in the PDPEs are not signaled at the time that CR3 is
written, but instead when the faulty PDPE is accessed as part of a
table walk.
So, they are only cached as part of a partial walk, just like the
entries at any other level. And, unlike Intel, changes to the PDPTRs
in memory *may* be visible in a future page walk without changing CR3.
> Thus it seemed more explicit to single out the PAE+nNPT case when
> loading CR3:
>
> if (nested_npt)
> kvm_register_mark_for_reload
> else if (!load_pdptrs(...))
> return -EINVAL;
>
> instead of the slightly more opaque
>
> if (!nested_npt && !load_pdptrs(...))
> return -EINVAL;
>
> Second, with nNPT it's not even entirely correct to cache them at all.
> Strictly speaking they should never even be marked available, that way
> you'd go through load_pdptrs() on every PAE memory access. Thus
> kvm_register_mark_for_reload() could also be tied to the CR4 write, but
> in any case not to the VMCB switch.
>
> I'd rather keep it like this but yeah, it can be done either way. If
> there were other registers in the lazy load set, relying on
> svm_switch_vmcb() would make sense but you'd want to add a comment above
> the two instances of "if (!nested_npt && !load_pdptrs())".
>
> Paolo
>
>