Re: [PATCH v4 3/5] KVM: SVM: Fix nested NPF injection of PFERR_GUEST_{PAGE,FINAL}_MASK bits

From: Sean Christopherson

Date: Tue May 26 2026 - 14:50:20 EST

On Tue, May 26, 2026, Yosry Ahmed wrote:
> On Fri, May 22, 2026 at 4:27 PM Sean Christopherson <seanjc@xxxxxxxxxx> wrote:
> >
> > From: Kevin Cheng <chengkev@xxxxxxxxxx>
> >
> > Fix KVM's generation of PFERR_GUEST_{PAGE,FINAL}_MASK bits when injecting a
> > Nested Page Fault into L1. Currently, KVM blindly stuffs GUEST_FINAL into
> > L1, which is blatantly wrong given that KVM obviously generates NPFs for
> > page table accesses.
> >
> > There are two paths that trigger NPF injection: hardware NPF exits (from
> > L2) and emulation-triggered faults, i.e. when KVM detects a NPF as part of
> > emulating an L2 GVA access. For the hardware case, use the bits verbatim
> > from the VMCB, as KVM is simply forwarding a NPF to L1. For the emulation
> > case, propagate the GUEST_{PAGE,FINAL} bits from the access field (which
> > were recently added for MBEC+GMET support).
> >
> > To differentiate between the two cases, add "hardware_nested_page_fault"
> > to "struct x86_exception", and set it when injecting a NPF in response to
> > an NPF exit from L2.
>
> hardware_nested_page_fault is no more.

Hrm, I suspect I unintentionally discarded a changelog update; I distinctly
remember rewriting this. *sigh*

> > To help guard against future goofs, assert that exactly one of GUEST_PAGE
> > or GUEST_FINAL is set when injecting a NPF. Unlike VMX, there are no
> > (known) cases where hardware doesn't set either bit, and KVM should always
> > set one or the other when emulating a GVA access.
> >
> > Signed-off-by: Kevin Cheng <chengkev@xxxxxxxxxx>
> > [sean: use plumbed in @access bits, massage changelog]
> > Signed-off-by: Sean Christopherson <seanjc@xxxxxxxxxx>
> [..]
> > @@ -39,19 +39,32 @@ static void nested_svm_inject_npf_exit(struct kvm_vcpu *vcpu,
> > {
> > struct vcpu_svm *svm = to_svm(vcpu);
> > struct vmcb *vmcb = svm->vmcb;
> > + u64 fault_stage;
> >
> > - if (vmcb->control.exit_code != SVM_EXIT_NPF) {
> > - /*
> > - * TODO: track the cause of the nested page fault, and
> > - * correctly fill in the high bits of exit_info_1.
> > - */
> > - vmcb->control.exit_code = SVM_EXIT_NPF;
> > - vmcb->control.exit_info_1 = (1ULL << 32);
> > - vmcb->control.exit_info_2 = fault->address;
> > - }
> > + /*
> > + * For hardware NPF exits, the GUEST_FAULT_STAGE bits are only
> > + * available in the hardware exit_info_1, since the guest_mmu
> > + * walker doesn't know whether the faulting GPA was a page table
> > + * page or final page from L2's perspective.
> > + */
> > + if (from_hardware)
> > + fault_stage = vmcb->control.exit_info_1 &
> > + PFERR_GUEST_FAULT_STAGE_MASK;
> > + else
> > + fault_stage = fault->error_code & PFERR_GUEST_FAULT_STAGE_MASK;
> >
> > - vmcb->control.exit_info_1 &= ~0xffffffffULL;
> > - vmcb->control.exit_info_1 |= fault->error_code;
> > + /*
> > + * All nested page faults should be annotated as occurring on the
> > + * final translation *or* the page walk. Arbitrarily choose "final"
> > + * if KVM is buggy and enumerated both or neither.
> > + */
> > + if (WARN_ON_ONCE(hweight64(fault_stage) != 1))
> > + fault_stage = PFERR_GUEST_FINAL_MASK;
> > +
> > + vmcb->control.exit_code = SVM_EXIT_NPF;
> > + vmcb->control.exit_info_1 = fault_stage |
> > + (fault->error_code & ~PFERR_GUEST_FAULT_STAGE_MASK);
>
> Do we need to do this in the common path?

What do you mean by "this"? Pulling flags from fault->error_code?

> If from_hardware=true, can the fault injected by KVM have different flags
> from the one produced by hardware?

Flags, yes. fault_stage, no.

> I guess the answer is yes, (e.g. if KVM is doing write-protection?). Might be
> worth a comment.

Or if L1 has modified its TDP PTEs in memory, but hasn't yet flushed TLBs. In
that case, KVM's software walker can see the updated PTEs, while hardware may
have seen something else.
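
To make the "flags, yes; fault_stage, no" split concrete, here is a minimal
userspace sketch of how the injected exit_info_1 is composed. It is not the
kernel code: build_npf_exit_info_1() and has_exactly_one_bit() are hypothetical
names, the WARN is replaced by a silent fallback, and
PFERR_GUEST_FAULT_STAGE_MASK is assumed to be the union of the GUEST_FINAL
(bit 32) and GUEST_PAGE (bit 33) bits.

```c
#include <stdbool.h>
#include <stdint.h>

/* GUEST_FINAL is bit 32 and GUEST_PAGE is bit 33 of the NPF error code. */
#define PFERR_GUEST_FINAL_MASK		(1ULL << 32)
#define PFERR_GUEST_PAGE_MASK		(1ULL << 33)
/* Assumption: the stage mask is simply the union of the two bits above. */
#define PFERR_GUEST_FAULT_STAGE_MASK	\
	(PFERR_GUEST_FINAL_MASK | PFERR_GUEST_PAGE_MASK)

/* Hypothetical stand-in for "hweight64(x) == 1". */
static bool has_exactly_one_bit(uint64_t x)
{
	return x && !(x & (x - 1));
}

/*
 * Hypothetical helper mirroring the logic in the quoted hunk:
 *  - hw_exit_info_1: exit_info_1 as reported by hardware on the NPF exit
 *  - sw_error_code:  error code computed by the software walker
 *  - from_hardware:  true when forwarding a hardware NPF exit to L1
 */
static uint64_t build_npf_exit_info_1(uint64_t hw_exit_info_1,
				      uint64_t sw_error_code,
				      bool from_hardware)
{
	uint64_t fault_stage;

	/* The fault stage comes from hardware when forwarding a real exit. */
	if (from_hardware)
		fault_stage = hw_exit_info_1 & PFERR_GUEST_FAULT_STAGE_MASK;
	else
		fault_stage = sw_error_code & PFERR_GUEST_FAULT_STAGE_MASK;

	/* Fall back to "final" if both or neither stage bit is set. */
	if (!has_exactly_one_bit(fault_stage))
		fault_stage = PFERR_GUEST_FINAL_MASK;

	/* All other flags always come from the software walker. */
	return fault_stage | (sw_error_code & ~PFERR_GUEST_FAULT_STAGE_MASK);
}
```

E.g. if hardware reported GUEST_PAGE with only the present bit set, but the
software walker computed present+write (because it saw newer PTEs), the value
injected into L1 is GUEST_PAGE | present | write: the stage is hardware's, the
remaining flags are the walker's.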