Re: [PATCH 09/21] KVM: TDX: Retry seamcall when TDX_OPERAND_BUSY with operand SEPT

From: Yan Zhao
Date: Sat Sep 14 2024 - 05:29:52 EST


On Fri, Sep 13, 2024 at 10:23:00AM -0700, Sean Christopherson wrote:
> On Fri, Sep 13, 2024, Yan Zhao wrote:
> > This is a lock status report of the TDX module for the current SEAMCALL retry
> > issue, based on the code in the TDX module public repo
> > https://github.com/intel/tdx-module.git, branch TDX_1.5.05.
> >
> > TL;DR:
> > - tdh_mem_track() can contend with tdh_vp_enter().
> > - tdh_vp_enter() contends with tdh_mem*() when 0-stepping is suspected.
>
> The zero-step logic seems to be the most problematic. E.g. if KVM is trying to
> install a page on behalf of two vCPUs, and KVM resumes the guest if it encounters
> a FROZEN_SPTE when building the non-leaf SPTEs, then one of the vCPUs could
> trigger the zero-step mitigation if the vCPU that "wins" gets delayed for
> whatever reason.
>
> Since FROZEN_SPTE is essentially a bit-spinlock with a reaaaaaly slow slow-path,
> what if instead of resuming the guest if a page fault hits FROZEN_SPTE, KVM retries
> the fault "locally", i.e. _without_ redoing tdh_vp_enter() to see if the vCPU still
> hits the fault?
>
> For non-TDX, resuming the guest and letting the vCPU retry the instruction is
> desirable because in many cases, the winning task will install a valid mapping
> before KVM can re-run the vCPU, i.e. the fault will be fixed before the
> instruction is re-executed. In the happy case, that provides optimal performance
> as KVM doesn't introduce any extra delay/latency.
>
> But for TDX, the math is different as the cost of re-hitting a fault is much,
> much higher, especially in light of the zero-step issues.
>
> E.g. if the TDP MMU returns a unique error code for the frozen case, and
> kvm_mmu_page_fault() is modified to return the raw return code instead of '1',
> then the TDX EPT violation path can safely retry locally, similar to the do-while
> loop in kvm_tdp_map_page().
>
> The only part I don't like about this idea is having two "retry" return values,
> which creates the potential for bugs due to checking one but not the other.
>
> Hmm, that could be avoided by passing a bool pointer as an out-param to communicate
> to the TDX S-EPT fault handler that the SPTE is frozen. I think I like that
> option better even though the out-param is a bit gross, because it makes it more
> obvious that the "frozen_spte" is a special case that doesn't need attention for
> most paths.
Good idea.
But could we extend it a bit more to allow TDX's EPT violation handler to also
retry directly when tdh_mem_sept_add()/tdh_mem_page_aug() returns BUSY?
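Roughly like this sketch (illustrative only, not against any real tree; the
handler shape and the "frozen_spte" out-param follow your idea #2 below, and
all names are assumptions):

static int tdx_handle_ept_violation(struct kvm_vcpu *vcpu, gpa_t gpa,
				    u64 error_code)
{
	bool frozen_spte;
	int r;

	do {
		if (signal_pending(current))
			return -EINTR;
		cond_resched();

		frozen_spte = false;
		r = kvm_mmu_page_fault(vcpu, gpa, error_code, NULL, 0,
				       &frozen_spte);
		/*
		 * Retry "locally", i.e. without tdh_vp_enter(), while the
		 * relevant SPTE is frozen (or a SEAMCALL returned BUSY), so
		 * that re-faulting doesn't trip the zero-step mitigation.
		 */
	} while (r > 0 && frozen_spte);

	return r;
}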

>
> > - tdg_mem_page_accept() can contend with other tdh_mem*().
> >
> > Proposal:
> > - Return -EAGAIN directly in ops link_external_spt/set_external_spte when
> > tdh_mem_sept_add()/tdh_mem_page_aug() returns BUSY.
> What is the result of returning -EAGAIN? E.g. does KVM redo tdh_vp_enter()?
Sorry, I meant -EBUSY originally.

With the current code in kvm_tdp_map_page(), the vCPU should just retry without
tdh_vp_enter(), except when there are signals pending.
With a real EPT violation, tdh_vp_enter() should be called again.

However, I realized that this is not good enough.
So, is it better to return -EAGAIN in the link_external_spt/set_external_spte
ops and have kvm_tdp_mmu_map() return RET_PF_RETRY_FROZEN (or maybe some other
name) for -EAGAIN?
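Roughly along these lines (a sketch only; the exact SEAMCALL wrappers and
error-code handling in the MMU part 2 series may differ, and
tdx_seamcall_to_mmu_errno() is just an assumed helper name):

static int tdx_seamcall_to_mmu_errno(u64 err)
{
	/*
	 * Treat a BUSY status as "retry"; real code would mask off the
	 * operand/detail bits before comparing against TDX_OPERAND_BUSY.
	 */
	if (err == TDX_OPERAND_BUSY)
		return -EAGAIN;

	return err ? -EIO : 0;
}

i.e. the link_external_spt/set_external_spte ops would return this errno, and
kvm_tdp_mmu_map() would translate -EAGAIN into RET_PF_RETRY_FROZEN instead of
plain RET_PF_RETRY.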

> Also tdh_mem_sept_add() is strictly pre-finalize, correct? I.e. should never
> contend with tdg_mem_page_accept() because vCPUs can't yet be run.
tdh_mem_page_add() is pre-finalize; tdh_mem_sept_add() is not.
tdh_mem_sept_add() can be called at runtime by tdp_mmu_link_sp(), i.e. via the
link_external_spt op when a new S-EPT page table is linked at fault time.


> Similarly, can tdh_mem_page_aug() actually contend with tdg_mem_page_accept()?
> The page isn't yet mapped, so why would the guest be allowed to take a lock on
> the S-EPT entry?
Before tdg_mem_page_accept() accepts a GPA and sets the RWX bits in a SPTE, if
a second tdh_mem_page_aug() is called on the same GPA, the second one may
contend with tdg_mem_page_accept().

But given that KVM does not allow a second tdh_mem_page_aug(), it looks like
the contention between tdh_mem_page_aug() and tdg_mem_page_accept() will not
happen.

>
> > - Kick off vCPUs at the beginning of the page removal path, i.e. before the
> > tdh_mem_range_block().
> > Set a flag and disallow tdh_vp_enter() until tdh_mem_page_remove() is done.
>
> This is easy enough to do via a request, e.g. see KVM_REQ_MCLOCK_INPROGRESS.
Great!
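Something like below is what I have in mind (sketch only;
KVM_REQ_TDX_SEPT_REMOVAL is a made-up request bit, mirroring the
KVM_REQ_MCLOCK_INPROGRESS pattern, and vcpu_enter_guest() would need a
corresponding kvm_test_request() check to bail out while it's pending):

static void tdx_block_td_enter(struct kvm *kvm)
{
	/*
	 * Kick all vCPUs out of the guest and leave the request pending so
	 * they cannot re-enter, i.e. no tdh_vp_enter() can contend with the
	 * page removal SEAMCALLs.
	 */
	kvm_make_all_cpus_request(kvm, KVM_REQ_TDX_SEPT_REMOVAL);
}

static void tdx_allow_td_enter(struct kvm *kvm)
{
	struct kvm_vcpu *vcpu;
	unsigned long i;

	/* Guest entries are allowed again once page removal is done. */
	kvm_for_each_vcpu(i, vcpu, kvm)
		kvm_clear_request(KVM_REQ_TDX_SEPT_REMOVAL, vcpu);
}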

>
> > (one possible optimization:
> > since contention from tdh_vp_enter()/tdg_mem_page_accept() should be rare,
> > do not kick off vCPUs in normal conditions.
> > When SEAMCALL BUSY happens, retry once, kick off vCPUs, and do not allow
>
> Which SEAMCALL is this specifically? tdh_mem_range_block()?
Yes, they are:
- tdh_mem_range_block() contends with tdh_vp_enter() for secure_ept_lock.
- tdh_mem_track() contends with tdh_vp_enter() for the TD epoch
  (the current code in MMU part 2 just retries tdh_mem_track() endlessly).
- tdh_mem_page_remove()/tdh_mem_range_block() contend with
  tdg_mem_page_accept() for the SEPT entry lock
  (this one should not happen on a sane guest).

Resources            SHARED users             EXCLUSIVE users
------------------------------------------------------------------------
(5) TDCS epoch       tdh_vp_enter             tdh_mem_track
------------------------------------------------------------------------
(6) secure_ept_lock  tdh_mem_sept_add         tdh_vp_enter
                     tdh_mem_page_aug         tdh_mem_sept_remove
                     tdh_mem_page_remove      tdh_mem_range_block
                                              tdh_mem_range_unblock
------------------------------------------------------------------------
(7) SEPT entry                                tdh_mem_sept_add
                                              tdh_mem_sept_remove
                                              tdh_mem_page_aug
                                              tdh_mem_page_remove
                                              tdh_mem_range_block
                                              tdh_mem_range_unblock
                                              tdg_mem_page_accept


>
> > TD enter until page removal completes.)
>
>
> Idea #1:
> ---
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index b45258285c9c..8113c17bd2f6 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -4719,7 +4719,7 @@ static int kvm_tdp_map_page(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code,
> return -EINTR;
> cond_resched();
> r = kvm_mmu_do_page_fault(vcpu, gpa, error_code, true, NULL, level);
> - } while (r == RET_PF_RETRY);
> + } while (r == RET_PF_RETRY || r == RET_PF_RETRY_FROZEN);
>
> if (r < 0)
> return r;
> @@ -6129,7 +6129,7 @@ int noinline kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, u64 err
> vcpu->stat.pf_spurious++;
>
> if (r != RET_PF_EMULATE)
> - return 1;
> + return r;
>
> emulate:
> return x86_emulate_instruction(vcpu, cr2_or_gpa, emulation_type, insn,
> diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
> index 8d3fb3c8c213..690f03d7daae 100644
> --- a/arch/x86/kvm/mmu/mmu_internal.h
> +++ b/arch/x86/kvm/mmu/mmu_internal.h
> @@ -256,12 +256,15 @@ int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault);
> * and of course kvm_mmu_do_page_fault().
> *
> * RET_PF_CONTINUE: So far, so good, keep handling the page fault.
> + * RET_PF_FIXED: The faulting entry has been fixed.
> * RET_PF_RETRY: let CPU fault again on the address.
> + * RET_PF_RETRY_FROZEN: One or more SPTEs related to the address is frozen.
> + * Let the CPU fault again on the address, or retry the
> + * fault "locally", i.e. without re-entering the guest.
> * RET_PF_EMULATE: mmio page fault, emulate the instruction directly.
> * RET_PF_WRITE_PROTECTED: the gfn is write-protected, either unprotected the
> * gfn and retry, or emulate the instruction directly.
> * RET_PF_INVALID: the spte is invalid, let the real page fault path update it.
> - * RET_PF_FIXED: The faulting entry has been fixed.
> * RET_PF_SPURIOUS: The faulting entry was already fixed, e.g. by another vCPU.
> *
> * Any names added to this enum should be exported to userspace for use in
> @@ -271,14 +274,18 @@ int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault);
> * on -errno return values. Somewhat arbitrarily use '0' for CONTINUE, which
> * will allow for efficient machine code when checking for CONTINUE, e.g.
> * "TEST %rax, %rax, JNZ", as all "stop!" values are non-zero.
> + *
> + * Note #2, RET_PF_FIXED _must_ be '1', so that KVM's -errno/0/1 return code
> + * scheme, where 1==success, translates '1' to RET_PF_FIXED.
> */
Looks "r > 0" represents success in vcpu_run()?
So, moving RET_PF_FIXED to 1 is not necessary?
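(The convention I have in mind, paraphrased as a sketch rather than an exact
excerpt from vcpu_run(): any r > 0 keeps the vCPU running, r == 0 exits to
userspace, r < 0 is an error.)

static int vcpu_run_convention_sketch(struct kvm_vcpu *vcpu)
{
	int r;

	for (;;) {
		r = vcpu_enter_guest(vcpu);
		if (r <= 0)
			break;
	}

	return r;
}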

> enum {
> RET_PF_CONTINUE = 0,
> + RET_PF_FIXED = 1,
> RET_PF_RETRY,
> + RET_PF_RETRY_FROZEN,
> RET_PF_EMULATE,
> RET_PF_WRITE_PROTECTED,
> RET_PF_INVALID,
> - RET_PF_FIXED,
> RET_PF_SPURIOUS,
> };
>
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index 5a475a6456d4..cbf9e46203f3 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -1174,6 +1174,8 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
>
> retry:
> rcu_read_unlock();
> + if (ret == RET_PF_RETRY && is_frozen_spte(iter.old_spte))
> + return RET_PF_RETRY_FROZEN;
> return ret;
> }
>
> ---
>
>
> Idea #2:
> ---
> arch/x86/include/asm/kvm_host.h | 2 +-
> arch/x86/kvm/mmu/mmu.c | 12 ++++++------
> arch/x86/kvm/mmu/mmu_internal.h | 15 ++++++++++++---
> arch/x86/kvm/mmu/tdp_mmu.c | 1 +
> arch/x86/kvm/svm/svm.c | 2 +-
> arch/x86/kvm/vmx/vmx.c | 4 ++--
> 6 files changed, 23 insertions(+), 13 deletions(-)
>
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 46e0a466d7fb..200fecd1de88 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -2183,7 +2183,7 @@ unsigned long __kvm_emulate_hypercall(struct kvm_vcpu *vcpu, unsigned long nr,
> int kvm_emulate_hypercall(struct kvm_vcpu *vcpu);
>
> int kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, u64 error_code,
> - void *insn, int insn_len);
> + void *insn, int insn_len, bool *frozen_spte);
> void kvm_mmu_print_sptes(struct kvm_vcpu *vcpu, gpa_t gpa, const char *msg);
> void kvm_mmu_invlpg(struct kvm_vcpu *vcpu, gva_t gva);
> void kvm_mmu_invalidate_addr(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index b45258285c9c..207840a316d3 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -4283,7 +4283,7 @@ void kvm_arch_async_page_ready(struct kvm_vcpu *vcpu, struct kvm_async_pf *work)
> return;
>
> r = kvm_mmu_do_page_fault(vcpu, work->cr2_or_gpa, work->arch.error_code,
> - true, NULL, NULL);
> + true, NULL, NULL, NULL);
>
> /*
> * Account fixed page faults, otherwise they'll never be counted, but
> @@ -4627,7 +4627,7 @@ int kvm_handle_page_fault(struct kvm_vcpu *vcpu, u64 error_code,
> trace_kvm_page_fault(vcpu, fault_address, error_code);
>
> r = kvm_mmu_page_fault(vcpu, fault_address, error_code, insn,
> - insn_len);
> + insn_len, NULL);
> } else if (flags & KVM_PV_REASON_PAGE_NOT_PRESENT) {
> vcpu->arch.apf.host_apf_flags = 0;
> local_irq_disable();
> @@ -4718,7 +4718,7 @@ static int kvm_tdp_map_page(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code,
> if (signal_pending(current))
> return -EINTR;
> cond_resched();
> - r = kvm_mmu_do_page_fault(vcpu, gpa, error_code, true, NULL, level);
> + r = kvm_mmu_do_page_fault(vcpu, gpa, error_code, true, NULL, level, NULL);
> } while (r == RET_PF_RETRY);
>
> if (r < 0)
> @@ -6073,7 +6073,7 @@ static int kvm_mmu_write_protect_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
> }
>
> int noinline kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, u64 error_code,
> - void *insn, int insn_len)
> + void *insn, int insn_len, bool *frozen_spte)
> {
> int r, emulation_type = EMULTYPE_PF;
> bool direct = vcpu->arch.mmu->root_role.direct;
> @@ -6109,7 +6109,7 @@ int noinline kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, u64 err
> vcpu->stat.pf_taken++;
>
> r = kvm_mmu_do_page_fault(vcpu, cr2_or_gpa, error_code, false,
> - &emulation_type, NULL);
> + &emulation_type, NULL, frozen_spte);
> if (KVM_BUG_ON(r == RET_PF_INVALID, vcpu->kvm))
> return -EIO;
> }
> @@ -6129,7 +6129,7 @@ int noinline kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, u64 err
> vcpu->stat.pf_spurious++;
>
> if (r != RET_PF_EMULATE)
> - return 1;
> + return r;
>
> emulate:
> return x86_emulate_instruction(vcpu, cr2_or_gpa, emulation_type, insn,
> diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
> index 8d3fb3c8c213..5b1fc77695c1 100644
> --- a/arch/x86/kvm/mmu/mmu_internal.h
> +++ b/arch/x86/kvm/mmu/mmu_internal.h
> @@ -247,6 +247,9 @@ struct kvm_page_fault {
> * is changing its own translation in the guest page tables.
> */
> bool write_fault_to_shadow_pgtable;
> +
> + /* Indicates the page fault needs to be retried due to a frozen SPTE. */
> + bool frozen_spte;
> };
>
> int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault);
> @@ -256,12 +259,12 @@ int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault);
> * and of course kvm_mmu_do_page_fault().
> *
> * RET_PF_CONTINUE: So far, so good, keep handling the page fault.
> + * RET_PF_FIXED: The faulting entry has been fixed.
> * RET_PF_RETRY: let CPU fault again on the address.
> * RET_PF_EMULATE: mmio page fault, emulate the instruction directly.
> * RET_PF_WRITE_PROTECTED: the gfn is write-protected, either unprotected the
> * gfn and retry, or emulate the instruction directly.
> * RET_PF_INVALID: the spte is invalid, let the real page fault path update it.
> - * RET_PF_FIXED: The faulting entry has been fixed.
> * RET_PF_SPURIOUS: The faulting entry was already fixed, e.g. by another vCPU.
> *
> * Any names added to this enum should be exported to userspace for use in
> @@ -271,14 +274,17 @@ int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault);
> * on -errno return values. Somewhat arbitrarily use '0' for CONTINUE, which
> * will allow for efficient machine code when checking for CONTINUE, e.g.
> * "TEST %rax, %rax, JNZ", as all "stop!" values are non-zero.
> + *
> + * Note #2, RET_PF_FIXED _must_ be '1', so that KVM's -errno/0/1 return code
> + * scheme, where 1==success, translates '1' to RET_PF_FIXED.
> */
> enum {
> RET_PF_CONTINUE = 0,
> + RET_PF_FIXED = 1,
> RET_PF_RETRY,
> RET_PF_EMULATE,
> RET_PF_WRITE_PROTECTED,
> RET_PF_INVALID,
> - RET_PF_FIXED,
> RET_PF_SPURIOUS,
> };
>
> @@ -292,7 +298,8 @@ static inline void kvm_mmu_prepare_memory_fault_exit(struct kvm_vcpu *vcpu,
>
> static inline int kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
> u64 err, bool prefetch,
> - int *emulation_type, u8 *level)
> + int *emulation_type, u8 *level,
> + bool *frozen_spte)
> {
> struct kvm_page_fault fault = {
> .addr = cr2_or_gpa,
> @@ -341,6 +348,8 @@ static inline int kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
> *emulation_type |= EMULTYPE_WRITE_PF_TO_SP;
> if (level)
> *level = fault.goal_level;
> + if (frozen_spte)
> + *frozen_spte = fault.frozen_spte;
>
> return r;
> }
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index 5a475a6456d4..e7fc5ea4b437 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -1174,6 +1174,7 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
>
> retry:
> rcu_read_unlock();
> + fault->frozen_spte = is_frozen_spte(iter.old_spte);
> return ret;
> }
>
> diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
> index 38723b0c435d..269de6a9eb13 100644
> --- a/arch/x86/kvm/svm/svm.c
> +++ b/arch/x86/kvm/svm/svm.c
> @@ -2075,7 +2075,7 @@ static int npf_interception(struct kvm_vcpu *vcpu)
> rc = kvm_mmu_page_fault(vcpu, fault_address, error_code,
> static_cpu_has(X86_FEATURE_DECODEASSISTS) ?
> svm->vmcb->control.insn_bytes : NULL,
> - svm->vmcb->control.insn_len);
> + svm->vmcb->control.insn_len, NULL);
>
> if (rc > 0 && error_code & PFERR_GUEST_RMP_MASK)
> sev_handle_rmp_fault(vcpu, fault_address, error_code);
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index 368acfebd476..fc2ff5d91a71 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -5822,7 +5822,7 @@ static int handle_ept_violation(struct kvm_vcpu *vcpu)
> if (unlikely(allow_smaller_maxphyaddr && !kvm_vcpu_is_legal_gpa(vcpu, gpa)))
> return kvm_emulate_instruction(vcpu, 0);
>
> - return kvm_mmu_page_fault(vcpu, gpa, error_code, NULL, 0);
> + return kvm_mmu_page_fault(vcpu, gpa, error_code, NULL, 0, NULL);
> }
>
> static int handle_ept_misconfig(struct kvm_vcpu *vcpu)
> @@ -5843,7 +5843,7 @@ static int handle_ept_misconfig(struct kvm_vcpu *vcpu)
> return kvm_skip_emulated_instruction(vcpu);
> }
>
> - return kvm_mmu_page_fault(vcpu, gpa, PFERR_RSVD_MASK, NULL, 0);
> + return kvm_mmu_page_fault(vcpu, gpa, PFERR_RSVD_MASK, NULL, 0, NULL);
> }
>
> static int handle_nmi_window(struct kvm_vcpu *vcpu)
>
> base-commit: bc87a2b4b5508d247ed2c30cd2829969d168adfe
> --
>