Re: [PATCH v19 070/130] KVM: TDX: TDP MMU TDX support

From: Isaku Yamahata
Date: Wed Apr 03 2024 - 14:35:50 EST


On Tue, Apr 02, 2024 at 05:13:23PM +0800,
Binbin Wu <binbin.wu@xxxxxxxxxxxxxxx> wrote:

>
>
> On 2/26/2024 4:26 PM, isaku.yamahata@xxxxxxxxx wrote:
> > From: Isaku Yamahata <isaku.yamahata@xxxxxxxxx>
> >
> > Implement the TDP MMU hooks for the TDX backend: TLB flush, TLB
> > shootdown, propagation of private EPT entry changes to the Secure EPT,
> > and freeing of Secure EPT pages. The TLB flush hook handles both shared
> > and private EPT; it flushes the shared EPT the same way as VMX and also
> > waits for the TDX TLB shootdown. The hook to free a Secure EPT page
> > unlinks the page from the Secure EPT tree so that it can be freed back
> > to the OS.
> >
> > Propagate entry changes to the Secure EPT. The possible changes are
> > present -> non-present (zapping) and non-present -> present (population).
> > On population, just link the Secure EPT page or the private guest page
> > into the Secure EPT with a TDX SEAMCALL. Because the TDP MMU allows
> > concurrent zapping/population, zapping requires a synchronous TLB
> > shootdown with the EPT entry frozen: zap the Secure EPT entry, increment
> > the TLB counter, send an IPI to remote vcpus to trigger a TLB flush, and
> > then unlink the private guest page from the Secure EPT. For simplicity,
> > batched zapping under the exclusive lock is handled as concurrent
> > zapping. Although inefficient, it can be optimized in the future.
> >
> > For an MMIO SPTE, the SPTE value changes as follows.
> > initial value (suppress-VE bit is set)
> > -> Guest issues MMIO and triggers an EPT violation
> > -> KVM updates the SPTE to the MMIO value (suppress-VE bit is cleared)
> > -> Guest retries the MMIO access. It triggers a #VE exception in the
> >    guest TD
> > -> Guest #VE handler issues TDG.VP.VMCALL<MMIO>
> > -> KVM handles the MMIO
> > -> Guest #VE handler resumes execution after the MMIO instruction
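
For readers following the flow above, a minimal sketch of the two SPTE
values involved. VMX_EPT_SUPPRESS_VE_BIT mirrors the kernel's definition
(EPT SPTE bit 63); the helper names are illustrative, not taken from this
patch:

/* Sketch only: the suppress-VE transitions described above. */
#include <linux/bits.h>
#include <linux/types.h>

#define VMX_EPT_SUPPRESS_VE_BIT        BIT_ULL(63)

/*
 * Initial non-present SPTE: suppress-VE is set so that a guest access
 * causes an EPT violation exiting to KVM, not a #VE in the guest TD.
 */
static inline u64 tdx_init_spte(u64 nonpresent_value)
{
        return nonpresent_value | VMX_EPT_SUPPRESS_VE_BIT;
}

/*
 * MMIO SPTE: suppress-VE is cleared so that the guest's retried access
 * raises #VE inside the TD, whose handler issues TDG.VP.VMCALL<MMIO>.
 */
static inline u64 tdx_mmio_spte(u64 mmio_value)
{
        return mmio_value & ~VMX_EPT_SUPPRESS_VE_BIT;
}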
> >
> > Signed-off-by: Isaku Yamahata <isaku.yamahata@xxxxxxxxx>
> >
> > ---
> > v19:
> > - Fix compilation when CONFIG_HYPERV != y. This is needed to catch up
> > with the following patch:
> > https://lore.kernel.org/all/20231018192325.1893896-1-seanjc@xxxxxxxxxx/
> > - Add comments on TLB shootdown to explain the sequence.
> > - Use gmem_max_level callback, delete tdp_max_page_level.
> >
> > v18:
> > - rename tdx_sept_page_aug() -> tdx_mem_page_aug()
> > - checkpatch: space => tab
> >
> > v15 -> v16:
> > - Add handling of the TD_ATTR_SEPT_VE_DISABLE case.
> >
> > v14 -> v15:
> > - Implemented tdx_flush_tlb_current()
> > - Removed unnecessary invept in tdx_flush_tlb(). It was carried over
> > from the very old code base.
> >
> > Signed-off-by: Isaku Yamahata <isaku.yamahata@xxxxxxxxx>
> > ---
> > arch/x86/kvm/mmu/spte.c | 3 +-
> > arch/x86/kvm/vmx/main.c | 91 ++++++++-
> > arch/x86/kvm/vmx/tdx.c | 372 +++++++++++++++++++++++++++++++++++++
> > arch/x86/kvm/vmx/tdx.h | 2 +-
> > arch/x86/kvm/vmx/tdx_ops.h | 6 +
> > arch/x86/kvm/vmx/x86_ops.h | 13 ++
> > 6 files changed, 481 insertions(+), 6 deletions(-)
> >
> [...]
>
> > +static int tdx_sept_zap_private_spte(struct kvm *kvm, gfn_t gfn,
> > + enum pg_level level)
> > +{
> > + int tdx_level = pg_level_to_tdx_sept_level(level);
> > + struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
> > + gpa_t gpa = gfn_to_gpa(gfn) & KVM_HPAGE_MASK(level);
> > + struct tdx_module_args out;
> > + u64 err;
> > +
> > + /* This can be called when destroying the guest TD after freeing the HKID. */
> > + if (unlikely(!is_hkid_assigned(kvm_tdx)))
> > + return 0;
> > +
> > + /* For now, large pages aren't supported yet. */
> > + WARN_ON_ONCE(level != PG_LEVEL_4K);
> > + err = tdh_mem_range_block(kvm_tdx->tdr_pa, gpa, tdx_level, &out);
> > + if (unlikely(err == TDX_ERROR_SEPT_BUSY))
> > + return -EAGAIN;
> > + if (KVM_BUG_ON(err, kvm)) {
> > + pr_tdx_error(TDH_MEM_RANGE_BLOCK, err, &out);
> > + return -EIO;
> > + }
> > + return 0;
> > +}
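
To connect the hooks quoted here with the shootdown below, the expected
call sequence is roughly the following (a hypothetical caller sketch;
remove_private_spte() stands in for the TDH.MEM.PAGE.REMOVE path that
isn't quoted above):

static int zap_private_page(struct kvm *kvm, gfn_t gfn, enum pg_level level)
{
        int ret;

        /* Freeze: TDH.MEM.RANGE.BLOCK blocks new TLB translations. */
        ret = tdx_sept_zap_private_spte(kvm, gfn, level);
        if (ret)
                return ret;

        /* Synchronous shootdown: TDH.MEM.TRACK() plus IPIs, see below. */
        tdx_track(kvm);

        /* Only now is it safe to unlink the private guest page. */
        return remove_private_spte(kvm, gfn, level);
}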
> > +
> > +/*
> > + * TLB shootdown procedure:
> > + * There is a global epoch counter and each vcpu has a local epoch counter.
> > + * - TDH.MEM.RANGE.BLOCK(TDR, level, range) on one vcpu
> > + *   This blocks subsequent creation of TLB translations for that range.
> > + *   This corresponds to clearing the present bits (all of RWX) in the
> > + *   EPT entry.
> > + * - TDH.MEM.TRACK(TDR): advances the global epoch counter.
> > + * - IPI to remote vcpus
> > + * - TDExit and re-entry with TDH.VP.ENTER on remote vcpus
> > + * - On re-entry, the TDX module compares the local epoch counter with the
> > + *   global epoch counter. If the local counter is older, it updates the
> > + *   local counter and flushes the TLB.
> > + */
> > +static void tdx_track(struct kvm *kvm)
> > +{
> > + struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
> > + u64 err;
> > +
> > + KVM_BUG_ON(!is_hkid_assigned(kvm_tdx), kvm);
> > + /* If the TD isn't finalized, no vcpu has run yet. */
> > + if (unlikely(!is_td_finalized(kvm_tdx)))
> > + return;
> > +
> > + /*
> > + * tdx_flush_tlb() waits, via the counter, for this function to issue
> > + * TDH.MEM.TRACK(). A counter is used instead of a bool because multiple
> > + * TDH.MEM.TRACK() calls can be issued concurrently by multiple vcpus.
>
> In which case will TDH_MEM_TRACK() be issued concurrently by multiple vcpus?
> For now, zapping holds the write lock.
> Promotion/demotion may issue TDH_MEM_TRACK() concurrently, but it's
> not supported yet.

You're right; large page support will use it. With the assumption that only a
single vcpu issues the TLB flush, the alternative is a boolean plus memory
barriers. I prefer keeping atomic_t and dropping this comment over switching
to a boolean plus memory barriers, because we would eventually move back to
atomic_t anyway.
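
For reference, the waiting side looks roughly like this under the current
atomic_t scheme (a sketch, not the literal diff):

static void tdx_flush_tlb(struct kvm_vcpu *vcpu)
{
        struct kvm_tdx *kvm_tdx = to_kvm_tdx(vcpu->kvm);

        /*
         * Runs as part of KVM_REQ_TLB_FLUSH processing after the IPI.
         * Spin until tdx_track() has issued TDH.MEM.TRACK() so that the
         * subsequent TDH.VP.ENTER is guaranteed to flush this vcpu's TLB.
         */
        while (atomic_read(&kvm_tdx->tdh_mem_track))
                cpu_relax();
}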


> > + *
> > + * Optimization: the TLB shootdown procedure described in the TDX
> > + * specification is: TDH.MEM.TRACK(), send IPIs to remote vcpus, confirm
> > + * that all remote vcpus have exited to the VMM, and then resume vcpu
> > + * execution, both local and remote. Reorder the sequence to reduce IPI
> > + * overhead as follows.
> > + *
> > + *    local                           remote
> > + *    -----                           ------
> > + *    increment tdh_mem_track
> > + *
> > + *    request KVM_REQ_TLB_FLUSH
> > + *    send IPI
> > + *
> > + *                                    TDEXIT to KVM due to IPI
> > + *
> > + *                                    IPI handler calls tdx_flush_tlb()
> > + *                                    to process KVM_REQ_TLB_FLUSH.
> > + *                                    spin wait for tdh_mem_track == 0
> > + *
> > + *    TDH.MEM.TRACK()
> > + *
> > + *    decrement tdh_mem_track
> > + *
> > + *                                    complete KVM_REQ_TLB_FLUSH
> > + *
> > + *    TDH.VP.ENTER to flush tlbs      TDH.VP.ENTER to flush tlbs
> > + */
> > + atomic_inc(&kvm_tdx->tdh_mem_track);
> > + /*
> > + * KVM_REQ_TLB_FLUSH waits for the empty IPI handler, ack_flush(), with
> > + * KVM_REQUEST_WAIT.
> > + */
> > + kvm_make_all_cpus_request(kvm, KVM_REQ_TLB_FLUSH);
> > +
> > + do {
> > + err = tdh_mem_track(kvm_tdx->tdr_pa);
> > + } while (unlikely((err & TDX_SEAMCALL_STATUS_MASK) == TDX_OPERAND_BUSY));
> > +
> > + /* Release remote vcpus waiting for TDH.MEM.TRACK() in tdx_flush_tlb(). */
> > + atomic_dec(&kvm_tdx->tdh_mem_track);
> > +
> > + if (KVM_BUG_ON(err, kvm))
> > + pr_tdx_error(TDH_MEM_TRACK, err, NULL);
> > +}
> > +
> > +static int tdx_sept_free_private_spt(struct kvm *kvm, gfn_t gfn,
> > + enum pg_level level, void *private_spt)
> > +{
> > + struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
> > +
> > + /*
> > + * The HKID assigned to this TD was already freed and the cache was
> > + * already flushed. We don't have to flush again.
> > + */
> > + if (!is_hkid_assigned(kvm_tdx))
> > + return tdx_reclaim_page(__pa(private_spt));
> > +
> > + /*
> > + * free_private_spt() is (obviously) called when a shadow page is being
> > + * zapped. KVM doesn't (yet) zap private SPs while the TD is active.
> > + * Note: this function is for private shadow pages, not for private
> > + * guest pages. A private guest page can be zapped while the TD is
> > + * active, e.g. for shared <-> private conversion and slot move/deletion.
> > + */
> > + KVM_BUG_ON(is_hkid_assigned(kvm_tdx), kvm);
>
> At this point, is_hkid_assigned(kvm_tdx) is always true.

Yes, will drop this KVM_BUG_ON().
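
With the KVM_BUG_ON() dropped, the function reduces to something like the
sketch below (returning -EINVAL on the should-not-happen path is my
assumption):

static int tdx_sept_free_private_spt(struct kvm *kvm, gfn_t gfn,
                                     enum pg_level level, void *private_spt)
{
        /*
         * The HKID was already freed and the cache already flushed;
         * the Secure EPT page can simply be reclaimed.
         */
        if (!is_hkid_assigned(to_kvm_tdx(kvm)))
                return tdx_reclaim_page(__pa(private_spt));

        /*
         * KVM doesn't (yet) zap private shadow pages while the TD is
         * active, so getting here with the HKID still assigned is a
         * caller bug.
         */
        return -EINVAL;
}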
--
Isaku Yamahata <isaku.yamahata@xxxxxxxxx>