Re: [RFC PATCH v5 037/104] KVM: x86/mmu: Allow non-zero init value for shadow PTE
From: Kai Huang
Date: Fri Apr 01 2022 - 01:13:13 EST
On Fri, 2022-03-04 at 11:48 -0800, isaku.yamahata@xxxxxxxxx wrote:
> From: Sean Christopherson <sean.j.christopherson@xxxxxxxxx>
>
> TDX will run with EPT violation #VEs enabled for shared EPT, which means
> KVM needs to set the "suppress #VE" bit in unused PTEs to avoid
> unintentionally reflecting not-present EPT violations into the guest.
This sentence is hard to interpret. Please add a few sentences to elaborate on
"TDX will run with EPT violation #VEs enabled for shared EPT". Also, this patch
is the first one to mention "shared EPT", so perhaps you should explain it here
as well. Or you could even move patch 43 ("KVM: TDX: Add load_mmu_pgd method for
TDX") before this one.
"reflecting not-present EPT violations into the guest" could also be hard to
interpret. Perhaps you can say explicitly that the VMM wants to get an EPT
violation for normal (shared) memory access rather than causing a #VE in the
guest. Mentioning that you want an EPT violation instead of a #VE for normal
(shared) memory access also complements your statement below about wanting #VE
for MMIO, so that people get a clear picture of when a #VE is delivered and
when it is not.
>
> Because guest memory is protected with TDX, VMM can't parse instructions
> in the guest memory. Instead, MMIO hypercall is used to pass necessary
> information to VMM.
>
> To make unmodified device driver work, guest TD expects #VE on accessing
> shared GPA. The #VE handler converts MMIO access into MMIO hypercall with
> the EPT entry of enabled "#VE" by clearing "suppress #VE" bit. Before VMM
> enabling #VE, it needs to figure out the given GPA is for MMIO by EPT
> violation. So the execution flow looks like
>
> - allocate unused shared EPT entry with suppress #VE bit set.
allocate -> Allocate
> - EPT violation on that GPA.
> - VMM figures out the faulted GPA is for MMIO.
> - VMM clears the suppress #VE bit.
> - Guest TD gets #VE, and converts MMIO access into MMIO hypercall.
Now that you have described both normal memory access and MMIO, this is a good
place to summarize the purpose of this patch: in both cases you want the PTE to
have the "suppress #VE" bit set initially when it is allocated, and therefore
you allow a non-zero init value for PTEs.
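To make the summary concrete, here is a rough, untested userspace model of the
flow as I read it from the commit message. It is only a sketch: init_value,
access_not_present() and handle_ept_violation() are names I made up for
illustration, not anything in KVM, and it assumes EPT-violation #VE is enabled
for the shared EPT.

```c
#include <stdbool.h>
#include <stdint.h>

/* EPT paging-structure entries use bit 63 as the "suppress #VE" bit. */
#define EPT_SUPPRESS_VE (1ULL << 63)

/* Hypothetical stand-in for shadow_init_value in the shared-EPT case. */
static const uint64_t init_value = EPT_SUPPRESS_VE;

enum fault_result { FAULT_EPT_VIOLATION, FAULT_VE };

/*
 * What a guest access to a not-present entry results in, assuming
 * EPT-violation #VE is enabled: an EPT violation (VM exit to the VMM)
 * if "suppress #VE" is set, otherwise a #VE delivered to the guest.
 */
static enum fault_result access_not_present(uint64_t pte)
{
	return (pte & EPT_SUPPRESS_VE) ? FAULT_EPT_VIOLATION : FAULT_VE;
}

/*
 * VMM side, simplified: after the first EPT violation the VMM clears
 * "suppress #VE" only if the GPA turns out to be MMIO.  For normal
 * memory the real code would install a mapping instead; this model
 * just leaves the entry alone.
 */
static uint64_t handle_ept_violation(uint64_t pte, bool gpa_is_mmio)
{
	if (gpa_is_mmio)
		return pte & ~EPT_SUPPRESS_VE;
	return pte;
}
```

The point of the model is that both paths start from the same non-zero value,
which is why an all-zero entry (which would give the guest a #VE right away)
is not a usable initial state here.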
>
> Signed-off-by: Sean Christopherson <sean.j.christopherson@xxxxxxxxx>
> Signed-off-by: Isaku Yamahata <isaku.yamahata@xxxxxxxxx>
> ---
> arch/x86/kvm/mmu.h | 1 +
> arch/x86/kvm/mmu/mmu.c | 50 +++++++++++++++++++++++++++++++++++------
> arch/x86/kvm/mmu/spte.c | 10 +++++++++
> arch/x86/kvm/mmu/spte.h | 2 ++
> 4 files changed, 56 insertions(+), 7 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
> index 3fb530359f81..0ae91b8b25df 100644
> --- a/arch/x86/kvm/mmu.h
> +++ b/arch/x86/kvm/mmu.h
> @@ -66,6 +66,7 @@ static __always_inline u64 rsvd_bits(int s, int e)
>
> void kvm_mmu_set_mmio_spte_mask(u64 mmio_value, u64 mmio_mask, u64 access_mask);
> void kvm_mmu_set_ept_masks(bool has_ad_bits, bool has_exec_only);
> +void kvm_mmu_set_spte_init_value(u64 init_value);
>
> void kvm_init_mmu(struct kvm_vcpu *vcpu);
> void kvm_init_shadow_npt_mmu(struct kvm_vcpu *vcpu, unsigned long cr0,
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 9907cb759fd1..a474f2e76d78 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -617,9 +617,9 @@ static int mmu_spte_clear_track_bits(struct kvm *kvm, u64 *sptep)
> int level = sptep_to_sp(sptep)->role.level;
>
> if (!spte_has_volatile_bits(old_spte))
> - __update_clear_spte_fast(sptep, 0ull);
> + __update_clear_spte_fast(sptep, shadow_init_value);
> else
> - old_spte = __update_clear_spte_slow(sptep, 0ull);
> + old_spte = __update_clear_spte_slow(sptep, shadow_init_value);
I think it's better to have a comment here. Allowing a non-zero init value for
shadow PTEs doesn't necessarily mean the initial value should be used when a
PTE is zapped. I think mmu_spte_clear_track_bits() is only called for mappings
of normal (shared) memory and not for MMIO? If so, perhaps add a comment
explaining that we want "suppress #VE" set so that a guest access to normal
memory causes a real EPT violation?
>
> if (!is_shadow_present_pte(old_spte))
> return old_spte;
> @@ -651,7 +651,7 @@ static int mmu_spte_clear_track_bits(struct kvm *kvm, u64 *sptep)
> */
> static void mmu_spte_clear_no_track(u64 *sptep)
> {
> - __update_clear_spte_fast(sptep, 0ull);
> + __update_clear_spte_fast(sptep, shadow_init_value);
> }
Similar here. mmu_spte_clear_no_track() seems to be used to zap non-leaf PTEs,
which don't require state tracking, so theoretically they could be set to 0.
But it also seems to be called to zap MMIO PTEs, so it looks like it needs to
use shadow_init_value. Either way, this looks like it deserves a comment.
Btw, the above two changes to mmu_spte_clear_track_bits() and
mmu_spte_clear_no_track() seem a little out of scope for what this patch
claims to do. Allowing a non-zero init value for shadow PTEs doesn't
necessarily mean the initial value should also be used when a PTE is zapped.
Maybe we can improve the patch title and commit message a bit, for example:
"Allow non-zero value for empty (or invalid?) PTE". "Non-present" doesn't seem
to fit here.
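To illustrate what I mean by an "empty" value (as opposed to an "init" value),
here is a small untested userspace sketch, including the kind of comment I would
like to see on the zap path. empty_pte_value, init_pt_page(), zap_pte() and
pte_is_empty() are hypothetical names for illustration only; the page fill is a
plain loop standing in for the rep stosq in the patch.

```c
#include <stddef.h>
#include <stdint.h>

#define PTES_PER_PAGE 512 /* 4096 / sizeof(uint64_t) */

/* Hypothetical stand-in for shadow_init_value ("suppress #VE", bit 63). */
static uint64_t empty_pte_value = 1ULL << 63;

/* Fill a freshly allocated page-table page with the "empty" value. */
static void init_pt_page(uint64_t *page)
{
	for (size_t i = 0; i < PTES_PER_PAGE; i++)
		page[i] = empty_pte_value;
}

/*
 * Zap one entry.  Write the "empty" value rather than 0 so that a
 * zapped entry looks exactly like a never-used one: "suppress #VE"
 * stays set and the next guest access causes an EPT violation
 * instead of a #VE.
 */
static void zap_pte(uint64_t *ptep)
{
	*ptep = empty_pte_value;
}

/* An entry is empty if it holds the "empty" value, not if it is 0. */
static int pte_is_empty(uint64_t pte)
{
	return pte == empty_pte_value;
}
```

With that framing, a single value describes both a freshly allocated entry and
a zapped one, which is what the changes to the two clear helpers effectively do.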
>
> static u64 mmu_spte_get_lockless(u64 *sptep)
> @@ -737,6 +737,42 @@ static void walk_shadow_page_lockless_end(struct kvm_vcpu *vcpu)
> }
> }
>
> +static inline void kvm_init_shadow_page(void *page)
> +{
> +#ifdef CONFIG_X86_64
> + int ign;
> +
> + asm volatile (
> + "rep stosq\n\t"
> + : "=c"(ign), "=D"(page)
> + : "a"(shadow_init_value), "c"(4096/8), "D"(page)
> + : "memory"
> + );
> +#else
> + BUG();
> +#endif
> +}
> +
> +static int mmu_topup_shadow_page_cache(struct kvm_vcpu *vcpu)
> +{
> + struct kvm_mmu_memory_cache *mc = &vcpu->arch.mmu_shadow_page_cache;
> + int start, end, i, r;
> +
> + if (shadow_init_value)
> + start = kvm_mmu_memory_cache_nr_free_objects(mc);
> +
> + r = kvm_mmu_topup_memory_cache(mc, PT64_ROOT_MAX_LEVEL);
> + if (r)
> + return r;
> +
> + if (shadow_init_value) {
> + end = kvm_mmu_memory_cache_nr_free_objects(mc);
> + for (i = start; i < end; i++)
> + kvm_init_shadow_page(mc->objects[i]);
> + }
> + return 0;
> +}
> +
> static int mmu_topup_memory_caches(struct kvm_vcpu *vcpu, bool maybe_indirect)
> {
> int r;
> @@ -746,8 +782,7 @@ static int mmu_topup_memory_caches(struct kvm_vcpu *vcpu, bool maybe_indirect)
> 1 + PT64_ROOT_MAX_LEVEL + PTE_PREFETCH_NUM);
> if (r)
> return r;
> - r = kvm_mmu_topup_memory_cache(&vcpu->arch.mmu_shadow_page_cache,
> - PT64_ROOT_MAX_LEVEL);
> + r = mmu_topup_shadow_page_cache(vcpu);
> if (r)
> return r;
> if (maybe_indirect) {
> @@ -3146,7 +3181,7 @@ static int fast_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
> {
> struct kvm_mmu_page *sp;
> int ret = RET_PF_INVALID;
> - u64 spte = 0ull;
> + u64 spte = shadow_init_value;
I don't quite understand this change. 'spte' is set to the last-level PTE of
the given GFN if a mapping is found; otherwise fast_page_fault() returns
RET_PF_INVALID. In both cases the initial value doesn't matter.
Am I wrong?
> u64 *sptep = NULL;
> uint retry_count = 0;
>
> @@ -5598,7 +5633,8 @@ int kvm_mmu_create(struct kvm_vcpu *vcpu)
> vcpu->arch.mmu_page_header_cache.kmem_cache = mmu_page_header_cache;
> vcpu->arch.mmu_page_header_cache.gfp_zero = __GFP_ZERO;
>
> - vcpu->arch.mmu_shadow_page_cache.gfp_zero = __GFP_ZERO;
> + if (!shadow_init_value)
> + vcpu->arch.mmu_shadow_page_cache.gfp_zero = __GFP_ZERO;
>
> vcpu->arch.mmu = &vcpu->arch.root_mmu;
> vcpu->arch.walk_mmu = &vcpu->arch.root_mmu;
> diff --git a/arch/x86/kvm/mmu/spte.c b/arch/x86/kvm/mmu/spte.c
> index 73cfe62fdad1..5071e8332db2 100644
> --- a/arch/x86/kvm/mmu/spte.c
> +++ b/arch/x86/kvm/mmu/spte.c
> @@ -35,6 +35,7 @@ u64 __read_mostly shadow_mmio_access_mask;
> u64 __read_mostly shadow_present_mask;
> u64 __read_mostly shadow_me_mask;
> u64 __read_mostly shadow_acc_track_mask;
> +u64 __read_mostly shadow_init_value;
>
> u64 __read_mostly shadow_nonpresent_or_rsvd_mask;
> u64 __read_mostly shadow_nonpresent_or_rsvd_lower_gfn_mask;
> @@ -223,6 +224,14 @@ u64 kvm_mmu_changed_pte_notifier_make_spte(u64 old_spte, kvm_pfn_t new_pfn)
> return new_spte;
> }
>
> +void kvm_mmu_set_spte_init_value(u64 init_value)
> +{
> + if (WARN_ON(!IS_ENABLED(CONFIG_X86_64) && init_value))
> + init_value = 0;
> + shadow_init_value = init_value;
> +}
> +EXPORT_SYMBOL_GPL(kvm_mmu_set_spte_init_value);
> +
> static u8 kvm_get_shadow_phys_bits(void)
> {
> /*
> @@ -367,6 +376,7 @@ void kvm_mmu_reset_all_pte_masks(void)
> shadow_present_mask = PT_PRESENT_MASK;
> shadow_acc_track_mask = 0;
> shadow_me_mask = sme_me_mask;
> + shadow_init_value = 0;
>
> shadow_host_writable_mask = DEFAULT_SPTE_HOST_WRITEABLE;
> shadow_mmu_writable_mask = DEFAULT_SPTE_MMU_WRITEABLE;
> diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h
> index be6a007a4af3..8e13a35ab8c9 100644
> --- a/arch/x86/kvm/mmu/spte.h
> +++ b/arch/x86/kvm/mmu/spte.h
> @@ -171,6 +171,8 @@ extern u64 __read_mostly shadow_mmio_access_mask;
> extern u64 __read_mostly shadow_present_mask;
> extern u64 __read_mostly shadow_me_mask;
>
> +extern u64 __read_mostly shadow_init_value;
> +
> /*
> * SPTEs in MMUs without A/D bits are marked with SPTE_TDP_AD_DISABLED_MASK;
> * shadow_acc_track_mask is the set of bits to be cleared in non-accessed