Re: [PATCH 09/12] KVM: guest_memfd: move check for already-populated page to common code

From: Michael Roth
Date: Mon Jul 15 2024 - 20:15:44 EST


On Mon, Jul 15, 2024 at 10:57:35PM +0000, Edgecombe, Rick P wrote:
> On Mon, 2024-07-15 at 16:47 -0500, Michael Roth wrote:
> > Makes sense.
> >
> > If we document mutual exclusion between ranges touched by
> > gmem_populate() vs ranges touched by actual userspace issuance of
> > KVM_PRE_FAULT_MEMORY (and make sure nothing too crazy happens if users
> > don't abide by the documentation), then I think most problems go away...
> >
> > But there is still at least one awkward constraint for SNP:
> > KVM_PRE_FAULT_MEMORY cannot be called for private GPA ranges until after
> > SNP_LAUNCH_START is called. This is true even if the GPA range is not
> > one of the ranges that will get passed to
> > gmem_populate()/SNP_LAUNCH_UPDATE. The reason for this is that when
> > binding the ASID to the SNP context as part of SNP_LAUNCH_START, firmware
> > will perform checks to make sure that ASID is not already being used in
> > the RMP table, and that check will fail if KVM_PRE_FAULT_MEMORY triggered
> > for a private page before calling SNP_LAUNCH_START.
> >
> > So we have this constraint that KVM_PRE_FAULT_MEMORY can't be issued
> > before SNP_LAUNCH_START. So it makes me wonder if we should just broaden
> > that constraint and for now just disallow KVM_PRE_FAULT_MEMORY prior to
> > finalizing a guest, since it'll be easier to lift that restriction later
> > versus discovering some other sort of edge case and need to
> > retroactively place restrictions.
> >
> > I've taken Isaku's original pre_fault_memory_test and added a new
> > x86-specific coco_pre_fault_memory_test to try to better document and
> > exercise these corner cases for SEV and SNP, but I'm hoping it could
> > also be useful for TDX (hence the generic name). These are based on
> > Pratik's initial SNP selftests (which are in turn based on kvm/queue +
> > these patches):
> >
> >   https://github.com/mdroth/linux/commits/snp-uptodate0-kst/
> >  
> > https://github.com/mdroth/linux/commit/dd7d4b1983ceeb653132cfd54ad63f597db85f74
> >
> >
>
> From the TDX side it wouldn't be horrible to not have to worry about userspace
> mucking around with the mirrored page tables in unexpected ways during the
> special period. TDX already has its own "finalized" state in kvm_tdx, is there
> something similar on the SEV side we could unify with?

Unfortunately there isn't currently anything in place like that for SNP,
but if we had a common 'finalized' field somewhere it could easily be set in
snp_launch_finish() as well.

>
> I looked at moving from kvm_arch_vcpu_pre_fault_memory() to directly calling
> kvm_tdp_map_page(), so we could potentially put whatever check in
> kvm_arch_vcpu_pre_fault_memory(). It required a couple exports:

Thanks. I'll give this a spin for SNP as well.

-Mike

>
> diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
> index 03737f3aaeeb..9004ac597a85 100644
> --- a/arch/x86/kvm/mmu.h
> +++ b/arch/x86/kvm/mmu.h
> @@ -277,6 +277,7 @@ extern bool tdp_mmu_enabled;
> #endif
>
> int kvm_tdp_mmu_get_walk_mirror_pfn(struct kvm_vcpu *vcpu, u64 gpa, kvm_pfn_t
> *pfn);
> +int kvm_tdp_map_page(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code, u8
> *level);
>
> static inline bool kvm_memslots_have_rmaps(struct kvm *kvm)
> {
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 7bb6b17b455f..4a3e471ec9fe 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -4721,8 +4721,7 @@ int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, struct
> kvm_page_fault *fault)
> return direct_page_fault(vcpu, fault);
> }
>
> -static int kvm_tdp_map_page(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code,
> - u8 *level)
> +int kvm_tdp_map_page(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code, u8
> *level)
> {
> int r;
>
> @@ -4759,6 +4758,7 @@ static int kvm_tdp_map_page(struct kvm_vcpu *vcpu, gpa_t
> gpa, u64 error_code,
> return -EIO;
> }
> }
> +EXPORT_SYMBOL_GPL(kvm_tdp_map_page);
>
> long kvm_arch_vcpu_pre_fault_memory(struct kvm_vcpu *vcpu,
> struct kvm_pre_fault_memory *range)
> @@ -5770,6 +5770,7 @@ int kvm_mmu_load(struct kvm_vcpu *vcpu)
> out:
> return r;
> }
> +EXPORT_SYMBOL_GPL(kvm_mmu_load);
>
> void kvm_mmu_unload(struct kvm_vcpu *vcpu)
> {
> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> index 9ac0821eb44b..7161ef68f3da 100644
> --- a/arch/x86/kvm/vmx/tdx.c
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -2809,11 +2809,13 @@ struct tdx_gmem_post_populate_arg {
> static int tdx_gmem_post_populate(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn,
> void __user *src, int order, void *_arg)
> {
> + u64 error_code = PFERR_GUEST_FINAL_MASK | PFERR_PRIVATE_ACCESS;
> struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
> struct tdx_gmem_post_populate_arg *arg = _arg;
> struct kvm_vcpu *vcpu = arg->vcpu;
> struct kvm_memory_slot *slot;
> gpa_t gpa = gfn_to_gpa(gfn);
> + u8 level = PG_LEVEL_4K;
> struct page *page;
> kvm_pfn_t mmu_pfn;
> int ret, i;
> @@ -2832,6 +2834,10 @@ static int tdx_gmem_post_populate(struct kvm *kvm, gfn_t
> gfn, kvm_pfn_t pfn,
> goto out_put_page;
> }
>
> + ret = kvm_tdp_map_page(vcpu, gpa, error_code, &level);
> + if (ret < 0)
> + goto out_put_page;
> +
> read_lock(&kvm->mmu_lock);
>
> ret = kvm_tdp_mmu_get_walk_mirror_pfn(vcpu, gpa, &mmu_pfn);
> @@ -2910,6 +2916,7 @@ static int tdx_vcpu_init_mem_region(struct kvm_vcpu *vcpu,
> struct kvm_tdx_cmd *c
> mutex_lock(&kvm->slots_lock);
> idx = srcu_read_lock(&kvm->srcu);
>
> + kvm_mmu_reload(vcpu);
> ret = 0;
> while (region.nr_pages) {
> if (signal_pending(current)) {
>