Re: [PATCH 09/12] KVM: guest_memfd: move check for already-populated page to common code

From: Edgecombe, Rick P
Date: Mon Jul 15 2024 - 18:57:56 EST


On Mon, 2024-07-15 at 16:47 -0500, Michael Roth wrote:
> Makes sense.
>
> If we document mutual exclusion between ranges touched by
> gmem_populate() vs ranges touched by actual userspace issuance of
> KVM_PRE_FAULT_MEMORY (and make sure nothing too crazy happens if users
> don't abide by the documentation), then I think most problems go away...
>
> But there is still at least one awkward constraint for SNP:
> KVM_PRE_FAULT_MEMORY cannot be called for private GPA ranges until after
> SNP_LAUNCH_START is called. This is true even if the GPA range is not
> one of the ranges that will get passed to
> gmem_populate()/SNP_LAUNCH_UPDATE. The reason is that when binding the
> ASID to the SNP context as part of SNP_LAUNCH_START, firmware checks
> that the ASID is not already in use in the RMP table, and that check
> will fail if KVM_PRE_FAULT_MEMORY faulted in a private page before
> SNP_LAUNCH_START was called.
>
> So we already have the constraint that KVM_PRE_FAULT_MEMORY can't be
> issued before SNP_LAUNCH_START. That makes me wonder if we should
> broaden the constraint and, for now, disallow KVM_PRE_FAULT_MEMORY
> prior to finalizing a guest, since it'll be easier to lift that
> restriction later than to discover some other edge case and have to
> retroactively place restrictions.
>
> I've taken Isaku's original pre_fault_memory_test and added a new
> x86-specific coco_pre_fault_memory_test to try to better document and
> exercise these corner cases for SEV and SNP, but I'm hoping it could
> also be useful for TDX (hence the generic name). These are based on
> Pratik's initial SNP selftests (which are in turn based on kvm/queue +
> these patches):
>
>   https://github.com/mdroth/linux/commits/snp-uptodate0-kst/
>   https://github.com/mdroth/linux/commit/dd7d4b1983ceeb653132cfd54ad63f597db85f74
>
>

From the TDX side it would be nice not to have to worry about userspace
mucking around with the mirrored page tables in unexpected ways during that
special period. TDX already has its own "finalized" state in kvm_tdx; is
there something similar on the SEV side we could unify with?
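
Rough sketch of what a unified check could look like; totally untested, and
the pre_fault_allowed flag is made up here (vendor code would set it once the
guest is finalized, i.e. SNP_LAUNCH_START or later for SEV-SNP and
KVM_TDX_FINALIZE_VM for TDX):

long kvm_arch_vcpu_pre_fault_memory(struct kvm_vcpu *vcpu,
                                    struct kvm_pre_fault_memory *range)
{
        u64 error_code = PFERR_GUEST_FINAL_MASK;
        u8 level = PG_LEVEL_4K;
        u64 end;
        int r;

        /*
         * Hypothetical flag, doesn't exist today: refuse pre-faults until
         * the vendor code flips it at finalization time, so the "no
         * KVM_PRE_FAULT_MEMORY before SNP_LAUNCH_START" rule becomes
         * generic instead of each VM type growing its own variant.
         */
        if (!vcpu->kvm->arch.pre_fault_allowed)
                return -EOPNOTSUPP;

        r = kvm_tdp_map_page(vcpu, range->gpa, error_code, &level);
        if (r < 0)
                return r;

        /* The mapping installed at 'level' may extend past range->gpa. */
        end = (range->gpa & KVM_HPAGE_MASK(level)) + KVM_HPAGE_SIZE(level);
        return min(range->size, end - range->gpa);
}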

I looked at having TDX call kvm_tdp_map_page() directly instead of going
through kvm_arch_vcpu_pre_fault_memory(), so we could put whatever check we
settle on in kvm_arch_vcpu_pre_fault_memory() itself. It required a couple
of exports:

diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index 03737f3aaeeb..9004ac597a85 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -277,6 +277,7 @@ extern bool tdp_mmu_enabled;
 #endif

 int kvm_tdp_mmu_get_walk_mirror_pfn(struct kvm_vcpu *vcpu, u64 gpa, kvm_pfn_t *pfn);
+int kvm_tdp_map_page(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code, u8 *level);

 static inline bool kvm_memslots_have_rmaps(struct kvm *kvm)
 {
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 7bb6b17b455f..4a3e471ec9fe 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4721,8 +4721,7 @@ int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
         return direct_page_fault(vcpu, fault);
 }

-static int kvm_tdp_map_page(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code,
-                            u8 *level)
+int kvm_tdp_map_page(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code, u8 *level)
 {
         int r;

@@ -4759,6 +4758,7 @@ static int kvm_tdp_map_page(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code,
                 return -EIO;
         }
 }
+EXPORT_SYMBOL_GPL(kvm_tdp_map_page);

 long kvm_arch_vcpu_pre_fault_memory(struct kvm_vcpu *vcpu,
                                     struct kvm_pre_fault_memory *range)
@@ -5770,6 +5770,7 @@ int kvm_mmu_load(struct kvm_vcpu *vcpu)
 out:
         return r;
 }
+EXPORT_SYMBOL_GPL(kvm_mmu_load);

 void kvm_mmu_unload(struct kvm_vcpu *vcpu)
 {
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 9ac0821eb44b..7161ef68f3da 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -2809,11 +2809,13 @@ struct tdx_gmem_post_populate_arg {
 static int tdx_gmem_post_populate(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn,
                                   void __user *src, int order, void *_arg)
 {
+        u64 error_code = PFERR_GUEST_FINAL_MASK | PFERR_PRIVATE_ACCESS;
         struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
         struct tdx_gmem_post_populate_arg *arg = _arg;
         struct kvm_vcpu *vcpu = arg->vcpu;
         struct kvm_memory_slot *slot;
         gpa_t gpa = gfn_to_gpa(gfn);
+        u8 level = PG_LEVEL_4K;
         struct page *page;
         kvm_pfn_t mmu_pfn;
         int ret, i;
@@ -2832,6 +2834,10 @@ static int tdx_gmem_post_populate(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn,
                 goto out_put_page;
         }

+        ret = kvm_tdp_map_page(vcpu, gpa, error_code, &level);
+        if (ret < 0)
+                goto out_put_page;
+
         read_lock(&kvm->mmu_lock);

         ret = kvm_tdp_mmu_get_walk_mirror_pfn(vcpu, gpa, &mmu_pfn);
@@ -2910,6 +2916,7 @@ static int tdx_vcpu_init_mem_region(struct kvm_vcpu *vcpu, struct kvm_tdx_cmd *cmd)
         mutex_lock(&kvm->slots_lock);
         idx = srcu_read_lock(&kvm->srcu);

+        kvm_mmu_reload(vcpu);
         ret = 0;
         while (region.nr_pages) {
                 if (signal_pending(current)) {
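
For reference, the kvm_mmu_reload() above is needed because
tdx_vcpu_init_mem_region() is reached from a vCPU ioctl rather than from
inside KVM_RUN, so nothing has loaded the MMU root yet, and the fault loop in
kvm_tdp_map_page() assumes a valid root. Condensed from mmu.c (from memory,
the RET_PF_* to errno mapping is the tail shown in the hunk above):

int kvm_tdp_map_page(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code, u8 *level)
{
        int r;

        /* The fault path below is indexed by GPA, i.e. TDP-only. */
        if (vcpu->arch.mmu->page_fault != kvm_tdp_page_fault)
                return -EOPNOTSUPP;

        do {
                if (signal_pending(current))
                        return -EINTR;
                cond_resched();
                r = kvm_mmu_do_page_fault(vcpu, gpa, error_code, true, NULL, level);
        } while (r == RET_PF_RETRY);

        /* ... map RET_PF_* to 0/-ENOENT/-EIO as in the hunk above ... */
}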