Re: [PATCH 09/12] KVM: guest_memfd: move check for already-populated page to common code
From: Michael Roth
Date: Mon Jul 15 2024 - 17:48:34 EST
On Mon, Jul 15, 2024 at 06:08:51PM +0200, Paolo Bonzini wrote:
> On Sun, Jul 14, 2024 at 7:33 AM Michael Roth <michael.roth@xxxxxxx> wrote:
> > > I guess this series is trying to help userspace not mess up the order of things
> > > for SEV, where as TDX's design was to let userspace hold the pieces from the
> > > beginning. As in, needing to match up the KVM_PRE_FAULT_MEMORY and
> > > KVM_TDX_INIT_MEM_REGION calls, mysteriously return errors in later IOCTLs if
> > > something was missed, etc.
> >
> > If SNP were to try to call KVM_PRE_FAULT_MEMORY before SNP_LAUNCH_UPDATE
> > (rough equivalent to KVM_TDX_INIT_MEM_REGION), I think the same issue
> > would arise, and in that case the uptodate flag you prototyped would
> > wouldn't be enough to address it because SNP_LAUNCH_UPDATE would end up
> > failing because the gmem_prepare hook previously triggered by
> > KVM_PRE_FAULT_MEMORY would have put the corresponding RMP entries into
> > an unexpected state (guest-owned/private).
>
> Indeed, and I'd love for that to be the case for both TDX and SNP.
>
> > So for SNP, KVM_PRE_FAULT_MEMORY/SNP_LAUNCH_UPDATE are mutually
> > exclusive on what GPA ranges they can prep before finalizing launch state.
>
> Not a problem; is KVM_PRE_FAULT_MEMORY before finalization the same as
> zeroing memory?
Sort of: you get a page that's zero'd by
gmem->kvm_gmem_prepare_folio()->clear_highpage(), and then that page is
mapped in the guest as a private page. But since no encrypted data was
written the guest will effectively see ciphertext until the guest actually
initializes the page after accepting it.
But you can sort of treat that whole sequence as being similar to calling
sev_gmem_post_populate() with page type KVM_SEV_SNP_PAGE_TYPE_ZERO (or
rather, a theoretical KVM_SEV_SNP_PAGE_TYPE_SCRAMBLE in this case), just
that in that case the page won't be pre-validated and it won't be
measured into the launch digest. But still if you look at it that way it's
a bit clearer why pre-fault shouldn't be performed on any ranges that will
later be populated again (unless gmem_populate callback itself handles it
and so aware of the restrictions).
>
> > I realize that is awkward for TDX, where the KVM_PRE_FAULT_MEMORY is
> > required to create the sEPT mapping before encrypting, but maybe it
> > would be possible for TDX to just do that implicitly within
> > KVM_TDX_INIT_MEM_REGION?
>
> Yes, and it's what the TDX API used to be like a while ago.
> Locking-wise, Rick confirmed offlist that there's no problem in
> calling kvm_arch_vcpu_pre_fault_memory() from tdx_gmem_post_populate()
> (my fault that it went offlist - email from the phone is hard...).
That's promising, if we go that route it probably makes sense for SNP to
do that as well, even though it's only an optimization in that case.
>
> To be clear, I have no problem at all reusing the prefaulting code,
> that's better than TDX having to do its own thing. But forcing
> userspace to do two passes is not great (it's already not great that
> it has to be TDX_INIT_MEM_REGION has to be a VCPU operation, but
> that's unfortunately unavoidable ).
Makes sense.
If we document mutual exclusion between ranges touched by
gmem_populate() vs ranges touched by actual userspace issuance of
KVM_PRE_FAULT_MEMORY (and make sure nothing too crazy happens if users
don't abide by the documentation), then I think most problems go away...
But there is still at least one awkward constraint for SNP:
KVM_PRE_FAULT_MEMORY cannot be called for private GPA ranges until after
SNP_LAUNCH_START is called. This is true even if the GPA range is not
one of the ranges that will get passed to
gmem_populate()/SNP_LAUNCH_UPDATE. The reason for this is that when
binding the ASID to the SNP context as part of SNP_LAUNCH_START, firmware
will perform checks to make sure that ASID is not already being used in
the RMP table, and that check will fail if KVM_PRE_FAULT_MEMORY triggered
for a private page before calling SNP_LAUNCH_START.
So we have this constraint that KVM_PRE_FAULT_MEMORY can't be issued
before SNP_LAUNCH_START. So it makes me wonder if we should just broaden
that constraint and for now just disallow KVM_PRE_FAULT_MEMORY prior to
finalizing a guest, since it'll be easier to lift that restriction later
versus discovering some other sort of edge case and need to
retroactively place restrictions.
I've taken Isaku's original pre_fault_memory_test and added a new
x86-specific coco_pre_fault_memory_test to try to better document and
exercise these corner cases for SEV and SNP, but I'm hoping it could
also be useful for TDX (hence the generic name). These are based on
Pratik's initial SNP selftests (which are in turn based on kvm/queue +
these patches):
https://github.com/mdroth/linux/commits/snp-uptodate0-kst/
https://github.com/mdroth/linux/commit/dd7d4b1983ceeb653132cfd54ad63f597db85f74
It could be interesting to also add some coverage interleaving of
fallocate()/FALLOC_FL_PUNCH_HOLE as well. I'll look into that more,
and more negative testing depending on the final semantics we end up
with.
Thanks,
Mike
>
> Paolo
>