Re: [PATCH 09/11] KVM: guest_memfd: Add interface for populating gmem pages with user data

From: Sean Christopherson
Date: Mon Jun 10 2024 - 19:41:10 EST


On Tue, Jun 11, 2024, Paolo Bonzini wrote:
> On 6/10/24 23:48, Paolo Bonzini wrote:
> > On Sat, Jun 8, 2024 at 1:03 AM Sean Christopherson <seanjc@xxxxxxxxxx> wrote:
> > > SNP folks and/or Paolo, what's the plan for this? I don't see how what's sitting
> > > in kvm/next can possibly be correct without conditioning population on the folio
> > > being !uptodate.
> >
> > I don't think I have time to look at it closely until Friday; but
> > thanks for reminding me.
>
> Ok, I'm officially confused. I think I understand what you did in your
> suggested code. Limiting it to the bare minimum (keeping the callback
> instead of CONFIG_HAVE_KVM_GMEM_INITIALIZE) it would be something
> like what I include at the end of the message.
>
> But the discussion upthread was about whether to do the check for
> RMP state in sev.c, or do it in common code using folio_mark_uptodate().
> I am not sure what you mean by "cannot possibly be correct", and
> whether it's referring to kvm_gmem_populate() in general or the
> callback in sev_gmem_post_populate().

Doing fallocate() before KVM_SEV_SNP_LAUNCH_UPDATE will cause the latter to fail.
That likely works for QEMU, at least for now, but it's unnecessarily restrictive
and IMO incorrect/wrong.

E.g. a more convoluted fallocate() + PUNCH_HOLE + KVM_SEV_SNP_LAUNCH_UPDATE sequence
will work (I think? AFAICT, adding and removing pages directly to/from the RMP doesn't
affect SNP's measurement; only pages that are added via SNP_LAUNCH_UPDATE affect
the measurement).
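
In userspace terms, that sequence is roughly the below (untested sketch; the
launch_update field names are approximations of the uAPI, other fields are omitted
for brevity, and the gmem offset is assumed to equal the gpa):

#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

static int snp_populate_after_hole_punch(int vm_fd, int gmem_fd, __u64 gpa,
					 void *src, __u64 len)
{
	struct kvm_sev_snp_launch_update update = {
		.gfn_start = gpa >> 12,		/* approximate field names */
		.uaddr     = (unsigned long)src,
		.len       = len,
	};
	struct kvm_sev_cmd cmd = {
		.id   = KVM_SEV_SNP_LAUNCH_UPDATE,
		.data = (unsigned long)&update,
	};

	/* Allocate the range, then immediately punch it back out... */
	if (fallocate(gmem_fd, 0, gpa, len))
		return -1;
	if (fallocate(gmem_fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
		      gpa, len))
		return -1;

	/* ...and LAUNCH_UPDATE (via KVM_MEMORY_ENCRYPT_OP) still succeeds. */
	return ioctl(vm_fd, KVM_MEMORY_ENCRYPT_OP, &cmd);
}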

Punting the sanity check to vendor code is also gross and will make it harder to
provide a consistent, unified ABI for all architectures. E.g. SNP returns -EINVAL
if the page is already assigned, which is quite misleading.

> The change below looks like just an optimization to me, which
> suggests that I'm missing something glaring.

I really dislike @prepare. There are two paths that should actually initialize
the contents of the folio, and they are mutually exclusive and have meaningfully
different behavior. Faulting in memory via kvm_gmem_get_pfn() explicitly zeros
the folio _if necessary_, whereas kvm_gmem_populate() initializes the folio with
user-provided data _and_ requires that the folio be !uptodate.

If we fix the above oddity where fallocate() initializes memory, then there's
no need to try and handle the initialization in a common chokepoint as the two
relevant paths will naturally have unique code.
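
Something like the below shape is what I'm getting at (illustrative only, not an
actual patch; both helper names are made up):

#include <linux/highmem.h>
#include <linux/pagemap.h>
#include <linux/uaccess.h>

/* Fault path: zero the folio on first use, tracked via the uptodate flag. */
static void kvm_gmem_zero_folio(struct folio *folio)
{
	unsigned long i;

	if (folio_test_uptodate(folio))
		return;

	for (i = 0; i < folio_nr_pages(folio); i++)
		clear_highpage(folio_page(folio, i));

	folio_mark_uptodate(folio);
}

/* Populate path: refuse folios that have already been initialized. */
static int kvm_gmem_populate_folio(struct folio *folio, const void __user *src)
{
	if (folio_test_uptodate(folio))
		return -EEXIST;

	if (copy_from_user(folio_address(folio), src, folio_size(folio)))
		return -EFAULT;

	folio_mark_uptodate(folio);
	return 0;
}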

The below is also still suboptimal for TDX, as KVM will zero the memory and then
the TDX-module will also zero the memory on PAGE.AUG.

And I find SNP to be the odd one. IIUC, the ASP (the artist formerly known as
the PSP) doesn't provide any guarantees about the contents of a page that is
assigned to a guest without bouncing through SNP_LAUNCH_UPDATE. It'd be nice to
explicitly document that somewhere in the SNP code. E.g. if guest_memfd invokes
a common kvm_gmem_initialize_folio() or whatever, then SNP's implementation can
clearly capture that KVM zeros the page to protect the _host_ data.
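
E.g. a hypothetical SNP hook, purely to show where that comment would live
(kvm_gmem_initialize_folio() and the function below are made-up names):

#include <linux/highmem.h>

static void sev_snp_gmem_initialize_folio(struct folio *folio)
{
	unsigned long i;

	/*
	 * The ASP makes no guarantees about the contents of a page that is
	 * assigned to the guest without bouncing through SNP_LAUNCH_UPDATE,
	 * so zero the page to avoid leaking _host_ data to the guest.
	 */
	for (i = 0; i < folio_nr_pages(folio); i++)
		clear_highpage(folio_page(folio, i));
}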

> diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
> index d4206e53a9c81..a0417ef5b86eb 100644
> --- a/virt/kvm/guest_memfd.c
> +++ b/virt/kvm/guest_memfd.c
> @@ -52,37 +52,39 @@ static int kvm_gmem_prepare_folio(struct inode *inode, pgoff_t index, struct fol
>  static struct folio *kvm_gmem_get_folio(struct inode *inode, pgoff_t index, bool prepare)
>  {
>  	struct folio *folio;
> +	int r;
>
>  	/* TODO: Support huge pages. */
>  	folio = filemap_grab_folio(inode->i_mapping, index);
>  	if (IS_ERR(folio))
>  		return folio;
>
> -	/*
> -	 * Use the up-to-date flag to track whether or not the memory has been
> -	 * zeroed before being handed off to the guest. There is no backing
> -	 * storage for the memory, so the folio will remain up-to-date until
> -	 * it's removed.
> -	 *
> -	 * TODO: Skip clearing pages when trusted firmware will do it when
> -	 * assigning memory to the guest.
> -	 */
> -	if (!folio_test_uptodate(folio)) {
> -		unsigned long nr_pages = folio_nr_pages(folio);
> -		unsigned long i;
> +	if (prepare) {
> +		/*
> +		 * Use the up-to-date flag to track whether or not the memory has
> +		 * been handed off to the guest. There is no backing storage for
> +		 * the memory, so the folio will remain up-to-date until it's
> +		 * removed.
> +		 *
> +		 * Take the occasion of the first prepare operation to clear it.
> +		 */
> +		if (!folio_test_uptodate(folio)) {
> +			unsigned long nr_pages = folio_nr_pages(folio);
> +			unsigned long i;
>
> -		for (i = 0; i < nr_pages; i++)
> -			clear_highpage(folio_page(folio, i));
> +			for (i = 0; i < nr_pages; i++)
> +				clear_highpage(folio_page(folio, i));
> +		}
> +
> +		r = kvm_gmem_prepare_folio(inode, index, folio);
> +		if (r < 0)
> +			goto err_unlock_put;
>
>  		folio_mark_uptodate(folio);
> -	}
> -
> -	if (prepare) {
> -		int r = kvm_gmem_prepare_folio(inode, index, folio);
> -		if (r < 0) {
> -			folio_unlock(folio);
> -			folio_put(folio);
> -			return ERR_PTR(r);
> +	} else {
> +		if (folio_test_uptodate(folio)) {
> +			r = -EEXIST;
> +			goto err_unlock_put;
>  		}
>  	}