Re: [PATCH 09/12] KVM: guest_memfd: move check for already-populated page to common code

From: Edgecombe, Rick P
Date: Sat Jul 13 2024 - 16:25:57 EST


On Sat, 2024-07-13 at 12:10 +0200, Paolo Bonzini wrote:
> >
> > This patch breaks our rebased TDX development tree. First
> > kvm_gmem_prepare_folio() is called during the KVM_PRE_FAULT_MEMORY
> > operation,
> > then next kvm_gmem_populate() is called during the KVM_TDX_INIT_MEM_REGION
> > ioctl
> > to actually populate the memory, which hits the new -EEXIST error path.
>
> It's not a problem to only keep patches 1-8 for 6.11, and move the
> rest to 6.12 (except for the bit that returns -EEXIST in sev.c).
>
> Could you push a branch for me to take a look?

Sure, here it is.

KVM:
https://github.com/rpedgeco/linux/tree/tdx_kvm_dev-2024-07-12-mark_uptodate_issue
Matching QEMU:
https://github.com/intel-staging/qemu-tdx/releases/tag/tdx-qemu-wip-2024.06.19-v9.0.0

It is not fully based on kvm-coco-queue because it has the latest v2 of the
zapping quirk swapped in.

> I've never liked that
> you have to do the explicit prefault before the VM setup is finished;
> it's a TDX-specific detail that is transpiring into the API.

Well, it's not too late to change direction again. I remember you and Sean were
not fully of one mind on the tradeoffs.

I guess this series is trying to help userspace not mess up the order of things
for SEV, where as TDX's design was to let userspace hold the pieces from the
beginning. As in, needing to match up the KVM_PRE_FAULT_MEMORY and
KVM_TDX_INIT_MEM_REGION calls, mysteriously return errors in later IOCTLs if
something was missed, etc.

Still, I might lean towards staying the course just because we have gone down
this path for a while and we don't currently have any fundamental issues.
Probably we *really* need to get the next TDX MMU stuff posted so we can start
to add a bit more certainty to that statement.

>
> > Given we are not actually populating during KVM_PRE_FAULT_MEMORY and try to
> > avoid booting a TD until we've done so, maybe setting folio_mark_uptodate()
> > in
> > kvm_gmem_prepare_folio() is not appropriate in that case? But it may not be
> > easy
> > to separate.
>
> It would be easy (just return a boolean value from
> kvm_arch_gmem_prepare() to skip folio_mark_uptodate() before the VM is
> ready, and implement it for TDX) but it's ugly. You're also clearing
> the memory unnecessarily before overwriting it.

Hmm, right. Since kvm_gmem_populate() does folio_mark_uptodate() again despite
testing for it earlier, we can skip folio_mark_uptodate() in
kvm_gmem_prepare_folio() for TDX during the pre-finalization stage and it will
get marked there.

I put a little POC of this suggestion at the end of the branch. Just revert it
to reproduce the issue.

I think in the context of the work to launch a TD, extra clearing of pages is
not too bad. I'm more bothered by how it highlights the general pitfalls of
TDX's special clever behavior for KVM_PRE_FAULT_MEMORY before TD initialization.

If/when we want to skip it, I wonder if we could move the clearing into the
gmem_prepare callbacks.