Re: [PATCH v19 038/130] KVM: TDX: create/destroy VM structure

From: Huang, Kai
Date: Thu Mar 28 2024 - 07:15:09 EST


On Wed, 2024-03-27 at 22:34 -0700, Isaku Yamahata wrote:
> On Thu, Mar 28, 2024 at 02:49:56PM +1300,
> "Huang, Kai" <kai.huang@xxxxxxxxx> wrote:
>
> >
> >
> > On 28/03/2024 11:53 am, Isaku Yamahata wrote:
> > > On Tue, Mar 26, 2024 at 02:43:54PM +1300,
> > > "Huang, Kai" <kai.huang@xxxxxxxxx> wrote:
> > >
> > > > ... continue the previous review ...
> > > >
> > > > > +
> > > > > +static void tdx_reclaim_control_page(unsigned long td_page_pa)
> > > > > +{
> > > > > + WARN_ON_ONCE(!td_page_pa);
> > > >
> > > > From the name 'td_page_pa' we cannot tell whether it is a control page, but
> > > > this function is only intended for control page AFAICT, so perhaps a more
> > > > specific name.
> > > >
> > > > > +
> > > > > + /*
> > > > > + * TDCX are being reclaimed. TDX module maps TDCX with HKID
> > > >
> > > > "are" -> "is".
> > > >
> > > > Are you sure it is TDCX, but not TDCS?
> > > >
> > > > AFAICT TDCX is the control structure for 'vcpu', but here you are handling
> > > > the control structure for the VM.
> > >
> > > TDCS, TDVPR, and TDCX. Will update the comment.
> >
> > But TDCX, TDVPR are vcpu-scoped. Do you want to mention them _here_?
>
> So I'll make the patch that frees TDVPR, TDCX will change this comment.
>

Hmm.. Looking again, I am not sure why do we even need
tdx_reclaim_control_page()?

It basically does tdx_reclaim_page() + free_page():

+static void tdx_reclaim_control_page(unsigned long td_page_pa)
+{
+ WARN_ON_ONCE(!td_page_pa);
+
+ /*
+ * TDCX are being reclaimed. TDX module maps TDCX with HKID
+ * assigned to the TD. Here the cache associated to the TD
+ * was already flushed by TDH.PHYMEM.CACHE.WB before here, So
+ * cache doesn't need to be flushed again.
+ */
+ if (tdx_reclaim_page(td_page_pa))
+ /*
+ * Leak the page on failure:
+ * tdx_reclaim_page() returns an error if and only if there's
an
+ * unexpected, fatal error, e.g. a SEAMCALL with bad params,
+ * incorrect concurrency in KVM, a TDX Module bug, etc.
+ * Retrying at a later point is highly unlikely to be
+ * successful.
+ * No log here as tdx_reclaim_page() already did.
+ */
+ return;
+ free_page((unsigned long)__va(td_page_pa));
+}

And why do you need a special function just for control page(s)?

>
> > Otherwise you will have to explain them.
> >
> > [...]
> >
> > > > > +
> > > > > +void tdx_mmu_release_hkid(struct kvm *kvm)
> > > > > +{
> > > > > + bool packages_allocated, targets_allocated;
> > > > > + struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
> > > > > + cpumask_var_t packages, targets;
> > > > > + u64 err;
> > > > > + int i;
> > > > > +
> > > > > + if (!is_hkid_assigned(kvm_tdx))
> > > > > + return;
> > > > > +
> > > > > + if (!is_td_created(kvm_tdx)) {
> > > > > + tdx_hkid_free(kvm_tdx);
> > > > > + return;
> > > > > + }
> > > >
> > > > I lost tracking what does "td_created()" mean.
> > > >
> > > > I guess it means: KeyID has been allocated to the TDX guest, but not yet
> > > > programmed/configured.
> > > >
> > > > Perhaps add a comment to remind the reviewer?
> > >
> > > As Chao suggested, will introduce state machine for vm and vcpu.
> > >
> > > https://lore.kernel.org/kvm/ZfvI8t7SlfIsxbmT@chao-email/
> >
> > Could you elaborate what will the state machine look like?
> >
> > I need to understand it.
>
> Not yet. Chao only propose to introduce state machine. Right now it's just an
> idea.

Then why state machine is better? I guess we need some concrete example to tell
which is better?

>
>
> > > How about this?
> > >
> > > /*
> > > * We need three SEAMCALLs, TDH.MNG.VPFLUSHDONE(), TDH.PHYMEM.CACHE.WB(), and
> > > * TDH.MNG.KEY.FREEID() to free the HKID.
> > > * Other threads can remove pages from TD. When the HKID is assigned, we need
> > > * to use TDH.MEM.SEPT.REMOVE() or TDH.MEM.PAGE.REMOVE().
> > > * TDH.PHYMEM.PAGE.RECLAIM() is needed when the HKID is free. Get lock to not
> > > * present transient state of HKID.
> > > */
> >
> > Could you elaborate why it is still possible to have other thread removing
> > pages from TD?
> >
> > I am probably missing something, but the thing I don't understand is why
> > this function is triggered by MMU release? All the things done in this
> > function don't seem to be related to MMU at all.
>
> The KVM releases EPT pages on MMU notifier release. kvm_mmu_zap_all() does. If
> we follow that way, kvm_mmu_zap_all() zaps all the Secure-EPTs by
> TDH.MEM.SEPT.REMOVE() or TDH.MEM.PAGE.REMOVE(). Because
> TDH.MEM.{SEPT, PAGE}.REMOVE() is slow, we can free HKID before kvm_mmu_zap_all()
> to use TDH.PHYMEM.PAGE.RECLAIM().

Can you elaborate why TDH.MEM.{SEPT,PAGE}.REMOVE is slower than
TDH.PHYMEM.PAGE.RECLAIM()?

And does the difference matter in practice, i.e. did you see using the former
having noticeable performance downgrade?

>
>
> > IIUC, by reaching here, you must already have done VPFLUSHDONE, which should
> > be called when you free vcpu?
>
> Not necessarily.

OK. I got confused between TDH.VP.FLUSH and TDH.MNG.VPFLUSHDONE.

>
>
> > Freeing vcpus is done in
> > kvm_arch_destroy_vm(), which is _after_ mmu_notifier->release(), in which
> > this tdx_mmu_release_keyid() is called?
>
> guest memfd complicates things. The race is between guest memfd release and mmu
> notifier release. kvm_arch_destroy_vm() is called after closing all kvm fds
> including guest memfd.
>
> Here is the example. Let's say, we have fds for vhost, guest_memfd, kvm vcpu,
> and kvm vm. The process is exiting. Please notice vhost increments the
> reference of the mmu to access guest (shared) memory.
>
> exit_mmap():
> Usually mmu notifier release is fired. But not yet because of vhost.
>
> exit_files()
> close vhost fd. vhost starts timer to issue mmput().

Why does it need to start a timer to issue mmput(), but not call mmput()
directly?

>
> close guest_memfd. kvm_gmem_release() calls kvm_mmu_unmap_gfn_range().
> kvm_mmu_unmap_gfn_range() eventually this calls TDH.MEM.SEPT.REMOVE()
> and TDH.MEM.PAGE.REMOVE(). This takes time because it processes whole
> guest memory. Call kvm_put_kvm() at last.
>
> During unmapping on behalf of guest memfd, the timer of vhost fires to call
> mmput(). It triggers mmu notifier release.
>
> Close kvm vcpus/vm. they call kvm_put_kvm(). The last one calls
> kvm_destroy_vm().
>
> It's ideal to free HKID first for efficiency. But KVM doesn't have control on
> the order of fds.

Firstly, what kinda performance efficiency gain are we talking about?

We cannot really tell whether it can be justified to use two different methods
to tear down SEPT page because of this.

Even if it's worth to do, it is an optimization, which can/should be done later
after you have put all building blocks together.

That being said, you are putting too many logic in this patch, i.e., it just
doesn't make sense to release TDX keyID in the MMU code path in _this_ patch.

>
>
> > But here we are depending vcpus to be freed before tdx_mmu_release_hkid()?
>
> Not necessarily.

I am wondering when is TDH.VP.FLUSH done? Supposedly it should be called when
we free vcpus? But again this means you need to call TDH.MNG.VPFLUSHDONE
_after_ freeing vcpus. And this looks conflicting if you make
tdx_mmu_release_keyid() being called from MMU notifier.