Re: [PATCH 09/21] KVM: TDX: Retry seamcall when TDX_OPERAND_BUSY with operand SEPT

From: Sean Christopherson
Date: Tue Sep 10 2024 - 13:44:56 EST


On Tue, Sep 10, 2024, Rick P Edgecombe wrote:
> On Tue, 2024-09-10 at 08:57 -0700, Sean Christopherson wrote:
> > > Only if the TDX module returns BUSY per-SPTE (as suggested by 18.1.3,
> > > which documents that the TDX module returns TDX_OPERAND_BUSY on a
> > > CMPXCHG failure). If it returns BUSY per-VM, FROZEN_SPTE is not enough
> > > to prevent contention in the TDX module.
> >
> > Looking at the TDX module code, things like (UN)BLOCK and REMOVE take a per-VM
> > lock in write mode, but ADD, AUG, and PROMOTE/DEMOTE take the lock in read
> > mode.
>
> AUG does take other locks as exclusive:
> https://github.com/intel/tdx-module/blob/tdx_1.5/src/vmm_dispatcher/api_calls/tdh_mem_page_aug.c

Only a lock on the underlying physical page. guest_memfd should prevent mapping
the same HPA into multiple GPAs, and FROZEN_SPTE should prevent two vCPUs from
concurrently AUGing the same GPA+HPA.
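
Roughly, only the vCPU that wins the race to freeze the mirror SPTE should
ever reach the SEAMCALL.  A minimal sketch of that flow (hypothetical helper;
tdx_mem_page_aug() and the exact FROZEN_SPTE handling are stand-ins for
whatever the series actually does):

static int tdx_aug_private_spte(struct kvm_vcpu *vcpu, gfn_t gfn,
				u64 *sptep, u64 new_spte, hpa_t hpa)
{
	u64 old_spte = READ_ONCE(*sptep);

	/*
	 * Losing the race to freeze the SPTE means another vCPU owns the
	 * AUG for this GPA; retry the fault instead of contending for the
	 * TDX module's locks.
	 */
	if (!try_cmpxchg64(sptep, &old_spte, FROZEN_SPTE))
		return -EBUSY;

	/*
	 * Only the vCPU that froze the SPTE gets here, and guest_memfd
	 * ensures the HPA isn't mapped at a second GPA, so no other host
	 * task should hold the TDX module's page lock for this GPA+HPA.
	 */
	if (tdx_mem_page_aug(vcpu->kvm, gfn, hpa))
		return -EIO;

	WRITE_ONCE(*sptep, new_spte);
	return 0;
}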

> I count 5 locks in total as well. I think trying to mirror the locking in KVM
> will be an uphill battle.

I don't want to mirror the locking, I want to understand and document the
expectations and rules. "Throw 16 noodles and hope one sticks" is not a recipe
for success.

> > So for the operations that KVM can do in parallel, the locking should
> > effectively be per-entry.  Because KVM will never throw away an entire
> > S-EPT root, zapping SPTEs will need to be done while holding mmu_lock for
> > write, i.e. KVM shouldn't have problems with host tasks competing for the
> > TDX module's VM-wide lock.
> >
> > > If we want to be a bit more optimistic, let's do something more
> > > sophisticated, like only take the lock after the first busy reply. But
> > > the spinlock is the easiest way to completely remove host-induced
> > > TDX_OPERAND_BUSY, and only have to deal with guest-induced ones.
> >
> > I am not convinced that's necessary or a good idea.  I worry that doing so
> > would just kick the can down the road, and potentially make the problems
> > harder to solve, e.g. because we'd have to worry about regressing existing
> > setups.
> >
> > > > > It is still kinda bad that guests can force the VMM to loop, but the
> > > > > VMM can always say enough is enough.  In other words, let's assume
> > > > > that a limit of 16 is probably appropriate but we can also increase
> > > > > the limit and crash the VM if things become ridiculous.
> > > >
> > > > 2 :-)
> > > >
> > > > One try that guarantees no other host task is accessing the S-EPT
> > > > entry, and a second try after blasting IPI to kick vCPUs to ensure no
> > > > guest-side task has locked the S-EPT entry.
> > >
> > > Fair enough. Though in principle it is possible to race and have the
> > > vCPU re-run and re-issue a TDG call before KVM re-issues the TDH call.
> >
> > My limit of '2' is predicated on the lock being a "host priority" lock,
> > i.e. that kicking vCPUs would ensure the lock has been dropped and can't
> > be re-acquired by the guest.
>
> So kicking would be to try to break loose any deadlock we encountered? It sounds
> like the kind of kludge that could be hard to remove.

No, the intent of the kick would be to wait for vCPUs to exit, which in turn
guarantees that any locks held by vCPUs have been dropped. Again, this idea is
predicated on the lock being "host priority", i.e. that vCPUs can't re-take the
lock before KVM.
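
To be explicit, the shape I have in mind is something like the below (purely a
sketch, assuming a host-priority lock; tdh_mem_sept_op() is a stand-in for the
relevant TDH.MEM.* wrapper, and masking the status class out of the error code
is elided):

static u64 tdx_sept_seamcall(struct kvm *kvm, gfn_t gfn, hpa_t hpa)
{
	u64 err;

	/* First try: no host task should hold the S-EPT entry's lock. */
	err = tdh_mem_sept_op(kvm, gfn, hpa);
	if (likely(err != TDX_OPERAND_BUSY))
		return err;

	/*
	 * Force all vCPUs to exit the guest.  Once the request completes,
	 * any in-flight TDG.MEM call has finished and dropped its locks,
	 * and a host-priority lock means vCPUs can't re-take the S-EPT
	 * lock before KVM does, so a single retry suffices.
	 */
	kvm_make_all_cpus_request(kvm, KVM_REQ_OUTSIDE_GUEST_MODE);

	err = tdh_mem_sept_op(kvm, gfn, hpa);
	WARN_ON_ONCE(err == TDX_OPERAND_BUSY);
	return err;
}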