Re: [PATCH 09/21] KVM: TDX: Retry seamcall when TDX_OPERAND_BUSY with operand SEPT
From: Edgecombe, Rick P
Date: Tue Sep 10 2024 - 12:28:42 EST
On Tue, 2024-09-10 at 08:57 -0700, Sean Christopherson wrote:
> > Only if the TDX module returns BUSY per-SPTE (as suggested by 18.1.3,
> > which documents that the TDX module returns TDX_OPERAND_BUSY on a
> > CMPXCHG failure). If it returns BUSY per-VM, FROZEN_SPTE is not enough
> > to prevent contention in the TDX module.
>
> Looking at the TDX module code, things like (UN)BLOCK and REMOVE take a per-VM
> lock in write mode, but ADD, AUG, and PROMOTE/DEMOTE take the lock in read
> mode.
AUG does take other locks as exclusive:
https://github.com/intel/tdx-module/blob/tdx_1.5/src/vmm_dispatcher/api_calls/tdh_mem_page_aug.c
I count 5 locks in total as well. I think trying to mirror the locking in KVM
will be an uphill battle.
>
> So for the operations that KVM can do in parallel, the locking should
> effectively
> be per-entry. Because KVM will never throw away an entire S-EPT root, zapping
> SPTEs will need to be done while holding mmu_lock for write, i.e. KVM
> shouldn't
> have problems with host tasks competing for the TDX module's VM-wide lock.
>
> > If we want to be a bit more optimistic, let's do something more
> > sophisticated, like only take the lock after the first busy reply. But
> > the spinlock is the easiest way to completely remove host-induced
> > TDX_OPERAND_BUSY, and only have to deal with guest-induced ones.
>
> I am not convinced that's necessary or a good idea. I worry that doing so
> would
> just kick the can down the road, and potentially make the problems harder to
> solve,
> e.g. because we'd have to worry about regressing existing setups.
>
> > > > It is still kinda bad that guests can force the VMM to loop, but the VMM
> > > > can
> > > > always say enough is enough. In other words, let's assume that a limit
> > > > of
> > > > 16 is probably appropriate but we can also increase the limit and crash
> > > > the
> > > > VM if things become ridiculous.
> > >
> > > 2 :-)
> > >
> > > One try that guarantees no other host task is accessing the S-EPT entry,
> > > and a
> > > second try after blasting IPI to kick vCPUs to ensure no guest-side task
> > > has
> > > locked the S-EPT entry.
> >
> > Fair enough. Though in principle it is possible to race and have the
> > vCPU re-run and re-issue a TDG call before KVM re-issues the TDH call.
>
> My limit of '2' is predicated on the lock being a "host priority" lock, i.e.
> that
> kicking vCPUs would ensure the lock has been dropped and can't be re-acquired
> by
> the guest.
So kicking would be to try to break loose any deadlock we encountered? It sounds
like the kind of kludge that could be hard to remove.
>
> > So I would make it 5 or so just to be safe.
> >
> > > My concern with an arbitrary retry loop is that we'll essentially
> > > propagate the
> > > TDX module issues to the broader kernel. Each of those SEAMCALLs is
> > > slooow, so
> > > retrying even ~20 times could exceed the system's tolerances for
> > > scheduling, RCU,
> > > etc...
> >
> > How slow are the failed ones? The number of retries is essentially the
> > cost of successful seamcall / cost of busy seamcall.
>
> I haven't measured, but would be surprised if it's less than 2000 cycles.
>
> > If HOST_PRIORITY works, even a not-small-but-not-huge number of
> > retries would be better than the IPIs. IPIs are not cheap either.
>
> Agreed, but we also need to account for the operations that are conflicting.
> E.g. if KVM is trying to zap a S-EPT that the guest is accessing, then busy
> waiting
> for the to-be-zapped S-EPT entry to be available doesn't make much sense.
>
> > > > For zero step detection, my reading is that it's TDH.VP.ENTER that
> > > > fails;
> > > > not any of the MEM seamcalls. For that one to be resolved, it should be
> > > > enough to do take and release the mmu_lock back to back, which ensures
> > > > that
> > > > all pending critical sections have completed (that is,
> > > > "write_lock(&kvm->mmu_lock); write_unlock(&kvm->mmu_lock);"). And then
> > > > loop. Adding a vCPU stat for that one is a good idea, too.
> > >
> > > As above and in my discussion with Rick, I would prefer to kick vCPUs to
> > > force
> > > forward progress, especially for the zero-step case. If KVM gets to the
> > > point
> > > where it has retried TDH.VP.ENTER on the same fault so many times that
> > > zero-step
> > > kicks in, then it's time to kick and wait, not keep retrying blindly.
> >
> > Wait, zero-step detection should _not_ affect TDH.MEM latency. Only
> > TDH.VP.ENTER is delayed.
>
> Blocked, not delayed. Yes, it's TDH.VP.ENTER that "fails", but to get past
> TDH.VP.ENTER, KVM needs to resolve the underlying fault, i.e. needs to
> guarantee
> forward progress for TDH.MEM (or whatever the operations are called).
>
> Though I wonder, are there any operations guest/host operations that can
> conflict
> if the vCPU is faulting? Maybe this particular scenario is a complete non-
> issue.
>
> > If it is delayed to the point of failing, we can do
> > write_lock/write_unlock()
> > in the vCPU entry path.
>
> I was thinking that KVM could set a flag (another synthetic error code bit?)
> to
> tell the page fault handler that it needs to kick vCPUs. But as above, it
> might
> be unnecessary.
>
> > My issue is that, even if we could make it a bit better by looking at
> > the TDX module source code, we don't have enough information to make a
> > good choice. For now we should start with something _easy_, even if
> > it may not be the greatest.
>
> I am not opposed to an easy/simple solution, but I am very much opposed to
> implementing a retry loop without understanding _exactly_ when and why it's
> needed.
I'd like to explore letting KVM do the retries (i.e. EPT fault loop) a bit more.
We can verify that we can survive zero-step in this case. After all, zero-step
doesn't kill the TD, just generates an EPT violation exit. So we would just need
to verify that the EPT violation getting generated would result in KVM
eventually fixing whatever zero-step is requiring.
Then we would have to handle BUSY in each SEAMCALL call chain, which currently
we don't. Like the zapping case. If we ended up needing a retry loop for limited
cases like that, at least it would be more limited.