Re: [PATCH 09/21] KVM: TDX: Retry seamcall when TDX_OPERAND_BUSY with operand SEPT

From: Sean Christopherson
Date: Tue Sep 10 2024 - 11:57:12 EST


On Tue, Sep 10, 2024, Paolo Bonzini wrote:
> On Tue, Sep 10, 2024 at 3:58 PM Sean Christopherson <seanjc@xxxxxxxxxx> wrote:
> > On Tue, Sep 10, 2024, Paolo Bonzini wrote:
> > No, because that defeates the purpose of having mmu_lock be a rwlock.
>
> But if this part of the TDX module is wrapped in a single big
> try_lock, there's no difference in spinning around busy seamcalls, or
> doing spin_lock(&kvm->arch.seamcall_lock). All of them hit contention
> in the same way. With respect to FROZEN_SPTE...
>
> > > This way we know that "busy" errors must come from the guest and have set
> > > HOST_PRIORITY.
> >
> > We should be able to achieve that without a VM-wide spinlock. My thought (from
> > v11?) was to effectively use the FROZEN_SPTE bit as a per-SPTE spinlock, i.e. keep
> > it set until the SEAMCALL completes.
>
> Only if the TDX module returns BUSY per-SPTE (as suggested by 18.1.3,
> which documents that the TDX module returns TDX_OPERAND_BUSY on a
> CMPXCHG failure). If it returns BUSY per-VM, FROZEN_SPTE is not enough
> to prevent contention in the TDX module.

Looking at the TDX module code, things like (UN)BLOCK and REMOVE take a per-VM
lock in write mode, but ADD, AUG, and PROMOTE/DEMOTE take the lock in read mode.

So for the operations that KVM can do in parallel, the locking should effectively
be per-entry. Because KVM will never throw away an entire S-EPT root, zapping
SPTEs will need to be done while holding mmu_lock for write, i.e. KVM shouldn't
have problems with host tasks competing for the TDX module's VM-wide lock.

> If we want to be a bit more optimistic, let's do something more
> sophisticated, like only take the lock after the first busy reply. But
> the spinlock is the easiest way to completely remove host-induced
> TDX_OPERAND_BUSY, and only have to deal with guest-induced ones.

I am not convinced that's necessary or a good idea. I worry that doing so would
just kick the can down the road, and potentially make the problems harder to solve,
e.g. because we'd have to worry about regressing existing setups.

> > > It is still kinda bad that guests can force the VMM to loop, but the VMM can
> > > always say enough is enough. In other words, let's assume that a limit of
> > > 16 is probably appropriate but we can also increase the limit and crash the
> > > VM if things become ridiculous.
> >
> > 2 :-)
> >
> > One try that guarantees no other host task is accessing the S-EPT entry, and a
> > second try after blasting IPI to kick vCPUs to ensure no guest-side task has
> > locked the S-EPT entry.
>
> Fair enough. Though in principle it is possible to race and have the
> vCPU re-run and re-issue a TDG call before KVM re-issues the TDH call.

My limit of '2' is predicated on the lock being a "host priority" lock, i.e. that
kicking vCPUs would ensure the lock has been dropped and can't be re-acquired by
the guest.

> So I would make it 5 or so just to be safe.
>
> > My concern with an arbitrary retry loop is that we'll essentially propagate the
> > TDX module issues to the broader kernel. Each of those SEAMCALLs is slooow, so
> > retrying even ~20 times could exceed the system's tolerances for scheduling, RCU,
> > etc...
>
> How slow are the failed ones? The number of retries is essentially the
> cost of successful seamcall / cost of busy seamcall.

I haven't measured, but would be surprised if it's less than 2000 cycles.

> If HOST_PRIORITY works, even a not-small-but-not-huge number of
> retries would be better than the IPIs. IPIs are not cheap either.

Agreed, but we also need to account for the operations that are conflicting.
E.g. if KVM is trying to zap a S-EPT that the guest is accessing, then busy waiting
for the to-be-zapped S-EPT entry to be available doesn't make much sense.

> > > For zero step detection, my reading is that it's TDH.VP.ENTER that fails;
> > > not any of the MEM seamcalls. For that one to be resolved, it should be
> > > enough to do take and release the mmu_lock back to back, which ensures that
> > > all pending critical sections have completed (that is,
> > > "write_lock(&kvm->mmu_lock); write_unlock(&kvm->mmu_lock);"). And then
> > > loop. Adding a vCPU stat for that one is a good idea, too.
> >
> > As above and in my discussion with Rick, I would prefer to kick vCPUs to force
> > forward progress, especially for the zero-step case. If KVM gets to the point
> > where it has retried TDH.VP.ENTER on the same fault so many times that zero-step
> > kicks in, then it's time to kick and wait, not keep retrying blindly.
>
> Wait, zero-step detection should _not_ affect TDH.MEM latency. Only
> TDH.VP.ENTER is delayed.

Blocked, not delayed. Yes, it's TDH.VP.ENTER that "fails", but to get past
TDH.VP.ENTER, KVM needs to resolve the underlying fault, i.e. needs to guarantee
forward progress for TDH.MEM (or whatever the operations are called).

Though I wonder, are there any operations guest/host operations that can conflict
if the vCPU is faulting? Maybe this particular scenario is a complete non-issue.

> If it is delayed to the point of failing, we can do write_lock/write_unlock()
> in the vCPU entry path.

I was thinking that KVM could set a flag (another synthetic error code bit?) to
tell the page fault handler that it needs to kick vCPUs. But as above, it might
be unnecessary.

> My issue is that, even if we could make it a bit better by looking at
> the TDX module source code, we don't have enough information to make a
> good choice. For now we should start with something _easy_, even if
> it may not be the greatest.

I am not opposed to an easy/simple solution, but I am very much opposed to
implementing a retry loop without understanding _exactly_ when and why it's
needed.