Re: [PATCH 09/21] KVM: TDX: Retry seamcall when TDX_OPERAND_BUSY with operand SEPT

From: Paolo Bonzini
Date: Tue Sep 10 2024 - 09:15:53 EST


On 9/9/24 23:11, Sean Christopherson wrote:
In general, I am_very_ opposed to blindly retrying an SEPT SEAMCALL, ever. For
its operations, I'm pretty sure the only sane approach is for KVM to ensure there
will be no contention. And if the TDX module's single-step protection spuriously
kicks in, KVM exits to userspace. If the TDX module can't/doesn't/won't communicate
that it's mitigating single-step, e.g. so that KVM can forward the information
to userspace, then that's a TDX module problem to solve.

In principle I agree but we also need to be pragmatic. Exiting to userspace may not be practical in all flows, for example.

First of all, we can add a spinlock around affected seamcalls. This way we know that "busy" errors must come from the guest and have set HOST_PRIORITY. It is still kinda bad that guests can force the VMM to loop, but the VMM can always say enough is enough. In other words, let's assume that a limit of 16 is probably appropriate but we can also increase the limit and crash the VM if things become ridiculous.

Something like this:

static u32 max = 16;
int retry = 0;
spin_lock(&kvm->arch.seamcall_lock);
for (;;) {
args_in = *in;
ret = seamcall_ret(op, in);
if (++retry == 1) {
/* protected by the same seamcall_lock */
kvm->stat.retried_seamcalls++;
} else if (retry == READ_ONCE(max)) {
pr_warn("Exceeded %d retries for S-EPT operation\n", max);
if (KVM_BUG_ON(kvm, retry == 1024)) {
pr_err("Crashing due to lock contention in the TDX module\n");
break;
}
cmpxchg(&max, retry, retry * 2);
}
}
spin_unlock(&kvm->arch.seamcall_lock);

This way we can do some testing and figure out a useful limit.

For zero step detection, my reading is that it's TDH.VP.ENTER that fails; not any of the MEM seamcalls. For that one to be resolved, it should be enough to do take and release the mmu_lock back to back, which ensures that all pending critical sections have completed (that is, "write_lock(&kvm->mmu_lock); write_unlock(&kvm->mmu_lock);"). And then loop. Adding a vCPU stat for that one is a good idea, too.

Paolo