Re: [PATCH 09/21] KVM: TDX: Retry seamcall when TDX_OPERAND_BUSY with operand SEPT
From: Yan Zhao
Date: Thu Oct 10 2024 - 22:33:22 EST
On Fri, Oct 11, 2024 at 05:53:29AM +0800, Edgecombe, Rick P wrote:
> On Thu, 2024-10-10 at 10:33 -0700, Sean Christopherson wrote:
> > >
> > > 1st: "fault->is_private != kvm_mem_is_private(kvm, fault->gfn)" is found.
> > > 2nd-6th: try_cmpxchg64() fails on each level SPTEs (5 levels in total)
>
> Isn't there a more general scenario:
>
> vcpu0 vcpu1
> 1. Freezes PTE
> 2. External op to do the SEAMCALL
> 3. Faults same PTE, hits frozen PTE
> 4. Retries N times, triggers zero-step
> 5. Finally finishes external op
>
> Am I missing something?
Yes, it's a follow-up discussion of Sean's proposal [1] of having TDX code to
retry on RET_PF_RETRY_FROZEN to avoid zero-step.
My worry is that merely avoiding entering guest for vCPUs seeing FROZEN_SPTE is
not enough to prevent zero-step.
The two examples shows zero-step is possible without re-entering guest for
FROZEN_SPTE:
- The selftest [2]: a single vCPU can fire zero-step when userspace does
something wrong (though KVM is correct).
- The above case: Nothing wrong in KVM/QEMU, except an extremely unlucky vCPU.
[1] https://lore.kernel.org/all/ZuR09EqzU1WbQYGd@xxxxxxxxxx/
[2] https://lore.kernel.org/all/ZvPrqMj1BWrkkwqN@xxxxxxxxxxxxxxxxxxxxxxxxx/
>
> >
> > Very technically, this shouldn't be possible. The only way for there to be
> > contention on the leaf SPTE is if some other KVM task installed a SPTE, i.e.
> > the
> > 6th attempt should succeed, even if the faulting vCPU wasn't the one to create
> > the SPTE.
> >
> > That said, a few thoughts:
> >
> > 1. Where did we end up on the idea of requiring userspace to pre-fault memory?
>
> For others reference, I think you are referring to the idea to pre-fault the
> entire S-EPT even for GFNs that usually get AUGed, not the mirrored EPT pre-
> faulting/PAGE.ADD dance we are already doing.
>
> The last discussion with Paolo was to resume the retry solution discussion on
> the v2 posting because it would be easier "with everything else already
> addressed". Also, there was also some discussion that it was not immediately
> obvious how prefaulting everything would work for memory hot plug (i.e. memslots
> added during runtime).
>
> >
> > 2. The zero-step logic really should have a slightly more conservative
> > threshold.
> > I have a hard time believing that e.g. 10 attempts would create a side
> > channel,
> > but 6 attempts is "fine".
>
> No idea where the threshold came from. I'm not sure if it affects the KVM
> design? We can look into it for curiosity sake in either case.
>
> >
> > 3. This would be a good reason to implement a local retry in
> > kvm_tdp_mmu_map().
> > Yes, I'm being somewhat hypocritical since I'm so against retrying for the
> > S-EPT case, but my objection to retrying for S-EPT is that it _should_ be
> > easy
> > for KVM to guarantee success.
> >
> > E.g. for #3, the below (compile tested only) patch should make it impossible
> > for
> > the S-EPT case to fail, as dirty logging isn't (yet) supported and mirror
> > SPTEs
> > should never trigger A/D assists, i.e. retry should always succeed.
>
> I don't see how it addresses the scenario above. More retires could just make it
> rarer, but never fix it. Very possible I'm missing something though.
I'm also not 100% sure if zero-step must not happen after this change even when
KVM/QEMU do nothing wrong.