Re: [PATCH 1/1] x86/tdx: Route safe halt execution via tdx_safe_halt

From: Sean Christopherson
Date: Wed Jan 29 2025 - 09:00:40 EST


On Wed, Jan 29, 2025, Kirill A. Shutemov wrote:
> On Tue, Jan 28, 2025 at 04:45:35PM -0800, Sean Christopherson wrote:
> > This incorrectly assumes the hypervisor is intercepting HLT. If the VM is given
> > a slice of hardware, HLT-exiting may be disabled, in which case it's desirable
> > for the guest to natively execute HLT, as the latencies to get in and out of "HLT"
> > are lower, especially for TDX guests. Such a VM would hopefully have MONITOR/MWAIT
> > available as well, but even if that were the case, the admin could select HLT for
> > idling.
> >
> > Ugh, and I see that bfe6ed0c6727 ("x86/tdx: Add HLT support for TDX guests")
> > overrides default_idle(). The kernel really shouldn't do that, because odds are
> > decent that any TDX guest will have direct access to HLT. The best approach I
> > can think of would be to patch x86_idle() to tdx_safe_halt() if and only if a HLT
> > #VE is taken. The tricky part would be delaying the update until it's safe to do
> > so.
>
> I am confused. HLT triggers #VE unconditionally in TDX guests. How would
> TDX guest have direct access to HLT?

Gah, you're not confused, I am. I was thinking of the SEV-ES model where intercepts
are morphed to #VC.

> Even if it would in the future, it is going to be an explicit opt-in from
> the guest and we can avoid setting x86_idle() for such cases.

Or explicit enumeration from the TDX module.

> > As for taking a #VE, the exception itself is fine (assuming the kernel isn't off
> > the rails and using a trap gate :-D). The issue is likely that RFLAGS.IF=1 on
> > the stack, and so the call to cond_local_irq_enable() enables IRQs before making
> > the hypercall. E.g. no one has complained about #VC, because exc_vmm_communication()
> > doesn't enable IRQs.
> >
> > Off the top of my head, I can't think of any flows that would do HLT with IRQs
> > fully enabled. Even PV spinlocks use safe_halt(), e.g. in kvm_wait(), so I don't
> > think there's any value in trying to precisely identify that it's a safe HLT?
>
> I can only think of "CPU is dead" use-case of HLT where interrupts are
> enabled. But I hate special-casing HLT in exc_virtualization_exception() :/

Ignore me, overriding at boot time is the way to go.
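
To make "overriding at boot time" concrete, here is a minimal userspace model
of the pattern: an idle function pointer that is switched exactly once, before
any CPU idles. It is only a sketch; every model_* name is hypothetical and none
of this is the kernel's actual x86_idle plumbing.

```c
#include <stdbool.h>

/* Hypothetical stand-ins, not the kernel's real definitions. */
typedef void (*model_idle_fn)(void);

static int model_native_halts;   /* times the native HLT path was taken */
static int model_paravirt_halts; /* times the hypercall path was taken */

/* Models default_idle(): would execute STI;HLT natively. */
static void model_default_idle(void)
{
	model_native_halts++;
}

/* Models tdx_safe_halt(): would ask the hypervisor to halt the vCPU. */
static void model_tdx_safe_halt(void)
{
	model_paravirt_halts++;
}

/* Models the idle hook; native HLT is the default. */
static model_idle_fn model_x86_idle = model_default_idle;

/*
 * Assumed to run once during early boot, before the idle loop starts,
 * so no CPU can be executing through the old pointer while it changes.
 */
static void model_select_idle(bool is_tdx_guest)
{
	if (is_tdx_guest)
		model_x86_idle = model_tdx_safe_halt;
}
```

Because the selection happens before anything idles, there is no "delay the
update until it's safe" problem, and nothing in the #VE handler needs to
special-case HLT.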

> > E.g. this should fix the immediate problem, and then ideally someone would make
> > TDX guests play nice with native HLT.
>
> I've asked (some time ago) TDX module folks to provide interruptibility
> state as part of the guest so we can handle STI shadow properly, not as a
> hack around HLT.
>
> The immediate problem can be addressed by fixing the BIOS to not advertise
> C-states (if I read the situation right).

No, something like Vishal proposed is a better fix. It's still desirable for the
vCPU to call out to the hypervisor when going idle, otherwise a vCPU that is idle
for an extended duration will never let the pCPU go idle.
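
For reference, the RFLAGS.IF concern quoted above can be modeled in userspace
as follows. This is a rough sketch under stated assumptions: the model_* names
are hypothetical, the handlers are reduced to the one decision that matters
(whether IRQs get re-enabled before the halt hypercall), and nothing here is
the real exc_virtualization_exception() or exc_vmm_communication() code.

```c
#include <stdbool.h>

#define MODEL_EFLAGS_IF (1UL << 9) /* IF is bit 9 of RFLAGS */

/* Hypothetical, trimmed-down stand-in for the saved register frame. */
struct model_regs {
	unsigned long flags; /* RFLAGS of the interrupted context */
};

static bool model_irqs_on;          /* current IRQ state in the handler */
static bool model_irqs_on_at_hcall; /* IRQ state seen when the hypercall runs */

/* Re-enable IRQs only if the interrupted context had them enabled. */
static void model_cond_local_irq_enable(const struct model_regs *regs)
{
	if (regs->flags & MODEL_EFLAGS_IF)
		model_irqs_on = true;
}

/* Records the IRQ state at the moment the halt hypercall is issued. */
static void model_halt_hypercall(void)
{
	model_irqs_on_at_hcall = model_irqs_on;
}

/* #VE-style flow: conditionally re-enables IRQs before the hypercall. */
static void model_ve_handler(const struct model_regs *regs)
{
	model_irqs_on = false; /* interrupt gate: entry clears IF */
	model_cond_local_irq_enable(regs);
	model_halt_hypercall();
}

/* #VC-style flow: never re-enables IRQs before the hypercall. */
static void model_vc_handler(const struct model_regs *regs)
{
	(void)regs;
	model_irqs_on = false;
	model_halt_hypercall();
}
```

With STI;HLT the saved flags have IF=1, so the #VE-style flow issues the
hypercall with IRQs already enabled and a wakeup IRQ can be consumed before the
vCPU actually halts. The #VC-style flow doesn't have that window.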