Re: [PATCH V2 1/1] x86/tdx: Route safe halt execution via tdx_safe_halt()

From: Dave Hansen
Date: Tue Feb 04 2025 - 12:33:04 EST


I think this is the right fix for _now_. In practice, Vishal's problem
only occurs on CONFIG_PARAVIRT_XXL systems. His proposed fix here does
not make TDX depend on CONFIG_PARAVIRT_XXL, it just provides an extra
override when TDX and CONFIG_PARAVIRT_XXL collide.

This seems like a reasonable compromise that avoids entangling
PARAVIRT_XXL and TDX _too_ much and also avoids reinventing a hunk of
PARAVIRT_XXL just to fix this bug.

Long-term, I think it would be nice to move pv_ops.irq.safe_halt() away
from being a paravirt thing and move it over to a plain static_call().

Then, TDX can get rid of this hunk:

pr_info("using TDX aware idle routine\n");
static_call_update(x86_idle, tdx_safe_halt);

and move back to default_idle() which could look like this:

void __cpuidle default_idle(void)
{
- raw_safe_halt();
+ static_call(x86_safe_halt)();
raw_local_irq_disable();
}

If 'x86_safe_halt' was the only route in the kernel to call 'sti;hlt'
then we can know with pretty high confidence if TDX or Xen code sets
their own 'x86_safe_halt' that they won't run into more bugs like this one.

On to the patch itself...

On 1/29/25 15:25, Vishal Annapurve wrote:
> Direct HLT instruction execution causes #VEs for TDX VMs which is routed
> to hypervisor via tdvmcall. This process renders HLT instruction
> execution inatomic, so any preceding instructions like STI/MOV SS will
> end up enabling interrupts before the HLT instruction is routed to the
> hypervisor. This creates scenarios where interrupts could land during
> HLT instruction emulation without aborting halt operation leading to
> idefinite halt wait times.

Vishal! I'm noticing spelling issues right up front and center here,
just like in v1. I think I asked nicely last time if you could start
spell checking your changelogs before posting v2. Any chance you could
actually put some spell checking in place before v3?

So, the x86 STI-shadow mechanism has left a trail of tears. We don't
want to explain the whole sordid tale here, but I don't feel like
talking about the "what" (atomic vs. inatomic execution) without
explaining "why" is really sufficient to explain the problem at hand.

Sean had a pretty concise description in here that I liked:

https://lore.kernel.org/all/Z5l6L3Hen9_Y3SGC@xxxxxxxxxx/

But the net result is that it is currently unsafe for TDX guests to use
the "sti;hlt" sequence. It's really important to say *that* somewhere.

> Commit bfe6ed0c6727 ("x86/tdx: Add HLT support for TDX guests") already
> upgraded x86_idle() to invoke tdvmcall to avoid such scenarios, but
> it didn't cover pv_native_safe_halt() which can be invoked using
> raw_safe_halt() from call sites like acpi_safe_halt().

Does this convey the same thing?

Commit bfe6ed0c6727 ("x86/tdx: Add HLT support for TDX guests")
prevented the idle routines from using "sti;hlt". But it missed the
paravirt routine. That can be reached like this, for example:

acpi_safe_halt() =>
raw_safe_halt() =>
arch_safe_halt() =>
irq.safe_halt() =>
pv_native_safe_halt()

I also dislike the "upgrade" nomenclature. It's not really an "upgrade".

...
> @@ -380,13 +381,18 @@ static int handle_halt(struct ve_info *ve)
> {
> const bool irq_disabled = irqs_disabled();
>
> + if (!irq_disabled) {
> + WARN_ONCE(1, "HLT instruction emulation unsafe with irqs enabled\n");
> + return -EIO;
> + }

The warning is fine, but I do think it should be separated from the bug fix.

>
> -void __cpuidle tdx_safe_halt(void)
> +void __cpuidle tdx_idle(void)
> {
> const bool irq_disabled = false;
>
> @@ -397,6 +403,12 @@ void __cpuidle tdx_safe_halt(void)
> WARN_ONCE(1, "HLT instruction emulation failed\n");
> }
>
> +static void __cpuidle tdx_safe_halt(void)
> +{
> + tdx_idle();
> + raw_local_irq_enable();
> +}

The naming here is a bit wonky. Think of how the call chain will look:

irq.safe_halt() =>
tdx_safe_halt() =>
tdx_idle() =>
__halt()

See how it's doing a more and more TDX-specific halt operation? Isn't
the "idle" call right in the middle confusing?

> static int read_msr(struct pt_regs *regs, struct ve_info *ve)
> {
> struct tdx_module_args args = {
> @@ -1083,6 +1095,15 @@ void __init tdx_early_init(void)
> x86_platform.guest.enc_kexec_begin = tdx_kexec_begin;
> x86_platform.guest.enc_kexec_finish = tdx_kexec_finish;
>
> +#ifdef CONFIG_PARAVIRT_XXL
> + /*
> + * halt instruction execution is not atomic for TDX VMs as it generates
> + * #VEs, so otherwise "safe" halt invocations which cause interrupts to
> + * get enabled right after halt instruction don't work for TDX VMs.
> + */
> + pv_ops.irq.safe_halt = tdx_safe_halt;
> +#endif

Just like the changelog, it's hard to write a good comment without going
into the horrors of the STI-shadow. But I think this is a bit more to
the point:

/*
* Avoid the literal hlt instruction in TDX guests. hlt will
* induce a #VE in the STI-shadow which will enable interrupts
* in a place where they are not wanted.
*/