Re: [PATCH v19 007/130] x86/virt/tdx: Export SEAMCALL functions

From: kirill.shutemov@xxxxxxxxxxxxxxx
Date: Fri Apr 19 2024 - 10:46:20 EST


On Thu, Apr 18, 2024 at 11:26:11AM -0700, Sean Christopherson wrote:
> On Thu, Apr 18, 2024, kirill.shutemov@xxxxxxxxxxxxxxx wrote:
> > On Tue, Apr 16, 2024 at 07:45:18PM +0000, Edgecombe, Rick P wrote:
> > > On Wed, 2024-04-10 at 15:49 +0300, Kirill A. Shutemov wrote:
> > > > On Fri, Mar 15, 2024 at 09:33:20AM -0700, Sean Christopherson wrote:
> > > > > So my feedback is to not worry about the exports, and instead focus on
> > > > > figuring
> > > > > out a way to make the generated code less bloated and easier to read/debug.
> > > >
> > > > I think it was mistake trying to centralize TDCALL/SEAMCALL calls into
> > > > few megawrappers. I think we can get better results by shifting leaf
> > > > function wrappers into assembly.
> > > >
> > > > We are going to have more assembly, but it should produce better result.
> > > > Adding macros can help to write such wrapper and minimizer boilerplate.
> > > >
> > > > Below is an example of how it can look like. It's not complete. I only
> > > > converted TDCALLs, but TDVMCALLs or SEAMCALLs. TDVMCALLs are going to be
> > > > more complex.
> > > >
> > > > Any opinions? Is it something worth investing more time?
> > >
> > > We discussed offline how implementing these for each TDVM/SEAMCALL increases the
> > > chances of a bug in just one TDVM/SEAMCALL. Which could making debugging
> > > problems more challenging. Kirill raised the possibility of some code generating
> > > solution like cpufeatures.h, that could take a spec and generate correct calls.
> > >
> > > So far no big wins have presented themselves. Kirill, do we think the path to
> > > move the messy part out-of-line will not work?
> >
> > I converted all TDCALL and TDVMCALL leafs to direct assembly wrappers.
> > Here's WIP branch: https://github.com/intel/tdx/commits/guest-tdx-asm/
> >
> > I still need to clean it up and write commit messages and comments for all
> > wrappers.
> >
> > Now I think it worth the shot.
> >
> > Any feedback?
>
> I find it hard to review for correctness, and extremely susceptible to developer
> error. E.g. lots of copy+paste, and manual encoding of RCX to expose registers.

Yes, I agree. The approach requires careful manual work and is error-prone.
I was planning to get around and stare at every wrapper to make sure it is correct.
This approach is not scalable, though.

> It also bleeds TDX ABI into C code, e.g.
>
> /*
> * As per TDX GHCI CPUID ABI, r12-r15 registers contain contents of
> * EAX, EBX, ECX, EDX registers after the CPUID instruction execution.
> * So copy the register contents back to pt_regs.
> */
> regs->ax = args.r12;
> regs->bx = args.r13;
> regs->cx = args.r14;
> regs->dx = args.r15;
>
> Oh, and it requires input/output paramters, which is quite gross for C code *and*
> for assembly code, e.g.
>
> u64 tdvmcall_map_gpa(u64 *gpa, u64 size);
>
> and then the accompanying assembly code:
>
> FRAME_BEGIN
>
> save_regs r12,r13
>
> movq (%rdi), %r12
> movq %rsi, %r13
>
> movq $(TDX_R10 | TDX_R11 | TDX_R12 | TDX_R13), %rcx
>
> tdvmcall $TDVMCALL_MAP_GPA
>
> movq %r11, (%rdi)
>
> restore_regs r13,r12
>
> FRAME_END
> RET
>
> I think having one trampoline makes sense, e.g. to minimize the probability of
> leaking register state to the VMM. The part that I don't like, and which generates
> awful code, is shoving register state into a memory structure.

I don't think we can get away with single trampoline. We have outliers.

See TDG.VP.VMCALL<ReportFatalError> that uses pretty much all registers as
input. And I hope we wouldn't need TDG.VP.VMCALL<Instruction.PCONFIG> any
time soon. It uses all possible output registers.

But I guess we can make a *few* wrappers that covers all needed cases.

> The annoying part with the TDX ABI is that it heavily uses r8-r15, and asm()
> constraints don't play nice with r8-15. But that doesn't mean we can't use asm()
> with macros, it just means we have to play games with registers.
>
> Because REG-REG moves are super cheap, and ignoring the fatal error goofiness,
> there are at most four inputs. That means having a single trampoline take *all*
> possible inputs is a non-issue. And we can avoiding polluting the inline code if
> we bury the register shuffling in the trampoline.
>
> And if we use asm() wrappers to call the trampoline, then the trampoline doesn't
> need to precisely follow the C calling convention. I.e. the trampoline can return
> with the outputs still in r12-r15, and let the asm() wrappers extract the outputs
> they want.
>
> As it stands, TDVMCALLs either have 0, 1, or 4 outputs. I.e. we only need three
> asm() wrappers. We could get away with one wrapper, but then users of the wrappers
> would need dummy variables for inputs *and* outputs, and the outputs get gross.
>
> Completely untested, but this is what I'm thinking. Side topic, I think making
> "tdcall" a macro that takes a leaf is a mistake. If/when an assembler learns what
> tdcall is, we're going to have to rewrite all of that code. And what a coincidence,
> my suggestion needs a bare TDCALL! :-)

I guess rename the macro (if it still needed) to something like
tdcall_leaf or something.

> Side topic #2, I don't think the trampoline needs a stack frame, its a leaf function.

Hm. I guess.

> Side topic #3, the ud2 to induce panic should be out-of-line.

Yeah. I switched to the inline one while debugging one section mismatch
issue and forgot to switch back.

> Weird? Yeah. But at least we one need to document one weird calling convention,
> and the ugliness is contained to three macros and a small assembly function.

Okay, the approach is worth exploring. I can work on it.

You focuses here on TDVMCALL. What is your take on the rest of TDCALL?

> .pushsection .noinstr.text, "ax"
> SYM_FUNC_START(tdvmcall_trampoline)
> movq $TDX_HYPERCALL_STANDARD, %r10
> movq %rax, %r11
> movq %rdi, %r12
> movq %rsi, %r13
> movq %rdx, %r14
> movq %rcx, %r15
>
> movq $(TDX_R10 | TDX_R11 | TDX_R12 | TDX_R13 | TDX_R14 | TDX_R15), %rcx
>
> tdcall
>
> testq %rax, %rax
> jnz .Lpanic
>
> ret
>
> .Lpanic:
> ud2
> SYM_FUNC_END(tdvmcall_trampoline)
> .popsection
>
>
> #define TDVMCALL(reason, in1, in2, in3, in4) \
> ({ \
> long __ret; \
> \
> asm( \
> "call tdvmcall_trampoline\n\t" \
> "mov %%r10, %0\n\t" \
> : "=r" (__ret) \
> : "D" (in1), "S"(in2), "d"(in3), "c" (in4) \
> : "r12", "r13", "r14", "r15" \
> ); \
> __ret; \
> })
>
> #define TDVMCALL_1(reason, in1, in2, in3, in4, out1) \
> ({ \
> long __ret; \
> \
> asm( \
> "call tdvmcall_trampoline\n\t" \
> "mov %%r10, %0\n\t" \
> "mov %%r12, %1\n\t" \

It is r11, not r12.

> : "=r"(__ret) "=r" (out1) \
> : "a"(reason), "D" (in1), "S"(in2), "d"(in3), "c" (in4) \
> : "r12", "r13", "r14", "r15" \
> ); \
> __ret; \
> })
>
> #define TDVMCALL_4(reason, in1, in2, in3, in4, out1, out2, out3, out4) \
> ({ \
> long __ret; \
> \
> asm( \
> "call tdvmcall_trampoline\n\t" \
> "mov %%r10, %0\n\t" \
> "mov %%r12, %1\n\t" \
> "mov %%r13, %2\n\t" \
> "mov %%r14, %3\n\t" \
> "mov %%r15, %4\n\t" \
> : "=r" (__ret), \
> "=r" (out1), "=r" (out2), "=r" (out3), "=r" (out4) \
> : "a"(reason), "D" (in1), "S"(in2), "d"(in3), "c" (in4) \
> [reason] "i" (reason) \
> : "r12", "r13", "r14", "r15" \
> ); \
> __ret; \
> })
>
> static int handle_halt(struct ve_info *ve)
> {
> if (TDVMCALL(EXIT_REASON_HALT, irqs_disabled(), 0, 0, 0))
> return -EIO;
>
> return ve_instr_len(ve);
> }
>
> void __cpuidle tdx_safe_halt(void)
> {
> WARN_ONCE(TDVMCALL(EXIT_REASON_HALT, false, 0, 0, 0),
> "HLT instruction emulation failed");
> }
>
> static int read_msr(struct pt_regs *regs, struct ve_info *ve)
> {
> u64 val;
>
> if (TDVMCALL_1(EXIT_REASON_MSR_READ, regs->cx, 0, 0, 0, val))
> return -EIO;
>
> regs->ax = lower_32_bits(val);
> regs->dx = upper_32_bits(val);
>
> return ve_instr_len(ve);
> }
>
> static int write_msr(struct pt_regs *regs, struct ve_info *ve)
> {
> u64 val = (u64)regs->dx << 32 | regs->ax;
>
> if (TDVMCALL(EXIT_REASON_MSR_WRITE, regs->cx, val, 0, 0))
> return -EIO;
>
> return ve_instr_len(ve);
> }
> static int handle_cpuid(struct pt_regs *regs, struct ve_info *ve)
> {
> /*
> * Only allow VMM to control range reserved for hypervisor
> * communication.
> *
> * Return all-zeros for any CPUID outside the range. It matches CPU
> * behaviour for non-supported leaf.
> */
> if (regs->ax < 0x40000000 || regs->ax > 0x4FFFFFFF) {
> regs->ax = regs->bx = regs->cx = regs->dx = 0;
> return ve_instr_len(ve);
> }
>
> if (TDVMCALL_4(EXIT_REASON_CPUID, regs->ax, regs->cx, 0, 0,
> regs->ax, regs->bx, regs->cx, regs->dx))
> return -EIO;
>
> return ve_instr_len(ve);
> }
>
> static bool mmio_read(int size, u64 gpa, u64 *val)
> {
> *val = 0;
> return !TDVMCALL_1(EXIT_REASON_EPT_VIOLATION, size, EPT_READ, gpa, 0, val);
> }
>
> static bool mmio_write(int size, u64 gpa, u64 val)
> {
> return !TDVMCALL(EXIT_REASON_EPT_VIOLATION, size, EPT_WRITE, gpa, val);
> }

--
Kiryl Shutsemau / Kirill A. Shutemov