Re: [RFC v2-fix 1/1] x86/tdx: Handle in-kernel MMIO

From: Dave Hansen
Date: Tue May 18 2021 - 11:01:06 EST


On 5/17/21 5:48 PM, Kuppuswamy Sathyanarayanan wrote:
> From: "Kirill A. Shutemov" <kirill.shutemov@xxxxxxxxxxxxxxx>
>
> In traditional VMs, MMIO tends to be implemented by giving a
> guest access to a mapping which will cause a VMEXIT on access.
> That's not possible in TDX guest.

Why is it not possible?

> So use #VE to implement MMIO support. In TDX guest, MMIO triggers #VE
> with EPT_VIOLATION exit reason.

What does the #VE handler do to resolve the exception?

> For now we only handle a subset of instructions that the kernel
> uses for MMIO operations. User-space access triggers SIGBUS.

How do you know which instructions the kernel uses? How do you know
that the compiler won't change them?

I guess the kernel won't boot far if this happens, but this still sounds
like trial-and-error programming.

> Also, reasons for supporting #VE based MMIO in TDX guest are,
>
> * MMIO is widely used and we'll have more drivers in the future.

OK, but you've also made a big deal about having to go explicitly audit
these drivers. I would imagine converting these over to stop using MMIO
would be _relatively_ minor compared to a big security audit and new
fuzzing infrastructure.

> * We don't want to annotate every TDX specific MMIO readl/writel etc.

^ TDX-specific

> * If we didn't annotate we would need to add an alternative to every
> MMIO access in the kernel (even though 99.9% will never be used on
> TDX) which would be a complete waste and incredible binary bloat
> for nothing.

That sounds like something objective we can measure. Does this cost 1
byte of extra text per readl/writel? 10? 100?

You're also being rather indirect about what solutions you ruled out.
Why not just say: we considered doing ____, but ruled that out because
it would have required ____. Above you just tell us what the solution
required without mentioning the solution.

> diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
> index b9e3010987e0..9330c7a9ad69 100644
> --- a/arch/x86/kernel/tdx.c
> +++ b/arch/x86/kernel/tdx.c
> @@ -5,6 +5,8 @@
>
> #include <asm/tdx.h>
> #include <asm/vmx.h>
> +#include <asm/insn.h>
> +#include <linux/sched/signal.h> /* force_sig_fault() */
>
> #include <linux/cpu.h>
> #include <linux/protected_guest.h>
> @@ -209,6 +211,101 @@ static void tdg_handle_io(struct pt_regs *regs, u32 exit_qual)
> }
> }
>
> +static unsigned long tdg_mmio(int size, bool write, unsigned long addr,
> + unsigned long val)
> +{
> + return tdx_hypercall_out_r11(EXIT_REASON_EPT_VIOLATION, size,
> + write, addr, val);
> +}
> +
> +static inline void *get_reg_ptr(struct pt_regs *regs, struct insn *insn)
> +{
> + static const int regoff[] = {
> + offsetof(struct pt_regs, ax),
> + offsetof(struct pt_regs, cx),
> + offsetof(struct pt_regs, dx),
> + offsetof(struct pt_regs, bx),
> + offsetof(struct pt_regs, sp),
> + offsetof(struct pt_regs, bp),
> + offsetof(struct pt_regs, si),
> + offsetof(struct pt_regs, di),
> + offsetof(struct pt_regs, r8),
> + offsetof(struct pt_regs, r9),
> + offsetof(struct pt_regs, r10),
> + offsetof(struct pt_regs, r11),
> + offsetof(struct pt_regs, r12),
> + offsetof(struct pt_regs, r13),
> + offsetof(struct pt_regs, r14),
> + offsetof(struct pt_regs, r15),
> + };
> + int regno;
> +
> + regno = X86_MODRM_REG(insn->modrm.value);
> + if (X86_REX_R(insn->rex_prefix.value))
> + regno += 8;
> +
> + return (void *)regs + regoff[regno];
> +}

Was there a reason you copied and pasted this from get_reg_offset()
instead of refactoring? This looks like almost entirely a subset of
get_reg_offset().

> +static int tdg_handle_mmio(struct pt_regs *regs, struct ve_info *ve)
> +{
> + int size;
> + bool write;
> + unsigned long *reg;
> + struct insn insn;
> + unsigned long val = 0;
> +
> + /*
> + * User mode would mean the kernel exposed a device directly
> + * to ring3, which shouldn't happen except for things like
> + * DPDK.
> + */

Uhh....

https://www.kernel.org/doc/html/v4.14/driver-api/uio-howto.html

I thought there were more than a few ways that userspace could get
access to MMIO mappings.

Also, do most people know what DPDK is? Should we even be talking about
silly out-of-tree kernel bypass schemes in kernel comments?

> + if (user_mode(regs)) {
> + pr_err("Unexpected user-mode MMIO access.\n");
> + force_sig_fault(SIGBUS, BUS_ADRERR, (void __user *) ve->gla);

extra space ^

Is a non-ratelimited pr_err() appropriate here? I guess there shouldn't
be any MMIO passthrough to userspace on these systems.

> + return 0;
> + }
> +
> + kernel_insn_init(&insn, (void *) regs->ip, MAX_INSN_SIZE);
> + insn_get_length(&insn);
> + insn_get_opcode(&insn);
> +
> + write = ve->exit_qual & 0x2;
> +
> + size = insn.opnd_bytes;
> + switch (insn.opcode.bytes[0]) {
> + /* MOV r/m8 r8 */
> + case 0x88:
> + /* MOV r8 r/m8 */
> + case 0x8A:
> + /* MOV r/m8 imm8 */
> + case 0xC6:

FWIW, I find that *REALLY* hard to read.

Check out is_string_insn() for a more readable example.

Oh, and I misread that. I read it as "these are all the opcodes we care
about". When, in fact, I _think_ it's all the opcodes that don't have a
size in insn.opnd_bytes.

Could you spell that out, please?

> + size = 1;
> + break;
> + }
> +
> + if (inat_has_immediate(insn.attr)) {
> + BUG_ON(!write);
> + val = insn.immediate.value;

This is pretty interesting. This won't work with implicit accesses. I
guess the limited opcodes above limit how much imprecision will result.
But, it would still be nice to hear something about that.

For instance, if someone pointed a mid-level page table to MMIO, we'd
get a va->gpa that had zero to do with the instruction. Granted, that's
only going to happen if something bonkers is going on, but maybe I'm
missing some simpler cases of implicit accesses.

> + tdg_mmio(size, write, ve->gpa, val);

What happens if this is an MMIO operation that *partially* touches MMIO
and partially touches normal memory? Let's say I wrote two bytes
(0x1234), starting at the last byte of a RAM page that ran over into an
MMIO page. The fault would occur trying to write 0x34 to the MMIO, but
the instruction cracking would result in trying to write 0x1234 into the
MMIO.

It doesn't seem *that* outlandish that an MMIO might cross a page
boundary. Would this work for a two-byte MMIO that crosses a page?

> + return insn.length;
> + }
> +
> + BUG_ON(!inat_has_modrm(insn.attr));

A comment would be nice here about the BUG_ON().

It would also be nice to give a high-level view of what's going on and
what we know about the instruction at this point.

> + reg = get_reg_ptr(regs, &insn);
> +
> + if (write) {
> + memcpy(&val, reg, size);
> + tdg_mmio(size, write, ve->gpa, val);
> + } else {
> + val = tdg_mmio(size, write, ve->gpa, val);
> + memset(reg, 0, size);
> + memcpy(reg, &val, size);
> + }
> + return insn.length;
> +}
> +
> unsigned long tdg_get_ve_info(struct ve_info *ve)
> {
> u64 ret;
> @@ -258,6 +355,9 @@ int tdg_handle_virtualization_exception(struct pt_regs *regs,
> case EXIT_REASON_IO_INSTRUCTION:
> tdg_handle_io(regs, ve->exit_qual);
> break;
> + case EXIT_REASON_EPT_VIOLATION:
> + ve->instr_len = tdg_handle_mmio(regs, ve);
> + break;
> default:
> pr_warn("Unexpected #VE: %lld\n", ve->exit_reason);
> return -EFAULT;
>