Re: [RFC v2-fix 1/1] x86/tdx: Handle in-kernel MMIO

From: Sean Christopherson
Date: Tue May 18 2021 - 12:18:41 EST


On Tue, May 18, 2021, Dave Hansen wrote:
> On 5/17/21 5:48 PM, Kuppuswamy Sathyanarayanan wrote:
> > From: "Kirill A. Shutemov" <kirill.shutemov@xxxxxxxxxxxxxxx>
> >
> > In traditional VMs, MMIO tends to be implemented by giving a
> > guest access to a mapping which will cause a VMEXIT on access.
> > That's not possible in TDX guest.
>
> Why is it not possible?

It is possible, and in fact KVM will cause a VM-Exit on the first access to a
given MMIO page. The problem is that guest state is inaccessible and so the VMM
cannot do the front end of MMIO instruction emulation.

> > So use #VE to implement MMIO support. In TDX guest, MMIO triggers #VE
> > with EPT_VIOLATION exit reason.

It's more accurate to say that the VMM will configure EPT entries for pages that
require instruction emulation to cause #VE.

> What does the #VE handler do to resolve the exception?
>
> > For now we only handle a subset of instructions that the kernel
> > uses for MMIO operations. User-space access triggers SIGBUS.
>
> How do you know which instructions the kernel uses? How do you know
> that the compiler won't change them?
>
> I guess the kernel won't boot far if this happens, but this still sounds
> like trial-and-error programming.

If a driver accesses MMIO through a struct overlay, all bets are off. The I/O
APIC code does this, but that problem is "solved" by forcefully disabling the
I/O APIC since it's useless for the current incarnation of TDX. IIRC, some of
the console code also accesses MMIO via a struct (or maybe just through generic
C code), and the compiler does indeed employ a wider variety of instructions.

So yeah, whack-a-mole.

> > Also, reasons for supporting #VE based MMIO in TDX guest are,
> >
> > * MMIO is widely used and we'll have more drivers in the future.
>
> OK, but you've also made a big deal about having to go explicitly audit
> these drivers. I would imagine converting these over to stop using MMIO
> would be _relatively_ minor compared

For drivers that use the kernel's macros, converting them to use TDVMCALL
directly will be trivial and shouldn't even require any modifications to the
driver. For drivers that use a struct overlay or generic C code, the "conversion"
could require a complete rewrite of the driver.

> to a big security audit and new fuzzing infrastructure.
>
> > * We don't want to annotate every TDX specific MMIO readl/writel etc.
>
> ^ TDX-specific
>
> > * If we didn't annotate we would need to add an alternative to every
> > MMIO access in the kernel (even though 99.9% will never be used on
> > TDX) which would be a complete waste and incredible binary bloat
> > for nothing.
>
> That sounds like something objective we can measure. Does this cost 1
> byte of extra text per readl/writel? 10? 100?

Agreed. And IMO, it's worth converting the common case (macros) if the overhead
is acceptable, while leaving the #VE handling in place for non-standard code.

> > +static int tdg_handle_mmio(struct pt_regs *regs, struct ve_info *ve)

...

> > + return 0;
> > + }
> > +
> > + kernel_insn_init(&insn, (void *) regs->ip, MAX_INSN_SIZE);
> > + insn_get_length(&insn);
> > + insn_get_opcode(&insn);
> > +
> > + write = ve->exit_qual & 0x2;
> > +
> > + size = insn.opnd_bytes;
> > + switch (insn.opcode.bytes[0]) {
> > + /* MOV r/m8 r8 */
> > + case 0x88:
> > + /* MOV r8 r/m8 */
> > + case 0x8A:
> > + /* MOV r/m8 imm8 */
> > + case 0xC6:
>
> FWIW, I find that *REALLY* hard to read.

Why does this code exist at all? TDX and SEV-ES absolutely must share code for
handling MMIO reflection. It will require a fair amount of refactoring to move
the guts of vc_handle_mmio() to common code, but there is zero reason to maintain
two separate versions of the opcode cracking.

Ditto for string I/O in vc_handle_ioio().

> What happens if this is an MMIO operation that *partially* touches MMIO
> and partially touches normal memory? Let's say I wrote two bytes
> (0x1234), starting at the last byte of a RAM page that ran over into an
> MMIO page. The fault would occur trying to write 0x34 to the MMIO, but
> the instruction cracking would result in trying to write 0x1234 into the
> MMIO.
>
> It doesn't seem *that* outlandish that an MMIO might cross a page
> boundary. Would this work for a two-byte MMIO that crosses a page?

I'm pretty sure we can get away with panic (kernel) and SIGBUS (userspace) on
a reflected memory access that splits a page. Yes, it's theoretically possible
and probably even "works", but practically speaking no emulated MMIO device is
going to have a single logical register/thing split a page, and I can't think of
any reason to allow accessing multiple registers/things across a page split.

The existing SEV-ES #VC handlers appear to be missing page split checks, so that
needs to be fixed.