Re: [RFC v1 12/26] x86/tdx: Handle in-kernel MMIO

From: Dave Hansen
Date: Thu Apr 01 2021 - 18:53:53 EST


On 4/1/21 3:26 PM, Sean Christopherson wrote:
> On Thu, Apr 01, 2021, Dave Hansen wrote:
>> On 2/5/21 3:38 PM, Kuppuswamy Sathyanarayanan wrote:
>>> From: "Kirill A. Shutemov" <kirill.shutemov@xxxxxxxxxxxxxxx>
>>>
>>> Handle #VE due to MMIO operations. MMIO triggers #VE with EPT_VIOLATION
>>> exit reason.
>>>
>>> For now we only handle a subset of the instructions that the kernel uses for
>>> MMIO operations. User-space access triggers SIGBUS.
>> ..
>>> + case EXIT_REASON_EPT_VIOLATION:
>>> + ve->instr_len = tdx_handle_mmio(regs, ve);
>>> + break;
>>
>> Is MMIO literally the only thing that can cause an EPT violation for TDX
>> guests?
>
> Any EPT Violation, or specifically EPT Violation #VE? Any memory access can
> cause an EPT violation, but the VMM will get the ones that lead to VM-Exit. The
> guest will only get the ones that cause #VE.

I'll rephrase: Is MMIO literally the only thing that can cause us to get
into the EXIT_REASON_EPT_VIOLATION case of the switch() here?

> Assuming you're asking about #VE... No, any shared memory access can take a #VE
> since the VMM controls the shared EPT tables and can clear the SUPPRESS_VE bit
> at any time. But, if the VMM is friendly, #VE should be limited to MMIO.

OK, but what are we doing in the case of unfriendly VMMs? What does
*this* code do as-is, and where do we want to take it?

From the _looks_ of this patch, tdx_handle_mmio() is the be-all and end-all
solution to all EXIT_REASON_EPT_VIOLATION events.
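
To make the concern concrete, here is a minimal userspace sketch (not code
from the patch) of a handler that first checks whether the faulting guest
physical address lies in a range the guest itself mapped as I/O memory, and
treats anything else as fatal. Everything in it is an assumption made for
illustration: the names classify_ve(), known_mmio[] and
SKETCH_EXIT_REASON_EPT_VIOLATION, the exit-reason value, and the two address
ranges.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed value, named after the case label quoted from the patch. */
#define SKETCH_EXIT_REASON_EPT_VIOLATION 48

enum ve_verdict {
	VE_EMULATE_MMIO,	/* GPA is in a range the guest mapped as MMIO */
	VE_FATAL,		/* unexpected #VE: kernel bug or unfriendly VMM */
};

struct mmio_range {
	uint64_t start;
	uint64_t len;
};

/* Hypothetical table of ranges the guest set up as I/O memory. */
static const struct mmio_range known_mmio[] = {
	{ 0xfec00000ULL, 0x1000 },
	{ 0xfed00000ULL, 0x400 },
};

bool gpa_is_known_mmio(uint64_t gpa)
{
	for (size_t i = 0; i < sizeof(known_mmio) / sizeof(known_mmio[0]); i++) {
		assert(known_mmio[i].len > 0);	/* table entries must be non-empty */
		/* Subtract first so that start + len cannot overflow. */
		if (gpa >= known_mmio[i].start &&
		    gpa - known_mmio[i].start < known_mmio[i].len)
			return true;
	}
	return false;
}

/* Decide what to do with a #VE instead of assuming it is always MMIO. */
enum ve_verdict classify_ve(unsigned int exit_reason, uint64_t gpa)
{
	if (exit_reason != SKETCH_EXIT_REASON_EPT_VIOLATION)
		return VE_FATAL;
	return gpa_is_known_mmio(gpa) ? VE_EMULATE_MMIO : VE_FATAL;
}
```

With a check of this kind, a #VE on ordinary shared memory would be reported
as a bug rather than silently passed to the MMIO emulation path. Whether the
real handler should do this is exactly the open question above.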

>> But for an OS where we have source for the *ENTIRE* thing, and where we
>> have a chokepoint for MMIO accesses (arch/x86/include/asm/io.h), it
>> seems like an *AWFUL* idea to:
>> 1. Have the kernel set up special mappings for I/O memory
>> 2. Kernel generates special instructions to access that memory
>> 3. Kernel faults on that memory
>> 4. Kernel cracks its own special instructions to see what they were
>> doing
>> 5. Kernel calls up to host to do the MMIO
>>
>> Instead of doing 2/3/4, why not just have #2 call up to the host
>> directly? This patch seems a very slow, roundabout way to do
>> paravirtualized MMIO.
>>
>> BTW, there's already some SEV special-casing in io.h.
>
> I implemented #2 a while back for build_mmio_{read,write}(), I'm guessing the
> code is floating around somewhere. The gotcha is that there are nasty little
> pieces of the kernel that don't use the helpers provided by io.h, e.g. the I/O
> APIC code likes to access MMIO via a struct overlay, so the compiler is free to
> use any instruction that satisfies the constraint.

So, there aren't an infinite number of these. It's also 100% possible
to add some tooling to the kernel today to help you find these. You
could also have added tooling to KVM hosts to help find these.
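
The difference between the two access styles can be shown with a small
userspace sketch. It is only an illustration under stated assumptions:
sketch_readl(), sketch_host_mmio_read32() and struct sketch_ioapic are
hypothetical names, and the "host call" is just a counter plus a plain load
standing in for a real request to the host.

```c
#include <assert.h>
#include <stdint.h>

/* Counts requests to the stand-in "host"; a real guest would call out to the VMM. */
static unsigned int host_mmio_calls;

/* Hypothetical stand-in for a paravirtual MMIO read. */
uint32_t sketch_host_mmio_read32(const volatile void *addr)
{
	host_mmio_calls++;
	return *(const volatile uint32_t *)addr;
}

/* Helper-style accessor: every read passes through one chokepoint. */
uint32_t sketch_readl(const volatile void *addr)
{
	assert(addr);
	return sketch_host_mmio_read32(addr);
}

/* Struct-overlay style: registers are reached by dereferencing a struct. */
struct sketch_ioapic {
	uint32_t index;
	uint32_t unused[3];
	uint32_t data;
};

uint32_t overlay_read_data(const volatile struct sketch_ioapic *io)
{
	/* Plain load chosen by the compiler; the chokepoint never sees it. */
	return io->data;
}

unsigned int sketch_host_calls(void)
{
	return host_mmio_calls;
}
```

A read through sketch_readl() bumps the counter; a read through
overlay_read_data() does not. In a TDX guest the second kind is the one that
would end up in the #VE path, and it is also the kind that tooling or an
audit would have to find.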

Folks are *also* saying that we'll need a driver audit just to trust
that drivers aren't vulnerable to attacks from devices or from the host.
This can quite easily be a part of that effort.

> The I/O APIC can and should be forced off, but dollars to donuts says there are
> more special snowflakes lying in wait. If the kernel uses an allowlist for
> drivers, then in theory it should be possible to hunt down all offenders. But
> I think we'll want fallback logic to handle kernel MMIO #VEs, especially if the
> kernel needs ISA cracking logic for userspace. Without fallback logic, any MMIO
> #VE from the kernel would be fatal, which is too harsh IMO since the behavior
> isn't so obviously wrong, e.g. versus the split lock #AC purge where there's no
> legitimate reason for the kernel to generate a split lock.

I'll buy that this patch is convenient for *debugging*. It helped folks
bootstrap the TDX support and get it going.

IMNHO, if a driver causes a #VE, it's a bug. Just like if it goes off
the rails and touches bad memory and #GP's or #PF's.

Are there any printk's in the #VE handler? Guess what those do. Print
to the console. Guess what consoles do. MMIO. You can't get away from
doing audits of the console drivers. Sure, you can go make #VE special,
like NMIs, but that's not going to be fun. At least the guest doesn't
have to deal with the fatality of a nested #VE, but it's still fatal.

I just don't like us pretending that we're Windows and have no control
over the code we run and throwing up our hands.