Re: [PATCHv5 11/30] x86/tdx: Handle in-kernel MMIO

From: Dave Hansen
Date: Thu Mar 10 2022 - 12:53:15 EST


On 3/10/22 08:48, Kirill A. Shutemov wrote:
> On Wed, Mar 09, 2022 at 05:06:11PM -0800, Dave Hansen wrote:
>> On 3/9/22 16:51, Kirill A. Shutemov wrote:
>>> On Tue, Mar 08, 2022 at 01:26:28PM -0800, Dave Hansen wrote:
>>>> Memory encryption has zero to do with this. The TDX isolation
>>>> mechanisms are totally discrete from memory encryption, although they
>>>> are "neighbors" of sorts.
>>>
>>> Hm. I don't see why you say encryption is not relevant. VMM (host kernel)
>>> has ultimate access to guest memory cypher text. It can read it as
>>> cypher text without any issue (using KeyID-0).
>>>
>>> Could you elaborate on the point?
>>
>> I think you're just confusing what TDX has with MKTME. The whitepaper says:
>>
>>> The TD-bit associated with the line in memory seeks to
>>> detect software or devices attempting to read memory
>>> encrypted with private KeyID, using a shared KeyID, to reveal
>>> the ciphertext. On such accesses, the MKTME returns a fixed
>>> pattern to prevent ciphertext analysis.
>>
>> I think several firstborn were sacrificed to get that bit. Let's not
>> forget why we have it. :)
>
> Okay, I missed the memo. I will drop reference to encryption:
>
> - The CPU disallows software other than the TDX module and TDs from
> making memory accesses using the private key. Without the correct
> key VMM has no way to access TD-private memory.

I think this is good enough:

- All guest code is expected to be in TD-private memory. Being
private to the TD, VMMs have no way to access TD-private memory and
no way to read the instruction to decode and emulate it.

We don't have to rehash what private memory is or how it is implemented.

>> This problem was laid out as having three cases:
>> 1. virtio
>> 2. x86-specific drivers
>> 3. random drivers (everything else)
>>
>> #1 could be done with paravirt
>> #2 is unspecified and unknown
>> #3 use doesn't as far as I know exist in TDX guests today
>
> #2 doesn't matter from performance point of view and there is no
> convenient place where they can be intercepted as they are scattered
> across the tree. Patching them doesn't bring any benefit, only pain.

I'd feel a lot better if this was slightly better specified. Even
booting with a:

printf("rip: %lx\n", regs->rip);

in the #VE handler would give some hard data about these. This still
feels to me like something that Sean got booting two years ago and
nobody has really reconsidered.

> #3 some customers already declared that they will use device passthough
> (yes, it is not safe). CSP may want to emulate random device, depending on
> setup. Like, a power button or something.

I'm not sure I'm totally on board with that.

But, let's try to make a coherent changelog out of that mess.

This approach is bad for performance. But, it has (virtually)
no impact on the size of the kernel image and will work for a
wide variety of drivers. This allows TDX deployments to use
arbitrary devices and device drivers, including virtio. TDX
customers have asked for the capability to use random devices in
their deployments.

In other words, even if all of the work was done to
paravirtualize all x86 MMIO users and virtio, this approach
would still be needed. There is essentially no way to get rid
of this code.

This approach is functional for all in-kernel MMIO users current
and future and does so with a minimal amount of code and kernel
image bloat.

Does that summarize it?

>>> BUG() here makes it clear that the handler itself is buggy. Returning
>>> false and kicking in #GP-like logic indicates that something wrong with
>>> the code that triggered #VE. I think it is an important distinction.
>>
>> OK, then how about a WARN_ON() which is followed by the #GP?
>
> You folks give mixed messages. Thomas was very unhappy when I tried to add
> code that recovers from WBINVD:
>
> https://lore.kernel.org/all/87y22uujkm.ffs@tglx
>
> It is exactly the same scenario: kernel code is buggy and has to be fixed.
>
> So, what the policy?

Lately, I've tried to subscribe to the "there is NO F*CKING EXCUSE to
knowingly kill the kernel" policy[1].

You don't add a BUG_ON() unless the kernel has no meaningful way to
continue. It's not for a "hey, that's weird..." kind of thing. Like,
"hey, the instruction decoder looks confused, that's weird."

>> Let's say insn_decode_mmio() does something insane like:
>>
>> return -EINVAL;
>>
>> Should we really be killing the kernel for that?
>
> Note that #GP is most likely kill kernel as well. We handle in-kernel
> MMIO. There are no many chances for recover.
>
> Is it really the big deal?

No, not really.

But, I'd like to see one of two things:
1. Change the BUG()s to WARN()s.
2. Make it utterly clear that handle_mmio() is for handling kernel MMIO
only. Definitely change the naming, possibly add a check for
user_mode(). In other words, make it even _less_ generic.

None of that should be hard.

BTW, the BUG()s made me think about how the gp_try_fixup_and_notify()
code would work for MMIO. For instance, are there any places where
fixup might be done for MMIO? If so, an earlier BUG() wouldn't allow
the fixup to occur.

Do we *WANT* #VE's to be exposed to the #GP fixup machinery?

1.
https://lore.kernel.org/all/CA+55aFwyNTLuZgOWMTRuabWobF27ygskuxvFd-P0n-3UNT=0Og@xxxxxxxxxxxxxx/