Re: [PATCH v3 00/21] TDX host kernel support

From: Dave Hansen
Date: Wed Apr 27 2022 - 17:59:06 EST


On 4/26/22 18:15, Kai Huang wrote:
> On Tue, 2022-04-26 at 13:13 -0700, Dave Hansen wrote:
>> On 4/5/22 21:49, Kai Huang wrote:
>>> SEAM VMX root operation is designed to host a CPU-attested, software
>>> module called the 'TDX module' which implements functions to manage
>>> crypto protected VMs called Trust Domains (TD). SEAM VMX root is also
>>
>> "crypto protected"? What the heck is that?
>
> How about "crypto-protected"? I googled and it seems it is used by someone
> else.

Cryptography itself doesn't provide (much) protection in the TDX
architecture. TDX guests are isolated from the VMM in ways that
traditional guests are not, but that has almost nothing to do with
cryptography.

Is it cryptography that keeps the host from reading guest private data
in the clear? Is it cryptography that keeps the host from reading guest
ciphertext? Does cryptography enforce the extra rules of Secure-EPT?

>>> 3. Memory hotplug
>>>
>>> The first generation of TDX architecturally doesn't support memory
>>> hotplug. And the first generation of TDX-capable platforms don't support
>>> physical memory hotplug. Since it physically cannot happen, this series
>>> doesn't add any check in ACPI memory hotplug code path to disable it.
>>>
>>> A special case of memory hotplug is adding NVDIMM as system RAM using
>>> kmem driver. However the first generation of TDX-capable platforms
>>> cannot enable TDX and NVDIMM simultaneously, so in practice this cannot
>>> happen either.
>>
>> What prevents today's code from being run on tomorrow's platforms and
>> breaking these assumptions?
>
> I forgot to add below (which is in the documentation patch):
>
> "This can be enhanced when a future generation of TDX starts to support ACPI
> memory hotplug, or NVDIMM and TDX can be enabled simultaneously on the
> same platform."
>
> Is this acceptable?

No, Kai.

You're basically saying: *this* code doesn't work with feature A, B and
C. Then, you're pivoting to say that it doesn't matter because one
version of Intel's hardware doesn't support A, B, or C.

I don't care about this *ONE* version of the hardware. I care about
*ALL* the hardware that this code will ever support. *ALL* the hardware
on which this code will run.

In 5 years, if someone takes this code and runs it on Intel hardware
with memory hotplug, CPU hotplug, NVDIMMs *AND* TDX support, what happens?

You can't just ignore the problems because they're not present on one
version of the hardware.

>>> Another case is admin can use 'memmap' kernel command line to create
>>> legacy PMEMs and use them as TD guest memory, or theoretically, can use
>>> kmem driver to add them as system RAM. To avoid having to change memory
>>> hotplug code to prevent this from happening, this series always includes
>>> legacy PMEMs when constructing TDMRs so they are also TDX memory.
>>>
>>> 4. CPU hotplug
>>>
>>> The first generation of TDX architecturally doesn't support ACPI CPU
>>> hotplug. All logical cpus are enabled by BIOS in MADT table. Also, the
>>> first generation of TDX-capable platforms don't support ACPI CPU hotplug
>>> either. Since this physically cannot happen, this series doesn't add any
>>> check in ACPI CPU hotplug code path to disable it.
>>>
>>> Also, only TDX module initialization requires all BIOS-enabled cpus to be
>>> online. After the initialization, any logical cpu can be brought down
>>> and brought up to online again later. Therefore this series doesn't
>>> change logical CPU hotplug either.
>>>
>>> 5. TDX interaction with kexec()
>>>
>>> If TDX is ever enabled and/or used to run any TD guests, the cachelines
>>> of TDX private memory, including PAMTs, used by TDX module need to be
>>> flushed before transitioning to the new kernel, otherwise they may silently
>>> corrupt the new kernel. Similar to SME, this series flushes cache in
>>> stop_this_cpu().
>>
>> What does this have to do with kexec()? What's a PAMT?
>
> The point is the dirty cachelines of TDX private memory must be flushed
> otherwise they may silently corrupt the new kexec()-ed kernel.
>
> Will use "TDX metadata" instead of "PAMT". The former has already been
> mentioned above.

Longer description for the patch itself:

TDX memory encryption is built on top of MKTME which uses physical
address aliases to designate encryption keys. This architecture is not
cache coherent. Software is responsible for flushing the CPU caches
when memory changes keys. When kexec()'ing, memory can be repurposed
from TDX use to non-TDX use, changing the effective encryption key.
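
To make the aliasing problem above concrete, here is a toy userspace model,
not kernel code. It ignores encryption entirely and only shows why a dirty
line left under one KeyID alias can overwrite data later written through
another alias. KEYID_SHIFT, the two-entry cache and every function name are
made up for illustration; real KeyID placement is platform-specific.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical layout: the KeyID lives in physical address bits 40 and up. */
#define KEYID_SHIFT 40
#define PA_MASK     ((UINT64_C(1) << KEYID_SHIFT) - 1)
#define MEM_LINES   4
#define CACHE_WAYS  2

static uint8_t dram[MEM_LINES];   /* one byte stands in for one cacheline */

struct cache_line {
	int      dirty;           /* doubles as "valid" in this toy model */
	uint64_t tag;             /* full address, KeyID bits included */
	uint8_t  data;
};
static struct cache_line cache[CACHE_WAYS];

/* Build the alias of physical address 'pa' that selects 'keyid'. */
static uint64_t alias(uint64_t pa, uint64_t keyid)
{
	return (keyid << KEYID_SHIFT) | (pa & PA_MASK);
}

/* Store that stays in the cache. The tag is the full alias, so the
 * KeyID 0 and KeyID 1 views of the same line never hit each other. */
static void cached_write(uint64_t addr, uint8_t val)
{
	for (int i = 0; i < CACHE_WAYS; i++) {
		if (cache[i].dirty && cache[i].tag == addr) {
			cache[i].data = val;
			return;
		}
	}
	for (int i = 0; i < CACHE_WAYS; i++) {
		if (!cache[i].dirty) {
			cache[i] = (struct cache_line){ 1, addr, val };
			return;
		}
	}
	assert(!"toy cache is full");
}

/* Write every dirty line back to DRAM and drop it (stands in for WBINVD,
 * and also for whatever eviction happens later on its own). */
static void writeback_all(void)
{
	for (int i = 0; i < CACHE_WAYS; i++) {
		if (!cache[i].dirty)
			continue;
		uint64_t line = cache[i].tag & PA_MASK;
		assert(line < MEM_LINES);
		dram[line] = cache[i].data;
		cache[i].dirty = 0;
	}
}

/* Returns what ends up in DRAM line 2 after the memory changes owners. */
uint8_t simulate(int flush_before_rekey)
{
	memset(dram, 0, sizeof(dram));
	memset(cache, 0, sizeof(cache));

	cached_write(alias(2, 1), 0x55);  /* old kernel: TDX private data, KeyID 1 */
	if (flush_before_rekey)
		writeback_all();
	dram[2] = 0xAA;                   /* new kernel: same line via KeyID 0 reaches DRAM */
	writeback_all();                  /* later: any leftover dirty line is evicted */
	return dram[2];
}
```

In this model simulate(1) returns 0xAA (the new owner's data survives) and
simulate(0) returns 0x55 (the stale KeyID 1 line lands on top of it). Real
hardware evicts at unpredictable times, which is why the corruption would be
silent.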

Cover-letter-level description:

Just like SME, TDX hosts require special cache flushing before kexec().

>>> uninitialized state so it can be initialized again.
>>>
>>> This implies:
>>>
>>> - If the old kernel fails to initialize TDX, the new kernel cannot
>>> use TDX either unless the new kernel fixes the bug which leads to
>>> initialization failure in the old kernel and can resume from where
>>> the old kernel stops. This requires certain coordination between
>>> the two kernels.
>>
>> OK, but what does this *MEAN*?
>
> This means we need to extend the information which the old kernel passes to the
> new kernel. But I don't think it's feasible. I'll refine this kexec() section
> to make it more concise next version.
>
>>
>>> - If the old kernel has initialized TDX successfully, the new kernel
>>> may be able to use TDX if the two kernels have exactly the same
>>> configurations on the TDX module. It further requires the new kernel
>>> to reserve the TDX metadata pages (allocated by the old kernel) in
>>> its page allocator. It also requires coordination between the two
>>> kernels. Furthermore, if kexec() is done when there are active TD
>>> guests running, the new kernel cannot use TDX because it's extremely
>>> hard for the old kernel to pass all TDX private pages to the new
>>> kernel.
>>>
>>> Given that, this series doesn't support TDX after kexec() (unless the
>>> old kernel doesn't attempt to initialize TDX at all).
>>>
>>> And this series doesn't shut down TDX module but leaves it open during
>>> kexec(). It is because shutting down TDX module requires CPU being in
>>> VMX operation but there's no guarantee of this during kexec(). Leaving
>>> the TDX module open is not the best case, but it is OK since the new
>>> kernel won't be able to use TDX anyway (therefore TDX module won't run
>>> at all).
>>
>> tl;dr: kexec() doesn't work with this code.
>>
>> Right?
>>
>> That doesn't seem good.
>
> It can work in my understanding. We just need to flush cache before booting to
> the new kernel.

What about all the concerns about TDX module configuration changing?