Re: [PATCH v3 00/21] TDX host kernel support

From: Kai Huang
Date: Tue Apr 26 2022 - 21:15:15 EST


On Tue, 2022-04-26 at 13:13 -0700, Dave Hansen wrote:
> On 4/5/22 21:49, Kai Huang wrote:
> > SEAM VMX root operation is designed to host a CPU-attested, software
> > module called the 'TDX module' which implements functions to manage
> > crypto protected VMs called Trust Domains (TD). SEAM VMX root is also
>
> "crypto protected"? What the heck is that?

How about "crypto-protected"? I googled it, and the term seems to be used
elsewhere.

>
> > designed to host a CPU-attested, software module called the 'Intel
> > Persistent SEAMLDR (Intel P-SEAMLDR)' to load and update the TDX module.
> >
> > Host kernel transits to either the P-SEAMLDR or the TDX module via a new
>
> ^ The

Thanks.

>
> > SEAMCALL instruction. SEAMCALLs are host-side interface functions
> > defined by the P-SEAMLDR and the TDX module around the new SEAMCALL
> > instruction. They are similar to a hypercall, except they are made by
> > host kernel to the SEAM software modules.
>
> This is still missing some important high-level things, like that the
> TDX module is protected from the untrusted VMM. Heck, it forgets to
> mention that the VMM itself is untrusted and the TDX module replaces
> things that the VMM usually does.
>
> It would also be nice to mention here how this compares with SEV-SNP.
> Where is the TDX module in that design? Why doesn't SEV need all this code?
>
> > TDX leverages Intel Multi-Key Total Memory Encryption (MKTME) to crypto
> > protect TD guests. TDX reserves part of MKTME KeyID space as TDX private
> > KeyIDs, which can only be used by software runs in SEAM. The physical
>
> ^ which

Thanks.

>
> > address bits for encoding TDX private KeyID are treated as reserved bits
> > when not in SEAM operation. The partitioning of MKTME KeyIDs and TDX
> > private KeyIDs is configured by BIOS.
> >
> > Before being able to manage TD guests, the TDX module must be loaded
> > and properly initialized using SEAMCALLs defined by TDX architecture.
> > This series assumes both the P-SEAMLDR and the TDX module are loaded by
> > BIOS before the kernel boots.
> >
> > There's no CPUID or MSR to detect either the P-SEAMLDR or the TDX module.
> > Instead, detecting them can be done by using P-SEAMLDR's SEAMLDR.INFO
> > SEAMCALL to detect P-SEAMLDR. The success of this SEAMCALL means the
> > P-SEAMLDR is loaded. The P-SEAMLDR information returned by this
> > SEAMCALL further tells whether TDX module is loaded.
>
> There's a bit of information missing here. The kernel might not know
> the state of things being loaded. A previous kernel might have loaded
> it and left it in an unknown state.
>
> > The TDX module is initialized in multiple steps:
> >
> > 1) Global initialization;
> > 2) Logical-CPU scope initialization;
> > 3) Enumerate the TDX module capabilities;
> > 4) Configure the TDX module about usable memory ranges and
> > global KeyID information;
> > 5) Package-scope configuration for the global KeyID;
> > 6) Initialize TDX metadata for usable memory ranges based on 4).
> >
> > Step 2) requires calling some SEAMCALL on all "BIOS-enabled" (in MADT
> > table) logical cpus, otherwise step 4) will fail. Step 5) requires
> > calling SEAMCALL on at least one cpu on all packages.
> >
> > TDX module can also be shut down at any time during module's lifetime, by
> > calling SEAMCALL on all "BIOS-enabled" logical cpus.
> >
> > == Design Considerations ==
> >
> > 1. Lazy TDX module initialization on-demand by caller
>
> This doesn't really tell us what "lazy" is or what the alternatives are.
>
> There are basically two ways the TDX module could be loaded. Either:
> * In early boot
> or
> * At runtime just before the first TDX guest is run
>
> This series implements the runtime loading.

OK, will do.

>
> > None of the steps in the TDX module initialization process must be done
> > during kernel boot. This series doesn't initialize TDX at boot time, but
> > instead, provides two functions to allow caller to detect and initialize
> > TDX on demand:
> >
> > if (tdx_detect())
> > goto no_tdx;
> > if (tdx_init())
> > goto no_tdx;
> >
> > This approach has below pros:
> >
> > 1) Initializing the TDX module requires to reserve ~1/256th system RAM as
> > metadata. Enabling TDX on demand allows only to consume this memory when
> > TDX is truly needed (i.e. when KVM wants to create TD guests).
> >
> > 2) Both detecting and initializing the TDX module require calling
> > SEAMCALL. However, SEAMCALL requires CPU being already in VMX operation
> > (VMXON has been done). So far, KVM is the only user of TDX, and it
> > already handles VMXON/VMXOFF. Therefore, letting KVM to initialize TDX
> > on-demand avoids handling VMXON/VMXOFF (which is not that trivial) in
> > core-kernel. Also, in long term, likely a reference based VMXON/VMXOFF
> > approach is needed since more kernel components will need to handle
> > VMXON/VMXONFF.
> >
> > 3) It is more flexible to support "TDX module runtime update" (not in
> > this series). After updating to the new module at runtime, kernel needs
> > to go through the initialization process again. For the new module,
> > it's possible the metadata allocated for the old module cannot be reused
> > for the new module, and needs to be re-allocated again.
> >
> > 2. Kernel policy on TDX memory
> >
> > Host kernel is responsible for choosing which memory regions can be used
> > as TDX memory, and configuring those memory regions to the TDX module by
> > using an array of "TD Memory Regions" (TDMR), which is a data structure
> > defined by TDX architecture.
>
>
> This is putting the cart before the horse. Don't define the details up
> front.
>
> The TDX architecture allows the VMM to designate specific memory
> as usable for TDX private memory. This series chooses to
> designate _all_ system RAM as TDX to avoid having to modify the
> page allocator to distinguish TDX and non-TDX-capable memory
>
> ... then go on to explain the details.

Thanks. Will update.

>
> > The first generation of TDX essentially guarantees that all system RAM
> > memory regions (excluding the memory below 1MB) can be used as TDX
> > memory. To avoid having to modify the page allocator to distinguish TDX
> > and non-TDX allocation, this series chooses to use all system RAM as TDX
> > memory.
> >
> > E820 table is used to find all system RAM entries. Following
> > e820__memblock_setup(), both E820_TYPE_RAM and E820_TYPE_RESERVED_KERN
> > types are treated as TDX memory, and contiguous ranges in the same NUMA
> > node are merged together (similar to memblock_add()) before trimming the
> > non-page-aligned part.
>
> This e820 cruft is too much detail for a cover letter. In general, once
> you start talking about individual functions, you've gone too far in the
> cover letter.

Will remove.

>
> > 3. Memory hotplug
> >
> > The first generation of TDX architecturally doesn't support memory
> > hotplug. And the first generation of TDX-capable platforms don't support
> > physical memory hotplug. Since it physically cannot happen, this series
> > doesn't add any check in ACPI memory hotplug code path to disable it.
> >
> > A special case of memory hotplug is adding NVDIMM as system RAM using
> > kmem driver. However the first generation of TDX-capable platforms
> > cannot enable TDX and NVDIMM simultaneously, so in practice this cannot
> > happen either.
>
> What prevents this code from today's code being run on tomorrow's
> platforms and breaking these assumptions?

I forgot to add the below (it is already in the documentation patch):

"This can be enhanced when future generation of TDX starts to support ACPI
memory hotplug, or NVDIMM and TDX can be enabled simultaneously on the
same platform."

Is this acceptable?

>
> > Another case is admin can use 'memmap' kernel command line to create
> > legacy PMEMs and use them as TD guest memory, or theoretically, can use
> > kmem driver to add them as system RAM. To avoid having to change memory
> > hotplug code to prevent this from happening, this series always include
> > legacy PMEMs when constructing TDMRs so they are also TDX memory.
> >
> > 4. CPU hotplug
> >
> > The first generation of TDX architecturally doesn't support ACPI CPU
> > hotplug. All logical cpus are enabled by BIOS in MADT table. Also, the
> > first generation of TDX-capable platforms don't support ACPI CPU hotplug
> > either. Since this physically cannot happen, this series doesn't add any
> > check in ACPI CPU hotplug code path to disable it.
> >
> > Also, only TDX module initialization requires all BIOS-enabled cpus are
> > online. After the initialization, any logical cpu can be brought down
> > and brought up to online again later. Therefore this series doesn't
> > change logical CPU hotplug either.
> >
> > 5. TDX interaction with kexec()
> >
> > If TDX is ever enabled and/or used to run any TD guests, the cachelines
> > of TDX private memory, including PAMTs, used by TDX module need to be
> > flushed before transiting to the new kernel otherwise they may silently
> > corrupt the new kernel. Similar to SME, this series flushes cache in
> > stop_this_cpu().
>
> What does this have to do with kexec()? What's a PAMT?

The point is that the dirty cachelines of TDX private memory must be flushed
before kexec(); otherwise they may silently corrupt the new kexec()-ed kernel.

Will use "TDX metadata" instead of "PAMT". The former has already been
mentioned above.

>
> > The TDX module can be initialized only once during its lifetime. The
> > first generation of TDX doesn't have interface to reset TDX module to
>
> ^ an

Thanks.

>
> > uninitialized state so it can be initialized again.
> >
> > This implies:
> >
> > - If the old kernel fails to initialize TDX, the new kernel cannot
> > use TDX too unless the new kernel fixes the bug which leads to
> > initialization failure in the old kernel and can resume from where
> > the old kernel stops. This requires certain coordination between
> > the two kernels.
>
> OK, but what does this *MEAN*?

This means we would need to extend the information that the old kernel passes
to the new kernel, but I don't think that is feasible. I'll refine this kexec()
section to make it more concise in the next version.

>
> > - If the old kernel has initialized TDX successfully, the new kernel
> > may be able to use TDX if the two kernels have the exactly same
> > configurations on the TDX module. It further requires the new kernel
> > to reserve the TDX metadata pages (allocated by the old kernel) in
> > its page allocator. It also requires coordination between the two
> > kernels. Furthermore, if kexec() is done when there are active TD
> > guests running, the new kernel cannot use TDX because it's extremely
> > hard for the old kernel to pass all TDX private pages to the new
> > kernel.
> >
> > Given that, this series doesn't support TDX after kexec() (except the
> > old kernel doesn't attempt to initialize TDX at all).
> >
> > And this series doesn't shut down TDX module but leaves it open during
> > kexec(). It is because shutting down TDX module requires CPU being in
> > VMX operation but there's no guarantee of this during kexec(). Leaving
> > the TDX module open is not the best case, but it is OK since the new
> > kernel won't be able to use TDX anyway (therefore TDX module won't run
> > at all).
>
> tl;dr: kexec() doesn't work with this code.
>
> Right?
>
> That doesn't seem good.

In my understanding, it can work. We just need to flush the cache before
booting into the new kernel.
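
To illustrate what I mean, below is a rough userspace model of the flush
decision made in stop_this_cpu() (a sketch only, not the code in this series).
tdx_may_have_dirty_cache, fake_wbinvd() and stop_this_cpu_flush() are made-up
names for illustration, and fake_wbinvd() only counts calls because WBINVD
cannot be executed from userspace:

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical flag: true once TDX may have been enabled/used, i.e. when
 * dirty cachelines of TDX private memory may exist. The real condition used
 * by the kernel is not shown here.
 */
bool tdx_may_have_dirty_cache;

/* Number of simulated WBINVD executions. */
int wbinvd_count;

/* Stand-in for native_wbinvd(); it only records that a flush happened. */
void fake_wbinvd(void)
{
	wbinvd_count++;
}

/*
 * Model of the per-CPU flush decision before the CPU is stopped: flush when
 * SME is active (the existing case) or when TDX private memory may have
 * dirty cachelines.
 */
void stop_this_cpu_flush(bool sme_active)
{
	if (sme_active || tdx_may_have_dirty_cache)
		fake_wbinvd();
}
```

In this model the flush is keyed on a single "TDX may have been used" flag
rather than on tracking individual private pages, which keeps the kexec() path
simple at the cost of a possibly unnecessary WBINVD.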


--
Thanks,
-Kai