Re: [PATCH v8 8/9] x86/virt/tdx: Reduce TDMR's reserved areas by using CMRs to find memory holes

From: Huang, Kai
Date: Wed Dec 04 2024 - 09:22:49 EST


On Thu, 2024-11-14 at 00:57 +1300, Kai Huang wrote:
> A TDX module initialization failure was reported on a Emerald Rapids
> platform [*]:
>
> virt/tdx: initialization failed: TDMR [0x0, 0x80000000): reserved areas exhausted.
> virt/tdx: module initialization failed (-28)
>
> As part of initializing the TDX module, the kernel informs the TDX
> module of all "TDX-usable memory regions" using an array of TDX defined
> structure "TD Memory Region" (TDMR). Each TDMR must be in 1GB aligned
> and in 1GB granularity, and all "non-TDX-usable memory holes" within a
> given TDMR are marked as "reserved areas". The TDX module reports a
> maximum number of reserved areas that can be supported per TDMR (16).
>
> The kernel builds the "TDX-usable memory regions" based on memblocks
> (which reflects e820), and uses this list to find all "reserved areas"
> for each TDMR.
>
> It turns out that the kernel's view of memory holes is too fine grained
> and sometimes exceeds the number of holes that the TDX module can track
> per TDMR [1], resulting in the above failure.
>
> Thankfully the module also lists memory that is potentially convertible
> in a list of "Convertible Memory Regions" (CMRs). That coarser grained
> CMR list tends to track usable memory in the memory map even if it might
> be reserved for host usage like 'ACPI data' [2].
>
> Use that list to relax what the kernel considers unusable memory. If it
> falls in a CMR no need to instantiate a hole, and rely on the fact that
> kernel will keep what it considers 'reserved' out of the page allocator.
>

Hi Dave,

I realized there's one issue if we change to use CMRs to fill up reserved areas
for TDMRs after some internal discussion with Dan:

Currently we requires all memory regions in memblocks.memory to be TDX
convertible memory at the time of initializing the TDX module to make module
initialization successful. We build a list of those memory regions as a list of
"TDX memory blocks", and use them to construct the TDMRs to configure the
module. But we don't explicitly check those memory regions against CMRs to make
sure they are truly TDX convertible, instead we depend on TDH.SYS.CONFIG to
catch any non-CMR memory regions that end up to the TDX memory blocks.

This works fine because currently we use those TDX memory blocks to fill up the
reserved areas of TDMRs, i.e., any non-CMR regions in TDX memory blocks will end
up to "non-reserved areas" in TDMR(s), and this will cause TDH.SYS.CONFIG to
fail.

After we change to using CMRs to fill up reserved areas of TDMRs, then all non-
CMR regions will be marked as "reserved areas", regardless whether there is any
non-CMR memory region in the TDX memory blocks. This will result in TDX module
initialization being successful while there are non-CMR pages going into page
allocator.

To avoid this, we need to explicitly check all the TDX memory blocks against
CMRs to make sure they are actually TDX convertible, before using CMRs to fill
up reserved areas.

I ended up with below incremental diff with some additional text for the
changelog.

Do you have any comments?

------

Currently, the kernel does not explicitly check all TDX-usable memory
regions (that come from memblock) to be truly TDX convertible (not
feasible w/o CMRs anyway) but depends on the TDH.SYS.CONFIG to fail if
there's any non-CMR memory region ended up as TDX-usable memory.

After changing to using CMRs to fill up reserved areas, unfortunately
the TDH.SYS.CONFIG will no longer be able to catch such non-CMR regions.
This is because any non-CMR region will always end up with reserved
area even if it is included in TDX-usable memory regions.

To still make sure of that, explicitly check all TDX-usable memory
regions are truly TDX convertible against CMRs when stashing them from
the memblocks.

Also print an error message if any region is non-CMR so the users can
easily know the reason of the failure. This is nice to have anyway
because a clear message is better than having to decipher the error code
of the TDH.SYS.CONFIG.

diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index e4a7e0e17812..d25bdf8ca136 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -176,6 +176,23 @@ int tdx_cpu_enable(void)
}
EXPORT_SYMBOL_GPL(tdx_cpu_enable);

+/* Check whether a given memory region is a sub-region of any CMR. */
+static bool is_cmr_sub_region(unsigned long start_pfn, unsigned long end_pfn,
+ struct tdx_sys_info_cmr *sysinfo_cmr)
+{
+ int i;
+
+ for (i = 0; i < sysinfo_cmr->num_cmrs; i++) {
+ u64 cmr_base_pfn = sysinfo_cmr->cmr_base[i] >> PAGE_SHIFT;
+ u64 cmr_npages = sysinfo_cmr->cmr_size[i] >> PAGE_SHIFT;
+
+ if (start_pfn >= cmr_base_pfn &&
+ end_pfn <= (cmr_base_pfn + cmr_npages))
+ return true;
+ }
+
+ return false;
+}
/*
* Add a memory region as a TDX memory block. The caller must make sure
* all memory regions are added in address ascending order and don't
@@ -218,7 +235,8 @@ static void free_tdx_memlist(struct list_head *tmb_list)
* ranges off in a secondary structure because memblock is modified
* in memory hotplug while TDX memory regions are fixed.
*/
-static int build_tdx_memlist(struct list_head *tmb_list)
+static int build_tdx_memlist(struct list_head *tmb_list,
+ struct tdx_sys_info_cmr *sysinfo_cmr)
{
unsigned long start_pfn, end_pfn;
int i, nid, ret;
@@ -234,6 +252,27 @@ static int build_tdx_memlist(struct list_head *tmb_list)
if (start_pfn >= end_pfn)
continue;

+ /*
+ * Make sure the to-be-added memory region is truly TDX
+ * convertible memory.
+ *
+ * Note:
+ *
+ * The to-be-added memory region here doesn't cross NUMA
+ * nodes. The check here assumes the memory region does
+ * not cross any adjacent CMRs which are contiguous
+ * (i.e., the end of the first CMR is the start of the
+ * next) _AND_ are in the same NUMA node. A sane BIOS
+ * should never report memory regions and CMRs in such
+ * way.
+ */
+ if (!is_cmr_sub_region(start_pfn, end_pfn, sysinfo_cmr)) {
+ pr_err("memory region [0x%lx, 0x%lx) is not TDX
convertible memory.\n",
+ PHYS_PFN(start_pfn), PHYS_PFN(end_pfn));
+ ret = -EINVAL;
+ goto err;
+ }
+
/*
* Add the memory regions as TDX memory. The regions in
* memblock has already guaranteed they are in address
@@ -940,6 +979,15 @@ static int construct_tdmrs(struct list_head *tmb_list,
if (ret)
return ret;

+ /*
+ * Use CMRs instead of the TDX memory blocks to populate the
+ * reserved areas to reduce consumption of reserved areas for
+ * each TDMR. On some large systems (e.g., a machine with 4 or
+ * more sockets), the BIOS may report many usable memory regions
+ * in the first 1GB in e820. This may result in reserved areas
+ * of the first TDMR being exhausted if TDX memory blocks are
+ * used to fill up reserved areas.
+ */
ret = tdmrs_populate_rsvd_areas_all(tdmr_list, sysinfo_cmr,
sysinfo_tdmr->max_reserved_per_tdmr);
if (ret)
@@ -1124,7 +1172,7 @@ static int init_tdx_module(void)
*/
get_online_mems();

- ret = build_tdx_memlist(&tdx_memlist);
+ ret = build_tdx_memlist(&tdx_memlist, &sysinfo.cmr);
if (ret)
goto out_put_tdxmem;