[PATCH v2 15/18] kho: extend scratch
From: Pratyush Yadav
Date: Fri Jun 05 2026 - 14:46:49 EST
From: "Pratyush Yadav (Google)" <pratyush@xxxxxxxxxx>
Motivation
==========
The scratch space is allocated by the first kernel in the KHO chain, and
is reused by all subsequent kernels. The size of the space is either set
via the commandline by the system administrator or by calculating the
amount of memory used by the kernel and adding a multiplier. In either
case, the scratch space is a heuristic and is liable to fill up and fail
allocation if a kernel uses more memory than expected.
In addition, gigantic huge pages (usually 1 GiB) are allocated via
memblock, and in a KHO boot that memory comes from the scratch space. In
hypervisors it is common to dedicate a major part of the system's memory
to gigantic hugepages for VM memory.
If this memory needs to come from scratch space, then scratch needs to
be greater than the memory needed for huge pages, which is impractical.
In addition, hugepages can be preserved memory. Allocating them from
scratch violates the assumption that scratch contains no preserved
memory.
Methodology
===========
Discover areas that don't contain any preserved memory at boot by
walking the preserved memory radix tree. Mark them as scratch to allow
allocations from them. This makes KHO more resilient to memory pressure
and allows supporting huge page preservation.
Since the preserved memory radix tree mixes both physical address and
order into a single key, and does not track table pages, it is difficult
to identify free areas from it directly. Walk the tree and digest it
down into another radix tree. The latter tracks blocks of
KHO_EXT_SHIFT (1 GiB as of now) granularity. Then walk the digested tree
and mark the areas between the present keys as scratch.
Performance
===========
The discovery algorithm traverses the preserved memory radix tree
exactly once. While it does use memory for the digested radix tree,
since the blocks are split by 1 GiB, a single bitmap with 4k pages can
track up to 32 TiB of memory. So there are likely to be very few radix
tree pages used in this tracking. For systems with all physical memory
below 32 TiB, this should result in a total of 6 pages being
used (KHO_TREE_MAX_DEPTH == 6).
An alternate way of achieving this would be to call kho_mem_retrieve()
earlier in boot and mark all the KHO preservations as reserved. But that
can blow up memblock.reserved with a bunch of 4K pages scattered
everywhere, which will reduce performance of subsequent allocations.
Since the free blocks are tracked in chunks of 1 GiB, this won't blow up
memblock.memory as much.
There is no inherent reason for using 1 GiB as the discovered block
size. This can be changed later if needed. Currently, KHO is mainly
targeted for server grade systems with hundreds of gigabytes to
terabytes of memory. So 1 GiB is a reasonable granularity for those
systems. For smaller systems this doesn't work as well, but we can
arrive at a better heuristic when we have concrete use cases.
Practical evaluation
====================
The testing is done on a x86_64 qemu VM running under KVM with 64G
memory and 12 CPUs. The machine pre-allocates 50 1G pages.
Since the performance scales with how busy the radix tree is, tests are
done with 2 preservation patterns: first with two 1M memfds, second with
two 1G memfds, both using 4k pages.
Test case 1 - 1M memfd
~~~~~~~~~~~~~~~~~~~~~~
This test case has two memfds with 1M memory each in 4k pages, plus
other preservations from LUO core and other KHO users.
This is how the radix tree stats look like:
radix_nodes: 0x13
nr_preservations: 0x214
mem_preserved: 0x227000
per order preservations:
order 0: 0x20f
order 1: 0x4
order 4: 0x1
and this is how long it takes to extend the scratch after KHO boot:
KHO: KHO extend time: 47 us
KHO: KHO extend total mem: 0xe6c17b000 (~57G)
Test case 2 - 1G memfd
~~~~~~~~~~~~~~~~~~~~~~
This test case has two memfds with 1G memory each in 4k pages, plus
other preservations from LUO core and other KHO users.
This is how the radix tree stats look like:
radix_nodes: 0x28
nr_preservations: 0x80816
mem_preserved: 0x80829000
per order preservations:
order 0: 0x80811
order 1: 0x4
order 4: 0x1
and this is how long it takes to extend the scratch after KHO boot:
KHO: KHO extend time: 22514 us
KHO: KHO extend total mem: 0xd3f200000 (~52G)
Signed-off-by: Pratyush Yadav (Google) <pratyush@xxxxxxxxxx>
---
Notes:
As one might notice, the "scratch" terminology starts to break down
here. There is the original "scratch", which is passed down by the
previous kernel. It is marked MEMBLOCK_KHO_SCRATCH. There is also the
discovered "scratch", which also gets marked MEMBLOCK_KHO_SCRATCH, but
has nothing to do with the former.
For limiting the scope of this series, I haven't done the rename here. I
can do it as a follow up series once this stabilizes and lands into
-next.
I suggest the following scheme:
- Rename "KHO scratch" to "KHO bootmem". Update the documentation and
all code to use this name. We have the kho_scratch kernel cmdline
parameter, which is harder to change, but perhaps we can rename it to
"kho_bootmem" and if someone complains we can add it back.
- Rename MEMBLOCK_KHO_SCRATCH to MEMBLOCK_KHO_NOPRSRV. This describes
the property of the memory not its origin. Then KHO can mark its
"bootmem" as KHO_NOPRSRV because bootmem never has any preserved
memory. Later, kho_extend_scratch() (which is also due for a better
name) can also mark its discovered areas as KHO_NOPRSRV.
kernel/liveupdate/kexec_handover.c | 149 +++++++++++++++++++++++++----
1 file changed, 132 insertions(+), 17 deletions(-)
diff --git a/kernel/liveupdate/kexec_handover.c b/kernel/liveupdate/kexec_handover.c
index af22086ca2d6..8540608b8602 100644
--- a/kernel/liveupdate/kexec_handover.c
+++ b/kernel/liveupdate/kexec_handover.c
@@ -84,6 +84,23 @@ static struct kho_out kho_out = {
},
};
+struct kho_in {
+ phys_addr_t fdt_phys;
+ phys_addr_t scratch_phys;
+ char previous_release[__NEW_UTS_LEN + 1];
+ u32 kexec_count;
+ struct kho_debugfs dbg;
+ struct kho_radix_tree radix_tree;
+};
+
+static struct kho_in kho_in = {
+};
+
+static const void *kho_get_fdt(void)
+{
+ return kho_in.fdt_phys ? phys_to_virt(kho_in.fdt_phys) : NULL;
+}
+
/**
* kho_encode_radix_key - Encodes a physical address and order into a radix key.
* @phys: The physical address of the page.
@@ -869,6 +886,119 @@ static void __init kho_reserve_scratch(void)
kho_enable = false;
}
+#define KHO_EXT_SHIFT 30 /* 1 GiB */
+
+static int __init kho_ext_walk_key(unsigned long key, void *data)
+{
+ struct kho_radix_tree *tree = data;
+ phys_addr_t start, end;
+ unsigned int order;
+ int err;
+
+ start = kho_decode_radix_key(key, &order);
+ end = start + (1UL << (order + PAGE_SHIFT));
+
+ while (start < end) {
+ err = kho_radix_add_key(tree, start >> KHO_EXT_SHIFT);
+ if (err)
+ return err;
+
+ start += (1UL << KHO_EXT_SHIFT);
+ }
+
+ return 0;
+}
+
+static int __init kho_ext_walk_node(phys_addr_t phys, void *data)
+{
+ struct kho_radix_tree *tree = data;
+
+ return kho_radix_add_key(tree, phys >> KHO_EXT_SHIFT);
+}
+
+static int __init kho_ext_mark_scratch(unsigned long key, void *data)
+{
+ phys_addr_t *prev_end = data;
+ phys_addr_t start = key << KHO_EXT_SHIFT;
+ int err;
+
+ if (start > *prev_end) {
+ err = memblock_mark_kho_scratch(*prev_end, start - *prev_end);
+ if (err)
+ return err;
+ }
+
+ *prev_end = start + (1UL << KHO_EXT_SHIFT);
+ return 0;
+}
+
+/**
+ * kho_extend_scratch - Extend the scratch regions
+ *
+ * The KHO radix tree mixes both physical address and order into a single key.
+ * This makes it hard to look for free ranges directly. This function first
+ * walks the radix tree and digests it down into another radix tree, whose keys
+ * identify blocks of KHO_EXT_SHIFT which contain preserved memory.
+ *
+ * Then it walks the digested radix tree and marks everything that doesn't have
+ * preserved memory as scratch.
+ *
+ * NOTE: This function allocates memory so it should be called when scratch has
+ * available space.
+ *
+ * NOTE: The pages of the KHO radix tree tables are not marked as preserved in
+ * the KHO tree. But they are expected to remain untouched until the tree is
+ * fully parsed. So this function also considers them to be "preserved memory"
+ * and marks their blocks as busy.
+ */
+static void __init kho_extend_scratch(void)
+{
+ const struct kho_radix_walk_cb kho_cb = {
+ .leaf = kho_ext_walk_key,
+ .node = kho_ext_walk_node,
+ };
+ const struct kho_radix_walk_cb ext_cb = {
+ .leaf = kho_ext_mark_scratch,
+ };
+ struct kho_radix_tree radix;
+ phys_addr_t prev_end = 0;
+ int err = 0;
+
+ if (!is_kho_boot())
+ return;
+
+ /* Make sure the KHO radix tree is initialized. */
+ err = kho_radix_init_tree(&kho_in.radix_tree,
+ kho_get_mem_map(kho_get_fdt()));
+ if (err)
+ goto print;
+
+ err = kho_radix_init_tree(&radix, NULL);
+ if (err)
+ goto print;
+
+ /* Walk the KHO radix tree to find busy blocks. */
+ err = kho_radix_walk_tree(&kho_in.radix_tree, &kho_cb, &radix);
+ if (err)
+ goto out;
+
+ /* Walk the blocks and mark everything between keys as scratch. */
+ err = kho_radix_walk_tree(&radix, &ext_cb, &prev_end);
+ if (err)
+ goto out;
+
+ /* Mark everything from last busy block to end of DRAM. */
+ if (prev_end < memblock_end_of_DRAM())
+ err = memblock_mark_kho_scratch(prev_end, memblock_end_of_DRAM() - prev_end);
+
+ /* fallthrough */
+out:
+ kho_radix_destroy_tree(&radix);
+print:
+ if (err)
+ pr_err("Failed to extend scratch: %pe\n", ERR_PTR(err));
+}
+
/**
* kho_add_subtree - record the physical address of a sub blob in KHO root tree.
* @name: name of the sub tree.
@@ -1443,23 +1573,6 @@ void kho_restore_free(void *mem)
}
EXPORT_SYMBOL_GPL(kho_restore_free);
-struct kho_in {
- phys_addr_t fdt_phys;
- phys_addr_t scratch_phys;
- char previous_release[__NEW_UTS_LEN + 1];
- u32 kexec_count;
- struct kho_debugfs dbg;
- struct kho_radix_tree radix_tree;
-};
-
-static struct kho_in kho_in = {
-};
-
-static const void *kho_get_fdt(void)
-{
- return kho_in.fdt_phys ? phys_to_virt(kho_in.fdt_phys) : NULL;
-}
-
/**
* is_kho_boot - check if current kernel was booted via KHO-enabled
* kexec
@@ -1763,6 +1876,8 @@ void __init kho_memory_init_early(void)
*/
if (kho_in.scratch_phys)
kho_scratch = phys_to_virt(kho_in.scratch_phys);
+
+ kho_extend_scratch();
}
void __init kho_memory_init(void)
--
2.54.0.1032.g2f8565e1d1-goog