Alexander Graf <graf@xxxxxxxxxx> writes:
> Kexec today considers itself purely a boot loader: When we enter the new
> kernel, any state the previous kernel left behind is irrelevant and the
> new kernel reinitializes the system.
> However, there are use cases where this mode of operation is not what we
> actually want. In virtualization hosts for example, we want to use kexec
> to update the host kernel while virtual machine memory stays untouched.
> When we add device assignment to the mix, we also need to ensure that
> IOMMU and VFIO states are untouched. If we add PCIe peer to peer DMA, we
> need to do the same for the PCI subsystem. If we want to kexec while an
> SEV-SNP enabled virtual machine is running, we need to preserve the VM
> context pages and physical memory. See James' and my Linux Plumbers
> Conference 2023 presentation for details:
> https://lpc.events/event/17/contributions/1485/
> To start us on the journey to support all the use cases above, this
> patch implements basic infrastructure to allow hand over of kernel state
> across kexec (Kexec HandOver, aka KHO). As an example target, we use ftrace:
> With this patch set applied, you can read ftrace records from the
> pre-kexec environment in your post-kexec one. This creates a very powerful
> debugging and performance analysis tool for kexec. It's also slightly
> easier to reason about than full blown VFIO state preservation.
> == Alternatives ==
> There are alternative approaches to (parts of) the problems above:
> * Memory Pools [1] - preallocated persistent memory region + allocator
> * PRMEM [2] - resizable persistent memory regions with fixed metadata
>   pointer on the kernel command line + allocator
> * Pkernfs [3] - preallocated file system for in-kernel data with fixed
>   address location on the kernel command line
> * PKRAM [4] - handover of user space pages using a fixed metadata page
>   specified via command line
> All of the approaches above fundamentally have the same problem: They
> require the administrator to explicitly carve out a physical memory
> location because they have no mechanism outside of the kernel command
> line to pass data (including memory reservations) between kexec'ing
> kernels.
> KHO provides that base foundation. We will determine later whether we
> still need any of the approaches above for fast bulk memory handover of,
> for example, IOMMU page tables. But IMHO they would all be users of KHO, with
> KHO providing the foundational primitive to pass metadata and bulk memory
> reservations as well as provide easy versioning for data.

What you are describing is in many ways the same problem as
kexec-on-panic. The goal of leaving devices running absolutely requires
carving out memory for the new kernel to live in while it is coming up
so that DMA from a device that was not shut down does not stomp on the
kernel coming up.
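To make the carve-out concrete, here is a minimal sketch of the overlap check that such a reservation implies. It assumes the reserved region is visible as a "Crash kernel" entry in /proc/iomem, which is how the kexec-on-panic reservation shows up on a typical system. The sample text, its addresses, and the helper names are made up for illustration and are not part of any kernel or kexec-tools API.

```python
import re

# Sample text in the format of /proc/iomem; the addresses are made up.
SAMPLE_IOMEM = """\
00001000-0009ffff : System RAM
00100000-7fffffff : System RAM
  2b000000-3affffff : Crash kernel
100000000-47fffffff : System RAM
"""

def crash_kernel_ranges(iomem_text):
    """Return inclusive (start, end) pairs for every 'Crash kernel' entry."""
    ranges = []
    for line in iomem_text.splitlines():
        m = re.match(r"\s*([0-9a-fA-F]+)-([0-9a-fA-F]+) : Crash kernel\s*$", line)
        if m:
            ranges.append((int(m.group(1), 16), int(m.group(2), 16)))
    return ranges

def dma_hits_reservation(addr, length, ranges):
    """True if the buffer [addr, addr + length) overlaps any reserved range."""
    last = addr + length - 1
    return any(addr <= end and last >= start for start, end in ranges)

ranges = crash_kernel_ranges(SAMPLE_IOMEM)
# A buffer that straddles the end of the reservation overlaps it.
print(dma_hits_reservation(0x3AFFF000, 0x2000, ranges))
# A buffer that starts right after the reservation does not.
print(dma_hits_reservation(0x3B000000, 0x1000, ranges))
```

On a real system the text would come from reading /proc/iomem as root (unprivileged reads show zeroed addresses). The point is only that a device left running must never be handed a buffer for which this check is true.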
If I understand the virtualization case, some of those virtual machines
are going to have virtual NICs that are going to want to DMA memory to
the host system. Which, if I understand things correctly, means that
among the devices you explicitly want to keep running there is not
a way to avoid the chance of DMA coming in while the kernel is being
changed.
There is also a huge maintenance challenge associated with all of this.
If you go with something that is essentially kexec-on-panic and then
add a little bit to help find things in the memory of the previous
kernel while the new kernel is coming up, I can see it as a possibility.
As an example, I think preserving ftrace data across kexec seems bizarre.
I don't see how that is an interesting use case at all. Not in
the situation of preserving virtual machines, and not in the situation
of kexec on panic.
If you are doing an orderly shutdown and kernel switch you should be
able to manually change the memory. If you are not doing an orderly
shutdown then I really don't get it.
I don't hate the capability you are trying to build.
I have not read or looked at most of this so I am probably
missing subtle details.
As you are currently describing things I have the sense you have
completely misframed the problem and are trying to solve the wrong parts
of the problem.