Re: [RFC] proposal: KVM: Orphaned VMs: The Caretaker approach for Live Update

From: Paolo Bonzini

Date: Thu Apr 30 2026 - 09:30:00 EST

I have some very similar observations to Alex and some very similar observations to David. This has to imply that everyone will agree with me. :)

Seriously, the main contention point, from reading the thread, is the placement and lifecycle of the caretaker. More on this later...

On 4/29/26 00:29, Pasha Tatashin wrote:

While this proposal focuses on its critical role in minimally disruptive
Live Update, the Caretaker is fundamentally designed as an extensible
primitive. Its architecture allows it to be leveraged for a variety of
other advanced virtualization use cases, such as running custom
lightweight hypervisors or completely offloading virtualization duties
to an accelerator card.

One step at a time please---and as an initial step, just place it inside the kernel, a la Arm nVHE.

Since your design would have the ability to update the caretaker anyway, you can embed that part into the reattachment process, so that the new kernel can use its own caretaker.

This greatly reduces the need to establish a stable-ish ABI. Only the handover (kexec/LUO) needs to be stable, so that the new kernel can populate its kvm and kvm_vcpu structs. And for that we mostly have a solution already: a stream of serialized ioctls.
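To make "a stream of serialized ioctls" a bit more concrete, here is a minimal userspace sketch of a length-prefixed record stream. The record layout (struct ioctl_rec), the helper names rec_put()/rec_get() and any command numbers fed to them are assumptions made for illustration only; this is not an existing KVM, kexec or LUO format.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical record header: one replayable ioctl per record. */
struct ioctl_rec {
	uint32_t cmd;	/* ioctl number the new kernel would replay */
	uint32_t len;	/* number of payload bytes following the header */
};

/*
 * Append one record at offset 'off'.
 * Returns the new end offset, or 0 if the record does not fit.
 */
static size_t rec_put(uint8_t *buf, size_t cap, size_t off,
		      uint32_t cmd, const void *payload, uint32_t len)
{
	struct ioctl_rec hdr = { .cmd = cmd, .len = len };

	if (off > cap || cap - off < sizeof(hdr) ||
	    cap - off - sizeof(hdr) < len)
		return 0;

	memcpy(buf + off, &hdr, sizeof(hdr));
	if (len)
		memcpy(buf + off + sizeof(hdr), payload, len);
	return off + sizeof(hdr) + len;
}

/*
 * Read the record at '*off' and advance '*off' past it.
 * Returns a pointer to the payload, or NULL at end of stream or
 * if the record is truncated.
 */
static const uint8_t *rec_get(const uint8_t *buf, size_t end, size_t *off,
			      struct ioctl_rec *hdr)
{
	const uint8_t *payload;

	if (*off > end || end - *off < sizeof(*hdr))
		return NULL;
	memcpy(hdr, buf + *off, sizeof(*hdr));
	if (end - *off - sizeof(*hdr) < hdr->len)
		return NULL;

	payload = buf + *off + sizeof(*hdr);
	*off += sizeof(*hdr) + hdr->len;
	return payload;
}
```

The point of the sketch is only that the stable surface is the stream itself: the old kernel appends records, the new kernel walks them and replays each one against its own structs, so neither side needs to know the other's in-memory layout.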

During the execution of the KVM_SET_CARETAKER ioctl, instead of
pointing the hardware's return path to standard KVM entry points (e.g.,
vmx_vmexit or svm_vcpu_run), KVM reprograms the host-state return area
of the CPU's hardware virtualization control structures (e.g., Intel
VMCS, AMD VMCB, or ARM equivalent) to point directly into the
bare-metal Caretaker environment.

This can be done unconditionally for all VMs based on a module parameter, again as in Arm nVHE.

Note on Optimization vs. Security: Constantly switching the page table
(CR3) on every VM Exit can be expensive due to TLB flushing. To
optimize performance, the Caretaker can share the host kernel's page
tables while the kernel is still around, and dynamically replace
HOST_CR3 with the dedicated, isolated page tables only when the vCPU is
orphaned (during the detachment phase). On the other hand, maintaining
a permanently isolated CR3 for the Caretaker adds a strong security
boundary, achieving hardware-enforced separation similar to KVM Address
Space Isolation (ASI).

Agreed on this.

The Caretaker requires a defined ABI to communicate with the host KVM
subsystem. This ABI is implemented via the shared, identity-mapped .ccb
section of the ELF payload, acting as the Caretaker Control Block
(CCB).

The CCB acts as the source of truth for the Caretaker's execution loop
and contains three primary elements:

* Attachment State Flag: An atomic variable indicating the current
relationship with the host KVM subsystem (e.g., KVM_ATTACHED or
KVM_DETACHED).

This must be done atomically at the time Linux offlines/onlines a pCPU. The interface from Linux to the caretaker must use some kind of IPI so that the new kernel can force a VMEXIT (if needed) in the caretaker, ask it to serialize the VM state, and pass it down to the new kernel's caretaker.

* KVM Routing Pointers: The physical function pointers that the
Caretaker uses to safely jump into the host KVM's standard VM Exit
handlers when operating in normal mode.
* Shared Configuration Metadata: A physical pointer to dedicated
memory pages used by the kernel to share dynamic vCPU configuration
data with the Caretaker. Because every guest is configured
differently, KVM populates these pages with the specific parameters
negotiated during VM initialization (such as CPUID feature masks,
APIC routing, and timer states). These pages also include a
pre-allocated Telemetry Buffer for the Caretaker to log VM Exits
and spin-wait durations. These dedicated pages are explicitly
preserved across the host reboot via KHO, ensuring the Caretaker
maintains continuous access to the exact context required to
accurately emulate trivial exits during the gap.

All this is mostly unnecessary if the caretaker is provided by the kernel. The recently introduced remote ring buffers can be used for tracing too.
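For whatever part of the attachment flag and routing pointer does survive, the ordering requirement can be sketched with C11 atomics. Everything here is hypothetical (struct ccb is trimmed to two fields, and ccb_attach()/ccb_detach()/ccb_route() are invented names); it assumes a single host-side writer and only shows the publish/consume ordering, not a real handover.

```c
#include <stdatomic.h>
#include <stdint.h>

enum { KVM_DETACHED = 0, KVM_ATTACHED = 1 };

/* Hypothetical, trimmed-down control block. */
struct ccb {
	_Atomic int state;
	uint64_t exit_handler_pa;	/* only meaningful while attached */
};

/*
 * Host side: publish the routing pointer first, then flip the flag
 * with release semantics so a reader that sees KVM_ATTACHED also
 * sees the pointer. Returns 1 on success, 0 if already attached.
 */
static int ccb_attach(struct ccb *c, uint64_t handler_pa)
{
	if (atomic_load_explicit(&c->state, memory_order_relaxed) != KVM_DETACHED)
		return 0;
	c->exit_handler_pa = handler_pa;
	atomic_store_explicit(&c->state, KVM_ATTACHED, memory_order_release);
	return 1;
}

/* Host side: returns 1 if this call performed the detach, 0 otherwise. */
static int ccb_detach(struct ccb *c)
{
	int expected = KVM_ATTACHED;

	return atomic_compare_exchange_strong_explicit(&c->state, &expected,
			KVM_DETACHED, memory_order_acq_rel,
			memory_order_relaxed);
}

/*
 * Caretaker side: returns the handler address if attached, or 0 if
 * the exit has to be handled (or waited on) locally.
 */
static uint64_t ccb_route(struct ccb *c)
{
	if (atomic_load_explicit(&c->state, memory_order_acquire) != KVM_ATTACHED)
		return 0;
	return c->exit_handler_pa;
}
```

Note what the flag does not give you: a vCPU that has already passed the check in ccb_route() can still jump through the old pointer after ccb_detach() returns. Closing that window is what the IPI-driven, per-pCPU transition described above is for.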

The Caretaker first evaluates the VM Exit reason. If the exit belongs to
a category that the Caretaker is programmed to resolve natively, it
handles it internally. For example, profiling of guests has identified
the following exit categories for potential local resolution:

* Guest Idle Exits (e.g., HLT): When the guest OS goes idle, it
triggers idle exits. The Caretaker intercepts these and halts the
physical core until the next guest-bound interrupt fires, preserving
host power.

I don't think HLT can be handled entirely here. Either you skip the exit completely or you have to go out to the scheduler. The HLT exit could be skipped unconditionally for an orphaned VM, but while there is a running kernel the caretaker has to run entirely with interrupts off and that limits what you can do.

In fact there is already a blueprint of what can be handled easily in the caretaker, namely vmx_exit_handlers_fastpath()/svm_exit_handlers_fastpath(). Stick to what exists already.

* Timer and APIC Exits: Even an idle guest frequently writes to
interrupt controllers and system registers to configure internal
timers. The Caretaker handles these trivial writes directly,
acknowledging the timer updates.

This depends heavily on the implementation of the hypervisor. For example, it can be done on Intel via the preemption timer, but not on AMD, where an actual hrtimer is needed.

[...]

When the new VMM process spawns, it retrieves the
preserved session and issues LIVEUPDATE_SESSION_RETRIEVE_FD using
its token. LUO invokes KVM's .retrieve() callback to map the
preserved vcpufd back into the new VMM's file descriptor table. As
part of this retrieval process, the host formally brings the
isolated pCPU back online, and the new VMM userspace thread is
attached back to the active VM thread running on the vCPU. Finally,
KVM populates the new KVM Routing Pointers in the CCB and
atomically flips the Host State Flag back to KVM_ATTACHED. This
breaks the Caretaker's spin-wait loop (if it is in this state),
allowing standard KVM operation to resume.

This would also include some kind of serialization of the old VM into the new kernel's struct kvm_vcpu.

Also, some kind of feature negotiation is needed (if that fails, the VMs are terminated unceremoniously), so I believe that the transition into and out of the gap must be synchronous. For example with INIT/SIPI for the entry, and an IPI for the exit?
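A minimal sketch of what such a negotiation could check, assuming the old caretaker hands over a version and a bitmask of the features the orphaned VM currently depends on. The feature bits, struct ct_offer and ct_negotiate() are all invented for illustration; no such ABI exists today.

```c
#include <stdint.h>

/* Hypothetical feature bits; not an existing ABI. */
#define CT_FEAT_HLT_SKIP	(1ull << 0)
#define CT_FEAT_X2APIC_IPI	(1ull << 1)
#define CT_FEAT_PREEMPT_TIMER	(1ull << 2)

/* What the outgoing caretaker advertises for one orphaned VM. */
struct ct_offer {
	uint32_t version;	/* handover format version */
	uint64_t in_use;	/* features the VM currently relies on */
};

/*
 * New-kernel side. Returns 0 if the VM can be adopted, -1 if it has
 * to be terminated. '*missing' receives the features the VM uses
 * that the new caretaker cannot provide.
 */
static int ct_negotiate(const struct ct_offer *old, uint32_t my_version,
			uint64_t my_supported, uint64_t *missing)
{
	*missing = old->in_use & ~my_supported;

	if (old->version != my_version)
		return -1;
	return *missing ? -1 : 0;
}
```

Because the answer can be "no", the check has to complete before the vCPU is allowed to keep running under the new kernel, which is another way of saying the exit from the gap must be synchronous.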

Guest-to-Guest IPIs
-------------------

* The Problem: If the guest OS attempts to wake up a sleeping thread,
one orphaned vCPU will send an Inter-Processor Interrupt (IPI) to
another orphaned vCPU. In standard virtualization without hardware
assistance, writing to the APIC ICR (or sending an ARM SGI) causes
a VM Exit so the host KVM can emulate the message delivery. During
the gap, KVM is unavailable to route this message.

* Proposed Solution: The architecture may leverage hardware virtualized
interrupts (Intel APICv, AMD AVIC, or ARM GICv4.1 virtual SGIs).
This allows the hardware silicon to handle IPI delivery between the
isolated pCPUs natively, eliminating the VM Exit. Alternatively,
the Caretaker can be programmed to emulate the IPI delivery. By
utilizing the shared memory metadata, the Caretaker can determine
the target vCPU and directly update its pending interrupt state.

Yeah, I think APIC emulation to some extent must be moved into the VMX/SVM fastpaths. The good news is that this can be done already as a PoC without needing the whole caretaker and LUO infrastructure.
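As a rough idea of the scope of such a PoC, the sketch below decodes an x2APIC ICR write and, for the simplest case only (fixed delivery, physical destination, no shorthand), sets the vector in the target's pending bitmap. The x2APIC ICR bit layout is architectural; everything else is an assumption: NR_VCPUS, struct vcpu_irq, the vcpus[] array, the fastpath_ipi() name, and the simplification that the APIC ID equals the vCPU index.

```c
#include <stdatomic.h>
#include <stdint.h>

#define NR_VCPUS 4	/* hypothetical VM size */

/* Hypothetical per-vCPU pending state: one bit per vector (256 total). */
struct vcpu_irq {
	_Atomic uint64_t pending[4];
};

static struct vcpu_irq vcpus[NR_VCPUS];

/*
 * x2APIC ICR: vector in bits 7:0, delivery mode in bits 10:8,
 * destination mode in bit 11, shorthand in bits 19:18,
 * destination in bits 63:32.
 *
 * Returns 1 if the IPI was delivered here, 0 if it has to be
 * treated as a blocking exit.
 */
static int fastpath_ipi(uint64_t icr)
{
	uint32_t vector = icr & 0xff;
	uint32_t mode = (icr >> 8) & 0x7;
	uint32_t logical = (icr >> 11) & 0x1;
	uint32_t shorthand = (icr >> 18) & 0x3;
	uint32_t dest = (uint32_t)(icr >> 32);

	/* Only fixed, physical, no-shorthand IPIs to a known vCPU. */
	if (mode != 0 || logical || shorthand != 0)
		return 0;
	if (dest >= NR_VCPUS || vector < 16)
		return 0;

	/* Assumption: APIC ID == index into vcpus[]. */
	atomic_fetch_or_explicit(&vcpus[dest].pending[vector / 64],
				 1ull << (vector % 64),
				 memory_order_release);
	return 1;
}
```

Anything the sketch rejects (logical mode, shorthands, NMI/INIT/SIPI delivery) simply stays a blocking exit. Waking a target that is halted is deliberately left out, since that depends on how HLT ends up being handled.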

* The Problem: What happens if a Non-Maskable Interrupt (NMI), a
hardware timer tick, or a Machine Check Exception / System Error
(MCE / ARM SError) arrives while the CPU is actively executing
Caretaker code in KVM_DETACHED mode?

* Proposed Solution: To safely handle these asynchronous events, [...]
on x86, when transitioning into the gap, KVM explicitly programs
HOST_IDTR and HOST_GDTR to [the caretaker's] tables.

Agreed and this also shows that the transition must be synchronous.

* The Problem: As the guest executes, it may attempt to access memory
that has not yet been mapped by the hypervisor, or it may interact
with MMIO regions. Normally, this triggers an EPT Violation (Intel)
or NPT Page Fault (AMD), prompting KVM to allocate host pages and
update the secondary page tables. How are these updates handled
when the host KVM subsystem is offline during the gap?

* Proposed Solution: During the "Management Gap," there are absolutely
no updates made to the NPT/EPT. The existing secondary page tables
are fully preserved in memory via LUO kvmfd preservation prior to
detachment, allowing the guest to seamlessly access all previously
mapped memory. If the guest triggers a new page fault (requiring an
NPT/EPT update) during the gap, the Caretaker simply categorizes it
as a Blocking Exit.

Yes, by default everything is a blocking exit. In particular, unless one day we do x86/pKVM, page tables can be handled entirely by Linux rather than the caretaker with no change to the existing MMU notifier architecture.

As a consequence, the caretaker is absolutely not going to be a TCB---at least not in the beginning.
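The "everything is a blocking exit by default" rule amounts to an allowlist. A minimal sketch, using the Intel basic exit reason numbers for the cases it names; the enum, the ct_classify() name and the choice of which exits are handled locally are illustrative assumptions, not a proposal for the actual list.

```c
#include <stdint.h>

enum ct_action {
	CT_LOCAL,	/* caretaker resolves the exit itself */
	CT_BLOCK,	/* park the vCPU until a kernel is back */
};

/* Intel basic exit reasons (illustrative subset). */
#define EXIT_REASON_HLT			12
#define EXIT_REASON_EPT_VIOLATION	48
#define EXIT_REASON_PREEMPTION_TIMER	52

/*
 * Classify one exit. 'orphaned' is nonzero while no kernel is
 * attached. Anything not explicitly listed falls through to
 * CT_BLOCK, including EPT violations.
 */
static enum ct_action ct_classify(uint32_t exit_reason, int orphaned)
{
	switch (exit_reason & 0xffff) {	/* basic exit reason */
	case EXIT_REASON_PREEMPTION_TIMER:
		/* Assumed to be locally resolvable; hypothetical. */
		return CT_LOCAL;
	case EXIT_REASON_HLT:
		/* Only skippable when there is no scheduler to go to. */
		return orphaned ? CT_LOCAL : CT_BLOCK;
	default:
		return CT_BLOCK;
	}
}
```

With this shape, growing the caretaker means adding a case, and a missing case fails safe: the guest stalls until reattachment rather than being emulated incorrectly.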

Compromised Caretaker
---------------------

* The Problem: The Caretaker runs in Host Mode. If left unprotected,
this could allow a lightly privileged userspace process (e.g., QEMU
or crosvm) to inject arbitrary executable code directly into the
CPU's most privileged hardware state (VMX Root / Ring 0 / EL2).

* Proposed Solution: To mitigate this risk, the KVM_SET_CARETAKER
ioctl may adopt the security model used by the kexec_file_load()
syscall. Rather than trusting userspace to pass physical addresses,
the kernel must take full ownership of payload validation:

-EOVERENGINEERED. Just shove it into the kernel.

Caretaker Update
----------------

* The Problem: Given that the Caretaker is permanently installed
during VM setup, how does it get updated on long-running VMs?

Via kexec. :) I understand you have bigger plans, but we need to crawl before walk^Wattempting a marathon.

I even wonder if, for long-term simplicity, the host->caretaker interface should just be for the caretaker to swallow the host into non-root mode, again as in Arm nVHE. That would make it much harder to implement some kind of live update, but my answer to that *really* is just to use kexec.

Paolo