Re: [PATCH v14 14/44] arm64: RMI: Basic infrastructure for creating a realm.
From: Marc Zyngier
Date: Thu May 28 2026 - 03:10:20 EST
On Wed, 13 May 2026 14:17:22 +0100,
Steven Price <steven.price@xxxxxxx> wrote:
>
> Introduce the skeleton functions for creating and destroying a realm.
> The IPA size requested is checked against what the RMM supports.
>
> The actual work of constructing the realm will be added in future
> patches.
Again, $SUBJECT doesn't reflect that this is purely a KVM patch.
>
> Signed-off-by: Steven Price <steven.price@xxxxxxx>
> ---
> Changes since v13:
> * Rebased and updated to RMM-v2.0-bet1.
> * Auxiliary granules have been removed in RMM-v2.0-bet1
> Changes since v12:
> * Drop the RMM_PAGE_{SHIFT,SIZE} defines - the RMM is now configured to
> be the same as the host's page size.
> * Rework delegate/undelegate functions to use the new RMI range based
> operations.
> Changes since v11:
> * Major rework to drop the realm configuration and make the
> construction of realms implicit rather than driven by the VMM
> directly.
> * The code to create RDs, handle VMIDs etc is moved to later patches.
> Changes since v10:
> * Rename from RME to RMI.
> * Move the stage2 cleanup to a later patch.
> Changes since v9:
> * Avoid walking the stage 2 page tables when destroying the realm -
> the real ones are not accessible to the non-secure world, and the RMM
> may leave junk in the physical pages when returning them.
> * Fix an error path in realm_create_rd() to actually return an error value.
> Changes since v8:
> * Fix free_delegated_granule() to not call kvm_account_pgtable_pages();
> a separate wrapper will be introduced in a later patch to deal with
> RTTs.
> * Minor code cleanups following review.
> Changes since v7:
> * Minor code cleanup following Gavin's review.
> Changes since v6:
> * Separate RMM RTT calculations from host PAGE_SIZE. This allows the
> host page size to be larger than 4k while still communicating with an
> RMM which uses 4k granules.
> Changes since v5:
> * Introduce free_delegated_granule() to replace many
> undelegate/free_page() instances and centralise the comment on
> leaking when the undelegate fails.
> * Several other minor improvements suggested by reviews - thanks for
> the feedback!
> Changes since v2:
> * Improved commit description.
> * Improved return failures for rmi_check_version().
> * Clear contents of PGD after it has been undelegated in case the RMM
> left stale data.
> * Minor changes to reflect changes in previous patches.
> ---
> arch/arm64/include/asm/kvm_emulate.h | 29 ++++++++++++++
> arch/arm64/include/asm/kvm_rmi.h | 51 +++++++++++++++++++++++++
> arch/arm64/kvm/arm.c | 12 ++++++
> arch/arm64/kvm/mmu.c | 12 +++++-
> arch/arm64/kvm/rmi.c | 57 ++++++++++++++++++++++++++++
> 5 files changed, 159 insertions(+), 2 deletions(-)
>
> diff --git a/arch/arm64/include/asm/kvm_emulate.h b/arch/arm64/include/asm/kvm_emulate.h
> index 5bf3d7e1d92c..82fd777bd9bb 100644
> --- a/arch/arm64/include/asm/kvm_emulate.h
> +++ b/arch/arm64/include/asm/kvm_emulate.h
> @@ -688,4 +688,33 @@ static inline void vcpu_set_hcrx(struct kvm_vcpu *vcpu)
> vcpu->arch.hcrx_el2 |= HCRX_EL2_EnASR;
> }
> }
> +
> +static inline bool kvm_is_realm(struct kvm *kvm)
> +{
> + if (static_branch_unlikely(&kvm_rmi_is_available))
> + return kvm->arch.is_realm;
> + return false;
> +}
> +
> +static inline enum realm_state kvm_realm_state(struct kvm *kvm)
> +{
> + return READ_ONCE(kvm->arch.realm.state);
> +}
> +
> +static inline void kvm_set_realm_state(struct kvm *kvm,
> + enum realm_state new_state)
> +{
> + WRITE_ONCE(kvm->arch.realm.state, new_state);
> +}
> +
> +static inline bool kvm_realm_is_created(struct kvm *kvm)
> +{
> + return kvm_is_realm(kvm) && kvm_realm_state(kvm) != REALM_STATE_NONE;
> +}
> +
> +static inline bool vcpu_is_rec(const struct kvm_vcpu *vcpu)
> +{
> + return false;
> +}
> +
> #endif /* __ARM64_KVM_EMULATE_H__ */
> diff --git a/arch/arm64/include/asm/kvm_rmi.h b/arch/arm64/include/asm/kvm_rmi.h
> index 4936007947fd..9de34983ee52 100644
> --- a/arch/arm64/include/asm/kvm_rmi.h
> +++ b/arch/arm64/include/asm/kvm_rmi.h
> @@ -6,12 +6,63 @@
> #ifndef __ASM_KVM_RMI_H
> #define __ASM_KVM_RMI_H
>
> +#include <asm/rmi_smc.h>
> +
> +/**
> + * enum realm_state - State of a Realm
> + */
> +enum realm_state {
> + /**
> + * @REALM_STATE_NONE:
> + * Realm has not yet been created. rmi_realm_create() has not
> + * yet been called.
> + */
> + REALM_STATE_NONE,
> + /**
> + * @REALM_STATE_NEW:
> + * Realm is under construction, rmi_realm_create() has been
> + * called, but it is not yet activated. Pages may be populated.
> + */
> + REALM_STATE_NEW,
> + /**
> + * @REALM_STATE_ACTIVE:
> + * Realm has been created and is eligible for execution with
> + * rmi_rec_enter(). Pages may no longer be populated with
> + * rmi_data_create().
> + */
> + REALM_STATE_ACTIVE,
> + /**
> + * @REALM_STATE_DYING:
> + * Realm is in the process of being destroyed or has already been
> + * destroyed.
> + */
> + REALM_STATE_DYING,
> + /**
> + * @REALM_STATE_DEAD:
> + * Realm has been destroyed.
> + */
> + REALM_STATE_DEAD
> +};
What is the ABI status of this state? Is it purely internal to KVM? Or
is it something that the RMM actively tracks?
> +
> /**
> * struct realm - Additional per VM data for a Realm
> + *
> + * @state: The lifetime state machine for the realm
> + * @rd: Kernel mapping of the Realm Descriptor (RD)
> + * @params: Parameters for the RMI_REALM_CREATE command
> + * @ia_bits: Number of valid Input Address bits in the IPA
> */
> struct realm {
> + enum realm_state state;
> + void *rd;
Why is this void? Doesn't it have a proper type?
> + struct realm_params *params;
> + unsigned int ia_bits;
Consider reordering this structure to avoid holes.
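
To make the point concrete, here is a minimal standalone sketch, not the
actual kernel definition. It assumes a 4-byte enum and 8-byte pointers
(the usual arm64 LP64 case); the names realm_state_sketch, realm_as_posted
and realm_reordered are made up for illustration:

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for the real type; only a pointer to it is stored. */
struct realm_params;

/* Hypothetical stand-in for enum realm_state from the patch. */
enum realm_state_sketch {
	SKETCH_NONE,
	SKETCH_NEW,
	SKETCH_ACTIVE,
	SKETCH_DYING,
	SKETCH_DEAD
};

/* Field order as posted: a 32-bit field on either side of the pointers. */
struct realm_as_posted {
	enum realm_state_sketch state;
	void *rd;
	struct realm_params *params;
	unsigned int ia_bits;
};

/* One possible reordering: pointers first, 32-bit fields packed together. */
struct realm_reordered {
	void *rd;
	struct realm_params *params;
	enum realm_state_sketch state;
	unsigned int ia_bits;
};

/* Sum of the member sizes, identical for both layouts. */
static size_t realm_members_size(void)
{
	return sizeof(enum realm_state_sketch) + sizeof(void *) +
	       sizeof(struct realm_params *) + sizeof(unsigned int);
}

/* Padding is whatever the struct occupies beyond its members. */
static size_t realm_padding_as_posted(void)
{
	return sizeof(struct realm_as_posted) - realm_members_size();
}

static size_t realm_padding_reordered(void)
{
	return sizeof(struct realm_reordered) - realm_members_size();
}
```

Under those assumptions the posted order carries two 4-byte holes (after
'state' and after 'ia_bits'), and the reordered one carries none. pahole
on the real object file is the authoritative check.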
> };
>
> void kvm_init_rmi(void);
> +u32 kvm_realm_ipa_limit(void);
The use of 'realm' is confusing. This is not a per-realm property, but
something global. I'd rather reserve the term 'realm' for CCA VMs (cue
the two prototypes below).
> +
> +int kvm_init_realm(struct kvm *kvm);
> +void kvm_destroy_realm(struct kvm *kvm);
>
> #endif /* __ASM_KVM_RMI_H */
> diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
> index 247e03b33035..18251e561524 100644
> --- a/arch/arm64/kvm/arm.c
> +++ b/arch/arm64/kvm/arm.c
> @@ -264,6 +264,13 @@ int kvm_arch_init_vm(struct kvm *kvm, unsigned long type)
>
> bitmap_zero(kvm->arch.vcpu_features, KVM_VCPU_MAX_FEATURES);
>
> + /* Initialise the realm bits after the generic bits are enabled */
> + if (kvm_is_realm(kvm)) {
> + ret = kvm_init_realm(kvm);
> + if (ret)
> + goto err_uninit_mmu;
> + }
> +
> return 0;
>
> err_uninit_mmu:
> @@ -326,6 +333,8 @@ void kvm_arch_destroy_vm(struct kvm *kvm)
> kvm_unshare_hyp(kvm, kvm + 1);
>
> kvm_arm_teardown_hypercalls(kvm);
> + if (kvm_is_realm(kvm))
> + kvm_destroy_realm(kvm);
> }
>
> static bool kvm_has_full_ptr_auth(void)
> @@ -486,6 +495,9 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
> else
> r = kvm_supports_cacheable_pfnmap();
> break;
> + case KVM_CAP_ARM_RMI:
> + r = static_key_enabled(&kvm_rmi_is_available);
> + break;
>
> default:
> r = 0;
> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> index d089c107d9b7..ba8286472286 100644
> --- a/arch/arm64/kvm/mmu.c
> +++ b/arch/arm64/kvm/mmu.c
> @@ -877,10 +877,14 @@ static struct kvm_pgtable_mm_ops kvm_s2_mm_ops = {
>
> static int kvm_init_ipa_range(struct kvm_s2_mmu *mmu, unsigned long type)
> {
> + struct kvm *kvm = kvm_s2_mmu_to_kvm(mmu);
> u32 kvm_ipa_limit = get_kvm_ipa_limit();
> u64 mmfr0, mmfr1;
> u32 phys_shift;
>
> + if (kvm_is_realm(kvm))
> + kvm_ipa_limit = kvm_realm_ipa_limit();
> +
> phys_shift = KVM_VM_TYPE_ARM_IPA_SIZE(type);
> if (is_protected_kvm_enabled()) {
> phys_shift = kvm_ipa_limit;
> @@ -974,6 +978,8 @@ int kvm_init_stage2_mmu(struct kvm *kvm, struct kvm_s2_mmu *mmu, unsigned long t
> return -EINVAL;
> }
>
> + mmu->arch = &kvm->arch;
> +
> err = kvm_init_ipa_range(mmu, type);
> if (err)
> return err;
> @@ -982,7 +988,6 @@ int kvm_init_stage2_mmu(struct kvm *kvm, struct kvm_s2_mmu *mmu, unsigned long t
> if (!pgt)
> return -ENOMEM;
>
> - mmu->arch = &kvm->arch;
Why is this init being moved?
> err = KVM_PGT_FN(kvm_pgtable_stage2_init)(pgt, mmu, &kvm_s2_mm_ops);
> if (err)
> goto out_free_pgtable;
> @@ -1114,7 +1119,10 @@ void kvm_free_stage2_pgd(struct kvm_s2_mmu *mmu)
> write_unlock(&kvm->mmu_lock);
>
> if (pgt) {
> - kvm_stage2_destroy(pgt);
> + if (!kvm_is_realm(kvm))
> + kvm_stage2_destroy(pgt);
> + else
> + kvm_pgtable_stage2_destroy_pgd(pgt);
Why can't you make kvm_stage2_destroy() do the right thing? Surely the
PTs have to be reclaimed one way or another.
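
Something along these lines is what I have in mind. This is a mock with
made-up types and names (pgtable_mock, stage2_destroy_sketch and the two
helpers are all hypothetical), only meant to show the realm check living
inside the destroy helper instead of at the call site:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the page-table state; not a kernel type. */
struct pgtable_mock {
	bool is_realm;
	int walked;	/* times the table walk/teardown ran */
	int pgd_freed;	/* times the entry-level table was released */
};

/* Mock of the step that walks and tears down the non-PGD levels. */
static void destroy_walk_mock(struct pgtable_mock *pgt)
{
	pgt->walked++;
}

/* Mock of the step that releases the entry-level table. */
static void destroy_pgd_mock(struct pgtable_mock *pgt)
{
	pgt->pgd_freed++;
}

/*
 * Single entry point: the realm special case is handled here, so the
 * caller does not need to know which flavour of VM it is dealing with.
 */
static void stage2_destroy_sketch(struct pgtable_mock *pgt)
{
	if (!pgt->is_realm)
		destroy_walk_mock(pgt);

	destroy_pgd_mock(pgt);
}
```

Whether skipping the walk is actually the right thing for a realm is
exactly the question above; the sketch only illustrates where the
decision could sit.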
> kfree(pgt);
> }
> }
> diff --git a/arch/arm64/kvm/rmi.c b/arch/arm64/kvm/rmi.c
> index 6e28b669ded2..f51ec667445e 100644
> --- a/arch/arm64/kvm/rmi.c
> +++ b/arch/arm64/kvm/rmi.c
> @@ -5,6 +5,8 @@
>
> #include <linux/kvm_host.h>
>
> +#include <asm/kvm_emulate.h>
> +#include <asm/kvm_mmu.h>
> #include <asm/kvm_pgtable.h>
> #include <asm/rmi_cmds.h>
> #include <asm/virt.h>
> @@ -14,6 +16,61 @@ static bool rmi_has_feature(unsigned long feature)
> return !!u64_get_bits(rmm_feat_reg0, feature);
> }
>
> +u32 kvm_realm_ipa_limit(void)
> +{
> + return u64_get_bits(rmm_feat_reg0, RMI_FEATURE_REGISTER_0_S2SZ);
> +}
> +
> +void kvm_destroy_realm(struct kvm *kvm)
> +{
> + struct realm *realm = &kvm->arch.realm;
> + size_t pgd_size = kvm_pgtable_stage2_pgd_size(kvm->arch.mmu.vtcr);
> +
> + if (realm->params) {
> + free_page((unsigned long)realm->params);
> + realm->params = NULL;
> + }
> +
> + if (!kvm_realm_is_created(kvm))
> + return;
> +
> + kvm_set_realm_state(kvm, REALM_STATE_DYING);
> +
> + write_lock(&kvm->mmu_lock);
> + kvm_stage2_unmap_range(&kvm->arch.mmu, 0,
> + BIT(realm->ia_bits - 1), true);
> + write_unlock(&kvm->mmu_lock);
> +
> + if (realm->rd) {
> + phys_addr_t rd_phys = virt_to_phys(realm->rd);
> +
> + if (WARN_ON(rmi_realm_terminate(rd_phys)))
> + return;
> +
> + if (WARN_ON(rmi_realm_destroy(rd_phys)))
> + return;
> + free_delegated_page(rd_phys);
> + realm->rd = NULL;
> + }
> +
> + if (WARN_ON(rmi_undelegate_range(kvm->arch.mmu.pgd_phys, pgd_size)))
> + return;
> +
> + kvm_set_realm_state(kvm, REALM_STATE_DEAD);
> +
> + /* Now that the Realm is destroyed, free the entry level RTTs */
> + kvm_free_stage2_pgd(&kvm->arch.mmu);
> +}
This really needs documentation: what happens at each stage? What
memory is reclaimed when?
But even more importantly, why is this built in a completely parallel
way, potentially deviating from the existing KVM S2 management?
Thanks,
M.
--
Without deviation from the norm, progress is not possible.