Re: [PATCH v7 18/45] arm64: RME: Handle RMI_EXIT_RIPAS_CHANGE

From: Gavin Shan
Date: Tue Apr 08 2025 - 20:14:09 EST


Hi Steve,

On 4/8/25 2:34 AM, Steven Price wrote:
On 04/03/2025 04:35, Gavin Shan wrote:
On 2/14/25 2:13 AM, Steven Price wrote:
The guest can request that a region of its protected address space is
switched between RIPAS_RAM and RIPAS_EMPTY (and back) using
RSI_IPA_STATE_SET. This causes a guest exit with the
RMI_EXIT_RIPAS_CHANGE code. We treat this as a request to convert a
protected region to unprotected (or back), exiting to the VMM to make
the necessary changes to the guest_memfd and memslot mappings. On the
next entry the RIPAS changes are committed by making RMI_RTT_SET_RIPAS
calls.

The VMM may wish to reject the RIPAS change requested by the guest. For
now it can only do so by no longer scheduling the VCPU, as we don't
currently have a use case for returning that rejection to the guest, but
by postponing the RMI_RTT_SET_RIPAS changes to entry we leave the door
open for adding a new ioctl in the future for this purpose.

Signed-off-by: Steven Price <steven.price@xxxxxxx>
---
New patch for v7: The code was previously split awkwardly between two
other patches.
---
  arch/arm64/kvm/rme.c | 87 ++++++++++++++++++++++++++++++++++++++++++++
  1 file changed, 87 insertions(+)


With the following comments addressed:

Reviewed-by: Gavin Shan <gshan@xxxxxxxxxx>

diff --git a/arch/arm64/kvm/rme.c b/arch/arm64/kvm/rme.c
index 507eb4b71bb7..f965869e9ef7 100644
--- a/arch/arm64/kvm/rme.c
+++ b/arch/arm64/kvm/rme.c
@@ -624,6 +624,64 @@ void kvm_realm_unmap_range(struct kvm *kvm, unsigned long start, u64 size,
          realm_unmap_private_range(kvm, start, end);
  }
 
+static int realm_set_ipa_state(struct kvm_vcpu *vcpu,
+                   unsigned long start,
+                   unsigned long end,
+                   unsigned long ripas,
+                   unsigned long *top_ipa)
+{
+    struct kvm *kvm = vcpu->kvm;
+    struct realm *realm = &kvm->arch.realm;
+    struct realm_rec *rec = &vcpu->arch.rec;
+    phys_addr_t rd_phys = virt_to_phys(realm->rd);
+    phys_addr_t rec_phys = virt_to_phys(rec->rec_page);
+    struct kvm_mmu_memory_cache *memcache = &vcpu->arch.mmu_page_cache;
+    unsigned long ipa = start;
+    int ret = 0;
+
+    while (ipa < end) {
+        unsigned long next;
+
+        ret = rmi_rtt_set_ripas(rd_phys, rec_phys, ipa, end, &next);
+

This doesn't look correct to me. Looking at RMM::smc_rtt_set_ripas(),
it's possible for the SMC call to return without updating 'next' to a
valid address. In that case, the garbage content resident in 'next'
would be assigned to 'ipa' in the next iteration. So we need to
initialize it in advance, like below.

    unsigned long ipa = start;
    unsigned long next = start;

    while (ipa < end) {
        ret = rmi_rtt_set_ripas(rd_phys, rec_phys, ipa, end, &next);

I agree this might not be the clearest code, but 'next' should be set if
the return state is RMI_SUCCESS, and we don't actually get to the "ipa =
next" line unless that is the case. But I'll rejig things because it's
not clear.


Yes, 'next' is always updated when RMI_SUCCESS is returned. However, 'next'
won't be updated when RMI_ERROR_RTT is returned. I had overlooked that when
RMI_ERROR_RTT is returned in the first iteration, 'ipa' is kept intact and
is retried after the stage-2 page table is populated. So everything should
be fine.

+        if (RMI_RETURN_STATUS(ret) == RMI_ERROR_RTT) {
+            int walk_level = RMI_RETURN_INDEX(ret);
+            int level = find_map_level(realm, ipa, end);
+
+            /*
+             * If the RMM walk ended early then more tables are
+             * needed to reach the required depth to set the RIPAS.
+             */
+            if (walk_level < level) {
+                ret = realm_create_rtt_levels(realm, ipa,
+                                  walk_level,
+                                  level,
+                                  memcache);
+                /* Retry with RTTs created */
+                if (!ret)
+                    continue;
+            } else {
+                ret = -EINVAL;
+            }
+
+            break;
+        } else if (RMI_RETURN_STATUS(ret) != RMI_SUCCESS) {
+            WARN(1, "Unexpected error in %s: %#x\n", __func__,
+                 ret);
+            ret = -EINVAL;

            ret = -ENXIO;

Ack

+            break;
+        }
+        ipa = next;
+    }
+
+    *top_ipa = ipa;
+
+    if (ripas == RMI_EMPTY && ipa != start)
+        realm_unmap_private_range(kvm, start, ipa);
+
+    return ret;
+}
+
  static int realm_init_ipa_state(struct realm *realm,
                  unsigned long ipa,
                  unsigned long end)
@@ -863,6 +921,32 @@ void kvm_destroy_realm(struct kvm *kvm)
      kvm_free_stage2_pgd(&kvm->arch.mmu);
  }
 
+static void kvm_complete_ripas_change(struct kvm_vcpu *vcpu)
+{
+    struct kvm *kvm = vcpu->kvm;
+    struct realm_rec *rec = &vcpu->arch.rec;
+    unsigned long base = rec->run->exit.ripas_base;
+    unsigned long top = rec->run->exit.ripas_top;
+    unsigned long ripas = rec->run->exit.ripas_value;
+    unsigned long top_ipa;
+    int ret;
+

Some checks are needed here to ensure the addresses (@base and @top)
fall inside the protected (private) space, for two reasons: (1) those
parameters originate from the guest, which can be misbehaving;
(2) RMM::smc_rtt_set_ripas() isn't limited to the private space, meaning
it can also change RIPAS for ranges in the shared space.

I might be missing something, but AFAICT this is safe:

1. The RMM doesn't permit RIPAS changes within the shared space:
RSI_IPA_STATE_SET has a precondition [rgn_bound]:
AddrRangeIsProtected(base, top, realm)
So a malicious guest shouldn't get past the RMM.

2. The RMM validates that the range passed here is (a subset of) the
one provided to the NS-world [base_bound / top_bound].

And even if a malicious guest somehow managed to bypass these checks, I
don't see how having the host operate on the wrong region would cause harm.

I'm happy to be corrected though! What am I missing?


No, you didn't miss anything, I did. I missed that the requested address range
is ensured to be part of the private space by RMM::handle_rsi_ipa_state_set().
So everything should be fine.

void handle_rsi_ipa_state_set(struct rec *rec,
			      struct rmi_rec_exit *rec_exit,
			      struct rsi_result *res)
{
	:
	if ((ripas_val > RIPAS_RAM) ||
	    !GRANULE_ALIGNED(base) || !GRANULE_ALIGNED(top) ||
	    (top <= base) || /* Size is zero, or range overflows */
	    !region_in_rec_par(rec, base, top)) {
		res->action = UPDATE_REC_RETURN_TO_REALM;
		res->smc_res.x[0] = RSI_ERROR_INPUT;
		return;
	}
	:
}


Thanks,
Steve

+    do {
+        kvm_mmu_topup_memory_cache(&vcpu->arch.mmu_page_cache,
+                       kvm_mmu_cache_min_pages(vcpu->arch.hw_mmu));
+        write_lock(&kvm->mmu_lock);
+        ret = realm_set_ipa_state(vcpu, base, top, ripas, &top_ipa);
+        write_unlock(&kvm->mmu_lock);
+
+        if (WARN_RATELIMIT(ret && ret != -ENOMEM,
+                   "Unable to satisfy RIPAS_CHANGE for %#lx - %#lx, ripas: %#lx\n",
+                   base, top, ripas))
+            break;
+
+        base = top_ipa;
+    } while (top_ipa < top);
+}
+
  int kvm_rec_enter(struct kvm_vcpu *vcpu)
  {
      struct realm_rec *rec = &vcpu->arch.rec;
@@ -873,6 +957,9 @@ int kvm_rec_enter(struct kvm_vcpu *vcpu)
          for (int i = 0; i < REC_RUN_GPRS; i++)
              rec->run->enter.gprs[i] = vcpu_get_reg(vcpu, i);
          break;
+    case RMI_EXIT_RIPAS_CHANGE:
+        kvm_complete_ripas_change(vcpu);
+        break;
      }
        if (kvm_realm_state(vcpu->kvm) != REALM_STATE_ACTIVE)


Thanks,
Gavin