Re: [PATCH v6 7/8] arm64: KVM: Write TRFCR value on guest switch with nVHE

From: Marc Zyngier
Date: Mon Feb 26 2024 - 12:50:32 EST


On Mon, 26 Feb 2024 11:30:35 +0000,
James Clark <james.clark@xxxxxxx> wrote:
>
> The guest value for TRFCR requested by the Coresight driver is saved in
> kvm_guest_trfcr. On guest switch this value needs to be written to
> the register. Currently TRFCR is only modified when we want to disable
> trace completely in guests due to an issue with TRBE. Expand the
> __debug_save_trace() function to always write to the register if a
> different value for guests is required, but also keep the existing TRBE
> disable behavior if that's required.
>
> In pKVM, the kvm_guest_trfcr can't be read and the host isn't trusted,
> so always disable trace.
>
> __debug_restore_trace() now has to restore unconditionally, because even
> a value of 0 needs to be written to overwrite whatever was set for the
> guest.
>
> Reviewed-by: Suzuki K Poulose <suzuki.poulose@xxxxxxx>
> Signed-off-by: James Clark <james.clark@xxxxxxx>
> ---
> arch/arm64/kvm/hyp/nvhe/debug-sr.c | 53 +++++++++++++++++-------------
> 1 file changed, 31 insertions(+), 22 deletions(-)
>
> diff --git a/arch/arm64/kvm/hyp/nvhe/debug-sr.c b/arch/arm64/kvm/hyp/nvhe/debug-sr.c
> index 4558c02eb352..3adac2e01908 100644
> --- a/arch/arm64/kvm/hyp/nvhe/debug-sr.c
> +++ b/arch/arm64/kvm/hyp/nvhe/debug-sr.c
> @@ -51,30 +51,39 @@ static void __debug_restore_spe(u64 pmscr_el1)
> write_sysreg_s(pmscr_el1, SYS_PMSCR_EL1);
> }
>
> -static void __debug_save_trace(u64 *trfcr_el1)
> +static void __debug_save_trace(struct kvm_vcpu *vcpu)
> {
> - *trfcr_el1 = 0;
> -
> - /* Check if the TRBE is enabled */
> - if (!(read_sysreg_s(SYS_TRBLIMITR_EL1) & TRBLIMITR_EL1_E))
> - return;
> - /*
> - * Prohibit trace generation while we are in guest.
> - * Since access to TRFCR_EL1 is trapped, the guest can't
> - * modify the filtering set by the host.
> - */
> - *trfcr_el1 = read_sysreg_s(SYS_TRFCR_EL1);
> - write_sysreg_s(0, SYS_TRFCR_EL1);
> - isb();
> - /* Drain the trace buffer to memory */
> - tsb_csync();
> + u64 host_trfcr_el1 = read_sysreg_s(SYS_TRFCR_EL1);
> + u64 guest_trfcr_el1;
> +
> + vcpu->arch.host_debug_state.trfcr_el1 = host_trfcr_el1;

Huh, this madness has to stop. See patch below. The short story is
that we have to stop shoving host state in vcpus. This is gross, and a
stupid waste of memory.

> +
> + /* Check if the TRBE buffer or pKVM is enabled */
> + if (is_protected_kvm_enabled() ||
> + (vcpu_get_flag(vcpu, DEBUG_STATE_SAVE_TRBE) &&
> + read_sysreg_s(SYS_TRBLIMITR_EL1) & TRBLIMITR_EL1_E)) {
> + /*
> + * Prohibit trace generation while we are in guest. Since access
> + * to TRFCR_EL1 is trapped, the guest can't modify the filtering
> + * set by the host.
> + */
> + write_sysreg_s(0, SYS_TRFCR_EL1);
> + isb();
> + /* Drain the trace buffer to memory */
> + tsb_csync();
> + } else {
> + /*
> + * Tracing is allowed, apply the filters provided by the
> + * Coresight driver.
> + */
> + guest_trfcr_el1 = kvm_guest_trfcr[vcpu->cpu];
> + if (host_trfcr_el1 != guest_trfcr_el1)
> + write_sysreg_s(guest_trfcr_el1, SYS_TRFCR_EL1);

So we have 3 pieces of storage for TRFCR_EL1:

- the system register itself
- the copy in host_debug_state, which is only used transiently
- another version in kvm_guest_trfcr, provided by Coresight

Why do we need to save anything if nothing was enabled, which is *all
the time*? I'm sorry to break it to you, but nobody uses these
features. So I'd like them to have zero cost when not in use.

> + }
> }
>
> static void __debug_restore_trace(u64 trfcr_el1)
> {
> - if (!trfcr_el1)
> - return;
> -
> /* Restore trace filter controls */
> write_sysreg_s(trfcr_el1, SYS_TRFCR_EL1);
> }
> @@ -85,8 +94,8 @@ void __debug_save_host_buffers_nvhe(struct kvm_vcpu *vcpu)
> if (vcpu_get_flag(vcpu, DEBUG_STATE_SAVE_SPE))
> __debug_save_spe(&vcpu->arch.host_debug_state.pmscr_el1);
> /* Disable and flush Self-Hosted Trace generation */
> - if (vcpu_get_flag(vcpu, DEBUG_STATE_SAVE_TRBE))
> - __debug_save_trace(&vcpu->arch.host_debug_state.trfcr_el1);
> + if (vcpu_get_flag(vcpu, DEBUG_STATE_SAVE_TRFCR))
> + __debug_save_trace(vcpu);

The more I read this code, the less I understand why we need these
flags. DEBUG_STATE_SAVE_TRFCR really means "I support TRF". But we
already have that information in the ID registers, and we could cache
it on a per-physical CPU basis instead of per-vcpu. Hell, on an
homogeneous system, this could even be a static key. Do we even have
systems out there where only half the CPUs support TRF?

Then we check whether TRBE is enabled. But if it isn't, we randomly
write whatever is in kvm_guest_trfcr[]? Why would we do that? Surely
there is something there that should say "yup, tracing" or not (such
as the enable bits), which would avoid hitting the sysreg pointlessly?

I really think that this logic should be:

- strictly based on ID registers or even better, static keys
- result in close to 0 system register access when not in use
- avoid storing state that doesn't need to be stored

Thanks,

M.

From a3e98d8428d854209f0e97aa38d1bee347c503f2 Mon Sep 17 00:00:00 2001
From: Marc Zyngier <maz@xxxxxxxxxx>
Date: Mon, 26 Feb 2024 15:58:46 +0000
Subject: [PATCH] KVM: arm64: Exclude host_debug_data from vcpu_arch

Keeping host_debug_state on a per-vcpu basis is completely
pointless. The lifetime of this data is only that of the inner
run-loop, which means it is never accessed outside of the core
EL2 code.

Move the structure into kvm_host_data, and save over 500 bytes
per vcpu.

Signed-off-by: Marc Zyngier <maz@xxxxxxxxxx>
---
arch/arm64/include/asm/kvm_host.h | 31 +++++++++++++----------
arch/arm64/kvm/hyp/include/hyp/debug-sr.h | 4 +--
arch/arm64/kvm/hyp/nvhe/debug-sr.c | 8 +++---
3 files changed, 23 insertions(+), 20 deletions(-)

diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
index addf79ba8fa0..599de77a232f 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -601,6 +601,19 @@ struct kvm_cpu_context {

struct kvm_host_data {
struct kvm_cpu_context host_ctxt;
+
+ /*
+ * host_debug_state contains the host registers which are
+ * saved and restored during world switches.
+ */
+ struct {
+ /* {Break,watch}point registers */
+ struct kvm_guest_debug_arch regs;
+ /* Statistical profiling extension */
+ u64 pmscr_el1;
+ /* Self-hosted trace */
+ u64 trfcr_el1;
+ } host_debug_state;
};

struct kvm_host_psci_config {
@@ -695,11 +708,10 @@ struct kvm_vcpu_arch {
* We maintain more than a single set of debug registers to support
* debugging the guest from the host and to maintain separate host and
* guest state during world switches. vcpu_debug_state are the debug
- * registers of the vcpu as the guest sees them. host_debug_state are
- * the host registers which are saved and restored during
- * world switches. external_debug_state contains the debug
- * values we want to debug the guest. This is set via the
- * KVM_SET_GUEST_DEBUG ioctl.
+ * registers of the vcpu as the guest sees them.
+ *
+ * external_debug_state contains the debug values we want to debug the
+ * guest. This is set via the KVM_SET_GUEST_DEBUG ioctl.
*
* debug_ptr points to the set of debug registers that should be loaded
* onto the hardware when running the guest.
@@ -711,15 +723,6 @@ struct kvm_vcpu_arch {
struct user_fpsimd_state *host_fpsimd_state; /* hyp VA */
struct task_struct *parent_task;

- struct {
- /* {Break,watch}point registers */
- struct kvm_guest_debug_arch regs;
- /* Statistical profiling extension */
- u64 pmscr_el1;
- /* Self-hosted trace */
- u64 trfcr_el1;
- } host_debug_state;
-
/* VGIC state */
struct vgic_cpu vgic_cpu;
struct arch_timer_cpu timer_cpu;
diff --git a/arch/arm64/kvm/hyp/include/hyp/debug-sr.h b/arch/arm64/kvm/hyp/include/hyp/debug-sr.h
index 961bbef104a6..d2a40eb82f15 100644
--- a/arch/arm64/kvm/hyp/include/hyp/debug-sr.h
+++ b/arch/arm64/kvm/hyp/include/hyp/debug-sr.h
@@ -137,7 +137,7 @@ static inline void __debug_switch_to_guest_common(struct kvm_vcpu *vcpu)

host_ctxt = &this_cpu_ptr(&kvm_host_data)->host_ctxt;
guest_ctxt = &vcpu->arch.ctxt;
- host_dbg = &vcpu->arch.host_debug_state.regs;
+ host_dbg = &this_cpu_ptr(&kvm_host_data)->host_debug_state.regs;
guest_dbg = kern_hyp_va(vcpu->arch.debug_ptr);

__debug_save_state(host_dbg, host_ctxt);
@@ -156,7 +156,7 @@ static inline void __debug_switch_to_host_common(struct kvm_vcpu *vcpu)

host_ctxt = &this_cpu_ptr(&kvm_host_data)->host_ctxt;
guest_ctxt = &vcpu->arch.ctxt;
- host_dbg = &vcpu->arch.host_debug_state.regs;
+ host_dbg = &this_cpu_ptr(&kvm_host_data)->host_debug_state.regs;
guest_dbg = kern_hyp_va(vcpu->arch.debug_ptr);

__debug_save_state(guest_dbg, guest_ctxt);
diff --git a/arch/arm64/kvm/hyp/nvhe/debug-sr.c b/arch/arm64/kvm/hyp/nvhe/debug-sr.c
index 4558c02eb352..8103f8c695b4 100644
--- a/arch/arm64/kvm/hyp/nvhe/debug-sr.c
+++ b/arch/arm64/kvm/hyp/nvhe/debug-sr.c
@@ -83,10 +83,10 @@ void __debug_save_host_buffers_nvhe(struct kvm_vcpu *vcpu)
{
/* Disable and flush SPE data generation */
if (vcpu_get_flag(vcpu, DEBUG_STATE_SAVE_SPE))
- __debug_save_spe(&vcpu->arch.host_debug_state.pmscr_el1);
+ __debug_save_spe(&this_cpu_ptr(&kvm_host_data)->host_debug_state.pmscr_el1);
/* Disable and flush Self-Hosted Trace generation */
if (vcpu_get_flag(vcpu, DEBUG_STATE_SAVE_TRBE))
- __debug_save_trace(&vcpu->arch.host_debug_state.trfcr_el1);
+ __debug_save_trace(&this_cpu_ptr(&kvm_host_data)->host_debug_state.trfcr_el1);
}

void __debug_switch_to_guest(struct kvm_vcpu *vcpu)
@@ -97,9 +97,9 @@ void __debug_switch_to_guest(struct kvm_vcpu *vcpu)
void __debug_restore_host_buffers_nvhe(struct kvm_vcpu *vcpu)
{
if (vcpu_get_flag(vcpu, DEBUG_STATE_SAVE_SPE))
- __debug_restore_spe(vcpu->arch.host_debug_state.pmscr_el1);
+ __debug_restore_spe(this_cpu_ptr(&kvm_host_data)->host_debug_state.pmscr_el1);
if (vcpu_get_flag(vcpu, DEBUG_STATE_SAVE_TRBE))
- __debug_restore_trace(vcpu->arch.host_debug_state.trfcr_el1);
+ __debug_restore_trace(this_cpu_ptr(&kvm_host_data)->host_debug_state.trfcr_el1);
}

void __debug_switch_to_host(struct kvm_vcpu *vcpu)
--
2.39.2


--
Without deviation from the norm, progress is not possible.