Re: [PATCH v3 1/3] KVM: x86: Deflect unknown MSR accesses to user space

From: Alexander Graf
Date: Mon Aug 03 2020 - 07:34:46 EST

On 03.08.20 13:27, Vitaly Kuznetsov wrote:
Alexander Graf <graf@xxxxxxxxxx> writes:

MSRs are weird. Some of them are normal control registers, such as EFER.
Some however are registers that really are model specific, not very
interesting to virtualization workloads, and not performance critical.
Others again are really just windows into package configuration.

Out of these MSRs, only the first category is necessary to implement in
kernel space. Rarely accessed MSRs, MSRs that should be fine tunes against
certain CPU models and MSRs that contain information on the package level
are much better suited for user space to process. However, over time we have
accumulated a lot of MSRs that are not the first category, but still handled
by in-kernel KVM code.

This patch adds a generic interface to handle WRMSR and RDMSR from user
space. With this, any future MSR that is part of the latter categories can
be handled in user space.

Furthermore, it allows us to replace the existing "ignore_msrs" logic with
something that applies per-VM rather than on the full system. That way you
can run productive VMs in parallel to experimental ones where you don't care
about proper MSR handling.

Signed-off-by: Alexander Graf <graf@xxxxxxxxxx>


v1 -> v2:

- deflect all #GP injection events to user space, not just unknown MSRs.
That was we can also deflect allowlist errors later
- fix emulator case

v2 -> v3:

- return r if r == X86EMUL_IO_NEEDED
- Use complete_userspace_io logic instead of reply field
- Simplify trapping code
Documentation/virt/kvm/api.rst | 62 +++++++++++++++++++
arch/x86/include/asm/kvm_host.h | 6 ++
arch/x86/kvm/emulate.c | 18 +++++-
arch/x86/kvm/x86.c | 106 ++++++++++++++++++++++++++++++--
include/trace/events/kvm.h | 2 +-
include/uapi/linux/kvm.h | 10 +++
6 files changed, 197 insertions(+), 7 deletions(-)

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index 320788f81a05..79c3e2fdfae4 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -5155,6 +5155,35 @@ Note that KVM does not skip the faulting instruction as it does for
KVM_EXIT_MMIO, but userspace has to emulate any change to the processing state
if it decides to decode and emulate the instruction.

+ struct {
+ __u8 error;
+ __u8 pad[3];
+ __u32 index;
+ __u64 data;
+ } msr;
+Used on x86 systems. When the VM capability KVM_CAP_X86_USER_SPACE_MSR is
+enabled, MSR accesses to registers that would invoke a #GP by KVM kernel code
+will instead trigger a KVM_EXIT_X86_RDMSR exit for reads and KVM_EXIT_X86_WRMSR
+exit for writes.
+For KVM_EXIT_X86_RDMSR, the "index" field tells user space which MSR the guest
+wants to read. To respond to this request with a successful read, user space
+writes the respective data into the "data" field and must continue guest
+execution to ensure the read data is transferred into guest register state.
+If the RDMSR request was unsuccessful, user space indicates that with a "1" in
+the "error" field. This will inject a #GP into the guest when the VCPU is
+executed again.
+For KVM_EXIT_X86_WRMSR, the "index" field tells user space which MSR the guest
+wants to write. Once finished processing the event, user space must continue
+vCPU execution. If the MSR write was unsuccessful, user space also sets the
+"error" field to "1".

/* Fix the size of the union. */
@@ -5844,6 +5873,28 @@ controlled by the kvm module parameter halt_poll_ns. This capability allows
the maximum halt time to specified on a per-VM basis, effectively overriding
the module parameter for the target VM.

+:Architectures: x86
+:Target: VM
+:Parameters: args[0] is 1 if user space MSR handling is enabled, 0 otherwise
+:Returns: 0 on success; -1 on error
+This capability enables trapping of #GP invoking RDMSR and WRMSR instructions
+into user space.
+When a guest requests to read or write an MSR, KVM may not implement all MSRs
+that are relevant to a respective system. It also does not differentiate by
+CPU type.
+To allow more fine grained control over MSR handling, user space may enable
+this capability. With it enabled, MSR accesses that would usually trigger
+a #GP event inside the guest by KVM will instead trigger KVM_EXIT_X86_RDMSR
+and KVM_EXIT_X86_WRMSR exit notifications which user space can then handle to
+implement model specific MSR handling and/or user notifications to inform
+a user that an MSR was not handled.
8. Other capabilities.

@@ -6151,3 +6202,14 @@ KVM can therefore start protected VMs.
This capability governs the KVM_S390_PV_COMMAND ioctl and the
guests when the state change is invalid.
+:Architectures: x86
+This capability indicates that KVM supports deflection of MSR reads and
+writes to user space. It can be enabled on a VM level. If enabled, MSR
+accesses that would usually trigger a #GP by KVM into the guest will
+instead get bounced to user space through the KVM_EXIT_X86_RDMSR and
+KVM_EXIT_X86_WRMSR exit notifications.
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index be5363b21540..809eed0dbdea 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -829,6 +829,9 @@ struct kvm_vcpu_arch {

/* AMD MSRC001_0015 Hardware Configuration */
u64 msr_hwcr;
+ /* User space is handling an MSR request */
+ bool pending_user_msr;

struct kvm_lpage_info {
@@ -1002,6 +1005,9 @@ struct kvm_arch {
bool guest_can_read_msr_platform_info;
bool exception_payload_enabled;

+ /* Deflect RDMSR and WRMSR to user space when they trigger a #GP */
+ bool user_space_msr_enabled;
struct kvm_pmu_event_filter *pmu_event_filter;
struct task_struct *nx_lpage_recovery_thread;
diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
index d0e2825ae617..744ab9c92b73 100644
--- a/arch/x86/kvm/emulate.c
+++ b/arch/x86/kvm/emulate.c
@@ -3689,11 +3689,18 @@ static int em_dr_write(struct x86_emulate_ctxt *ctxt)

static int em_wrmsr(struct x86_emulate_ctxt *ctxt)
+ u64 msr_index = reg_read(ctxt, VCPU_REGS_RCX);
u64 msr_data;
+ int r;

msr_data = (u32)reg_read(ctxt, VCPU_REGS_RAX)
| ((u64)reg_read(ctxt, VCPU_REGS_RDX) << 32);
- if (ctxt->ops->set_msr(ctxt, reg_read(ctxt, VCPU_REGS_RCX), msr_data))
+ r = ctxt->ops->set_msr(ctxt, msr_index, msr_data);
+ if (r == X86EMUL_IO_NEEDED)
+ return r;
+ if (r)
return emulate_gp(ctxt, 0);

@@ -3701,9 +3708,16 @@ static int em_wrmsr(struct x86_emulate_ctxt *ctxt)

static int em_rdmsr(struct x86_emulate_ctxt *ctxt)
+ u64 msr_index = reg_read(ctxt, VCPU_REGS_RCX);
u64 msr_data;
+ int r;
+ r = ctxt->ops->get_msr(ctxt, msr_index, &msr_data);
+ if (r == X86EMUL_IO_NEEDED)
+ return r;

- if (ctxt->ops->get_msr(ctxt, reg_read(ctxt, VCPU_REGS_RCX), &msr_data))
+ if (r)
return emulate_gp(ctxt, 0);

*reg_write(ctxt, VCPU_REGS_RAX) = (u32)msr_data;
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 88c593f83b28..24c72250f6df 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1549,12 +1549,75 @@ int kvm_set_msr(struct kvm_vcpu *vcpu, u32 index, u64 data)

+static int complete_emulated_msr(struct kvm_vcpu *vcpu, bool is_read)
+ BUG_ON(!vcpu->arch.pending_user_msr);
+ if (vcpu->run->msr.error) {
+ kvm_inject_gp(vcpu, 0);
+ } else if (is_read) {
+ kvm_rax_write(vcpu, (u32)vcpu->run->;
+ kvm_rdx_write(vcpu, vcpu->run-> >> 32);
+ }
+ return kvm_skip_emulated_instruction(vcpu);
+static int complete_emulated_rdmsr(struct kvm_vcpu *vcpu)
+ return complete_emulated_msr(vcpu, true);
+static int complete_emulated_wrmsr(struct kvm_vcpu *vcpu)
+ return complete_emulated_msr(vcpu, false);
+static int kvm_get_msr_user_space(struct kvm_vcpu *vcpu, u32 index)
+ if (!vcpu->kvm->arch.user_space_msr_enabled)
+ return 0;
+ vcpu->run->exit_reason = KVM_EXIT_X86_RDMSR;
+ vcpu->run->msr.error = 0;
+ vcpu->run->msr.index = index;
+ vcpu->arch.pending_user_msr = true;
+ vcpu->arch.complete_userspace_io = complete_emulated_rdmsr;
+ return 1;
+static int kvm_set_msr_user_space(struct kvm_vcpu *vcpu, u32 index, u64 data)
+ if (!vcpu->kvm->arch.user_space_msr_enabled)
+ return 0;
+ vcpu->run->exit_reason = KVM_EXIT_X86_WRMSR;
+ vcpu->run->msr.error = 0;
+ vcpu->run->msr.index = index;
+ vcpu->run-> = data;
+ vcpu->arch.pending_user_msr = true;
+ vcpu->arch.complete_userspace_io = complete_emulated_wrmsr;

I'm probably missing something but where do we reset
vcpu->arch.pending_user_msr? Shouldn't it be done in

It's even worse than that: We don't need it at all. I'll remove it for v4.


Amazon Development Center Germany GmbH
Krausenstr. 38
10117 Berlin
Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss
Eingetragen am Amtsgericht Charlottenburg unter HRB 149173 B
Sitz: Berlin
Ust-ID: DE 289 237 879