Re: KVM exit to userspace on WFI

From: Jan Henrik Weinstock
Date: Tue Oct 31 2023 - 15:21:36 EST


On Mon, 30 Oct 2023 at 13:36, Marc Zyngier <maz@xxxxxxxxxx> wrote:
>
> [please make an effort not to top-post]
>
> On Fri, 27 Oct 2023 18:41:44 +0100,
> Jan Henrik Weinstock <jan@xxxxxx> wrote:
> >
> > Hi Marc,
> >
> > the basic idea behind this is to have a (single-threaded) execution loop,
> > something like this:
> >
> > vcpu-thread: vcpu-run | process-io-devices | vcpu-run | process-io...
> >                       ^
> >                WFX or timeout
> >
> > We switch to simulating IO devices whenever the vcpu is idle (wfi) or exceeds
> > a certain budget of instructions (counted via pmu). Our fallback currently is
> > to kick the vcpu out of its execution using a signal (via a timeout/alarm). But
> > of course, if the cpu is stuck at a wfi, we are wasting a lot of time.
> >
> > I understand that the proposed behavior is not desirable for most use cases,
> > which is why I suggest locking it behind a flag, e.g.
> > KVM_ARCH_FLAG_WFX_EXIT_TO_USER.
>
> But how do you reconcile the fact that exposing this to userspace
> breaks fundamental expectations that the guest has, such as getting
> its timer interrupts and directly injected LPIs? Implementing WFI in
> userspace breaks it. What about the case where we don't trap WFx and
> let the *guest* wait for an interrupt?

Timer interrupts etc. will be injected into the vcpu during the
io-phases. When there are no interrupts present and the guest performs
a WFI, we can just skip forward to the next timer event.
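
A minimal sketch of the loop described above, assuming a purely simulated
timer queue. All names here (struct sim, vcpu_run, process_io_devices,
skip_to_next_timer) are hypothetical and only illustrate the idea; vcpu_run
is a stub standing in for the real KVM_RUN call.

```c
#include <stdint.h>

/* Hypothetical exit reasons for this sketch; not the KVM uapi values. */
enum exit_reason { EXIT_WFX, EXIT_BUDGET };

/* One pending timer event in simulated time (nanoseconds). */
struct sim_timer { uint64_t deadline_ns; int irq; };

struct sim {
    uint64_t now_ns;             /* current simulated time */
    struct sim_timer timers[4];  /* unsorted queue of pending timers */
    int ntimers;
    int pending_irq;             /* -1 when no interrupt is pending */
};

/* Stub standing in for KVM_RUN: the guest idles unless an IRQ is pending. */
static enum exit_reason vcpu_run(struct sim *s, uint64_t budget_ns)
{
    if (s->pending_irq >= 0) {
        s->pending_irq = -1;     /* guest handles the interrupt ... */
        s->now_ns += budget_ns;  /* ... and runs until its budget is used up */
        return EXIT_BUDGET;
    }
    return EXIT_WFX;             /* nothing to do: guest executed a WFI */
}

/* IO phase: raise the interrupt of the first timer that is due, if any. */
static void process_io_devices(struct sim *s)
{
    for (int i = 0; i < s->ntimers; i++) {
        if (s->timers[i].deadline_ns <= s->now_ns) {
            s->pending_irq = s->timers[i].irq;
            s->timers[i] = s->timers[--s->ntimers];  /* remove from queue */
            return;
        }
    }
}

/* On WFI with nothing pending, jump straight to the next timer deadline. */
static void skip_to_next_timer(struct sim *s)
{
    uint64_t next = UINT64_MAX;

    for (int i = 0; i < s->ntimers; i++)
        if (s->timers[i].deadline_ns < next)
            next = s->timers[i].deadline_ns;
    if (next != UINT64_MAX && next > s->now_ns)
        s->now_ns = next;
}

/* Alternate vcpu-run and IO phases; returns the final simulated time. */
static uint64_t run_loop(struct sim *s, int iterations)
{
    for (int i = 0; i < iterations; i++) {
        enum exit_reason why = vcpu_run(s, 1000);

        if (why == EXIT_WFX && s->pending_irq < 0)
            skip_to_next_timer(s);
        process_io_devices(s);
    }
    return s->now_ns;
}
```

The point of the sketch is that simulated time never waits on wall-clock
time: an idle guest costs one exit and one jump to the next deadline.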

> Honestly, what you are describing seems to be a use model that doesn't
> fit KVM, which is a general purpose hypervisor, but more a simulation
> environment. Yes, the primitives are the same, but the plumbing is
> wildly different.

Agreed.

> *If* that's the stuff you're looking at, then I'm afraid you'll have
> to do it in different way, because what you are suggesting is
> fundamentally incompatible with the guarantees that KVM gives to guest
> and userspace. Because your KVM_ARCH_FLAG_WFX_EXIT_TO_USER is really a
> lie. It should really be named something more along the lines of
> KVM_ARCH_FLAG_WFX_EXIT_TO_USER_SOMETIME_AND_I_DONT_EVEN_KNOW_WHEN
> (probably with additional clauses related to breaking things).

I have attached a reworked version of the patch as a reference (based
on my 5.15 kernel). It puts the modified behavior behind a new
capability so that it does not interfere with current expectations
about WFI/WFE handling.
I think it should now trap all blocking WFx instructions on the vcpu and
reliably return to userspace. If I have missed something that
would cause the vcpu not to trap on a WFI, kindly let me know.
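
As an illustration of how userspace might consume the new exit, here is a
small sketch that classifies the ESR value reported in run->wfx.esr. The
KVM_EXIT_WFX value is copied from the attached patch and is not part of
mainline headers; the ESR_ELx_* values mirror arch/arm64/include/asm/esr.h.
The enum and the function name classify_wfx_exit are hypothetical.

```c
#include <stdint.h>

/* Value taken from the attached patch; not a mainline uapi definition. */
#define KVM_EXIT_WFX        35

/* ESR_ELx fields, mirroring arch/arm64/include/asm/esr.h. */
#define ESR_ELx_EC_SHIFT    26
#define ESR_ELx_EC_MASK     (UINT64_C(0x3f) << ESR_ELx_EC_SHIFT)
#define ESR_ELx_EC_WFx      UINT64_C(0x01)
#define ESR_ELx_WFx_ISS_WFE (UINT64_C(1) << 0)

/* Hypothetical result type for this sketch. */
enum wfx_kind { WFX_NONE, WFX_WFI, WFX_WFE };

/*
 * Decide whether an exit was caused by WFI or WFE.
 * exit_reason corresponds to run->exit_reason, esr to run->wfx.esr.
 */
static enum wfx_kind classify_wfx_exit(uint32_t exit_reason, uint64_t esr)
{
    if (exit_reason != KVM_EXIT_WFX)
        return WFX_NONE;

    /* Sanity check: the exception class must be "trapped WFI/WFE". */
    if (((esr & ESR_ELx_EC_MASK) >> ESR_ELx_EC_SHIFT) != ESR_ELx_EC_WFx)
        return WFX_NONE;

    /* ISS bit 0 distinguishes WFE (1) from WFI (0). */
    return (esr & ESR_ELx_WFx_ISS_WFE) ? WFX_WFE : WFX_WFI;
}
```

With the patch applied, the capability would presumably be enabled once per
VM via the KVM_ENABLE_CAP ioctl on the VM file descriptor with
cap = KVM_CAP_EXIT_ON_WFX, since the patch hooks kvm_vm_ioctl_enable_cap().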

> Overall, you are still asking for something that is not guaranteed at
> the architecture level, even less in KVM, and I'm not going to add
> support for something that can only work "sometime".

I am not quite sure what you mean by "sometime". Are you referring
to WFIs as NOPs? Or WFIs that do not yield because of pending
interrupts?

The point of my patch is not to accurately count every single WFI. The
point is to prevent the host cpu from sleeping just because my vcpu
executed a WFI somewhere in the guest software. If a WFI is executed
by the guest and that does not cause my vcpu thread to block (in
other words: the vcpu continues executing instructions beyond the WFI),
then it also should not exit to userspace. So instead of
"KVM_ARCH_FLAG_WFX_EXIT_TO_USER_SOMETIME_AND_I_DONT_EVEN_KNOW_WHEN" it
is really "KVM_ARCH_FLAG_WFX_EXIT_TO_USER_WHENEVER_YOU_WOULD_OTHERWISE_YIELD_AND_I_CANNOT_GET_MY_THREAD_BACK".
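
For comparison, the signal-based fallback mentioned earlier in the thread
can be sketched as follows. This is only an illustration under stated
assumptions: nanosleep() stands in for a KVM_RUN call whose vcpu is blocked
in a WFI, and the function name blocked_ms_until_kick is hypothetical. It
shows that the thread stays blocked for the whole timeout before the signal
returns control.

```c
#define _XOPEN_SOURCE 700  /* for sigaction, setitimer, clock_gettime */
#include <errno.h>
#include <signal.h>
#include <string.h>
#include <sys/time.h>
#include <time.h>

/* Empty handler: its only purpose is to make the blocking call fail with EINTR. */
static void on_alarm(int sig)
{
    (void)sig;
}

/*
 * Block like an idle vcpu thread would and wait to be kicked by SIGALRM.
 * Returns the number of milliseconds spent blocked, or -1 on error.
 */
static long blocked_ms_until_kick(long timeout_ms)
{
    struct sigaction sa;
    struct itimerval kick;
    struct timespec start, end;
    struct timespec idle = { .tv_sec = 10, .tv_nsec = 0 };
    int rc, err;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = on_alarm;  /* no SA_RESTART, so the sleep is interrupted */
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGALRM, &sa, NULL) != 0)
        return -1;

    memset(&kick, 0, sizeof(kick));
    kick.it_value.tv_sec = timeout_ms / 1000;
    kick.it_value.tv_usec = (timeout_ms % 1000) * 1000;

    clock_gettime(CLOCK_MONOTONIC, &start);
    if (setitimer(ITIMER_REAL, &kick, NULL) != 0)
        return -1;

    rc = nanosleep(&idle, NULL);  /* stand-in for KVM_RUN blocked in a WFI */
    err = errno;
    clock_gettime(CLOCK_MONOTONIC, &end);

    if (rc == 0 || err != EINTR)
        return -1;                /* we were not kicked by the signal */

    return (end.tv_sec - start.tv_sec) * 1000 +
           (end.tv_nsec - start.tv_nsec) / 1000000;
}
```

The time returned here is host time in which the thread did no useful work,
which is the cost an exit on blocking WFx is meant to avoid.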

> M.
>
> --
> Without deviation from the norm, progress is not possible.



--
Dr.-Ing. Jan Henrik Weinstock
Managing Director

MachineWare GmbH | www.machineware.de
Hühnermarkt 19, 52062 Aachen, Germany
Amtsgericht Aachen HRB25734

Managing Directors
Lukas Jünger
Dr.-Ing. Jan Henrik Weinstock
diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
index fc6ee6c59..c3107506b 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -136,6 +136,9 @@ struct kvm_arch {

 	/* Memory Tagging Extension enabled for the guest */
 	bool mte_enabled;
+
+	/* Exit on WFI/WFE */
+	bool exit_on_wfx;
 };

 struct kvm_vcpu_fault_info {
diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index f181527f9..6d54dfbae 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -101,6 +101,10 @@ int kvm_vm_ioctl_enable_cap(struct kvm *kvm,
 		}
 		mutex_unlock(&kvm->lock);
 		break;
+	case KVM_CAP_EXIT_ON_WFX:
+		r = 0;
+		kvm->arch.exit_on_wfx = true;
+		break;
 	default:
 		r = -EINVAL;
 		break;
@@ -215,6 +219,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
 	case KVM_CAP_SET_GUEST_DEBUG:
 	case KVM_CAP_VCPU_ATTRIBUTES:
 	case KVM_CAP_PTP_KVM:
+	case KVM_CAP_EXIT_ON_WFX:
 		r = 1;
 		break;
 	case KVM_CAP_SET_GUEST_DEBUG2:
@@ -394,8 +399,10 @@ void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
 {
 	struct kvm_s2_mmu *mmu;
 	int *last_ran;
+	bool exit_on_wfx;

 	mmu = vcpu->arch.hw_mmu;
+	exit_on_wfx = vcpu->kvm->arch.exit_on_wfx;
 	last_ran = this_cpu_ptr(mmu->last_vcpu_ran);

 	/*
@@ -423,7 +430,7 @@ void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
 	if (kvm_arm_is_pvtime_enabled(&vcpu->arch))
 		kvm_make_request(KVM_REQ_RECORD_STEAL, vcpu);

-	if (single_task_running())
+	if (single_task_running() && !exit_on_wfx)
 		vcpu_clear_wfx_traps(vcpu);
 	else
 		vcpu_set_wfx_traps(vcpu);
diff --git a/arch/arm64/kvm/handle_exit.c b/arch/arm64/kvm/handle_exit.c
index a5ab52150..80fa6bdef 100644
--- a/arch/arm64/kvm/handle_exit.c
+++ b/arch/arm64/kvm/handle_exit.c
@@ -91,10 +91,21 @@ static int kvm_handle_wfx(struct kvm_vcpu *vcpu)
 	if (kvm_vcpu_get_esr(vcpu) & ESR_ELx_WFx_ISS_WFE) {
 		trace_kvm_wfx_arm64(*vcpu_pc(vcpu), true);
 		vcpu->stat.wfe_exit_stat++;
+		if (vcpu->kvm->arch.exit_on_wfx) {
+			vcpu->run->exit_reason = KVM_EXIT_WFX;
+			vcpu->run->wfx.esr = kvm_vcpu_get_esr(vcpu);
+			return 0;
+		}
+
 		kvm_vcpu_on_spin(vcpu, vcpu_mode_priv(vcpu));
 	} else {
 		trace_kvm_wfx_arm64(*vcpu_pc(vcpu), false);
 		vcpu->stat.wfi_exit_stat++;
+		if (vcpu->kvm->arch.exit_on_wfx) {
+			vcpu->run->exit_reason = KVM_EXIT_WFX;
+			vcpu->run->wfx.esr = kvm_vcpu_get_esr(vcpu);
+			return 0;
+		}
 		kvm_vcpu_block(vcpu);
 		kvm_clear_request(KVM_REQ_UNHALT, vcpu);
 	}
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 0d47e07f4..155dc7eab 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -269,6 +269,7 @@ struct kvm_xen_exit {
 #define KVM_EXIT_AP_RESET_HOLD 32
 #define KVM_EXIT_X86_BUS_LOCK 33
 #define KVM_EXIT_XEN 34
+#define KVM_EXIT_WFX 35

 /* For KVM_EXIT_INTERNAL_ERROR */
 /* Emulate instruction failed. */
@@ -469,6 +470,11 @@ struct kvm_run {
 		} msr;
 		/* KVM_EXIT_XEN */
 		struct kvm_xen_exit xen;
+		/* KVM_EXIT_WFX */
+		struct {
+			__u64 esr;
+		} wfx;
+
 		/* Fix the size of the union. */
 		char padding[256];
 	};
@@ -1123,6 +1129,7 @@ struct kvm_ppc_resize_hpt {
 #define KVM_CAP_XSAVE2 208
 #define KVM_CAP_SYS_ATTRIBUTES 209
 #define KVM_CAP_S390_MEM_OP_EXTENSION 211
+#define KVM_CAP_EXIT_ON_WFX 222

 #ifdef KVM_CAP_IRQ_ROUTING