Re: [PATCH v2] kvm: nVMX: Introduce KVM_CAP_STATE
From: Raslan, KarimAllah
Date: Tue Apr 10 2018 - 03:37:30 EST
On Mon, 2018-04-09 at 13:26 +0200, David Hildenbrand wrote:
> On 09.04.2018 10:37, KarimAllah Ahmed wrote:
> >
> > From: Jim Mattson <jmattson@xxxxxxxxxx>
> >
> > For nested virtualization L0 KVM is managing a bit of state for L2 guests,
> > this state can not be captured through the currently available IOCTLs. In
> > fact the state captured through all of these IOCTLs is usually a mix of L1
> > and L2 state. It is also dependent on whether the L2 guest was running at
> > the moment when the process was interrupted to save its state.
> >
> > With this capability, there are two new vcpu ioctls: KVM_GET_VMX_STATE and
> > KVM_SET_VMX_STATE. These can be used for saving and restoring a VM that is
> > in VMX operation.
> >
>
> Very nice work!
>
> >
> >
> > +static int get_vmcs_cache(struct kvm_vcpu *vcpu,
> > + struct kvm_state __user *user_kvm_state)
> > +{
> > + struct vcpu_vmx *vmx = to_vmx(vcpu);
> > + struct vmcs12 *vmcs12 = get_vmcs12(vcpu);
> > +
> > + /*
> > + * When running L2, the authoritative vmcs12 state is in the
> > + * vmcs02. When running L1, the authoritative vmcs12 state is
> > + * in the shadow vmcs linked to vmcs01, unless
> > + * sync_shadow_vmcs is set, in which case, the authoritative
> > + * vmcs12 state is in the vmcs12 already.
> > + */
> > + if (is_guest_mode(vcpu))
> > + sync_vmcs12(vcpu, vmcs12);
> > + else if (enable_shadow_vmcs && !vmx->nested.sync_shadow_vmcs)
> > + copy_shadow_to_vmcs12(vmx);
> > +
> > + if (copy_to_user(user_kvm_state->data, vmcs12, sizeof(*vmcs12)))
> > + return -EFAULT;
> > +
> > + /*
> > + * Force a nested exit that guarantees that any state capture
> > + * afterwards by any IOCTLs (MSRs, etc) will not capture a mix of L1
> > + * and L2 state.
> > + *
>
> I totally understand why this is nice, I am worried about the
> implications. Let's assume migration fails and we want to continue
> running the guest on the source. We would now have a "bad" state.
>
> How is this to be handled (e.g. is a SET_STATE necessary?)? I think this
> implication should be documented for KVM_GET_STATE.
Yup, a SET_STATE will be needed. That being said, I guess I will do
what Jim mentioned and just fix the issue outlined here, and then I can
remove this VMExit.
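
For completeness, the userspace flow I have in mind is roughly the
following (only a sketch, assuming the ioctl numbers and struct
kvm_state layout from this patch are in the installed uapi headers;
the helper name is made up and error handling is trimmed):

#include <errno.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

/*
 * Size-negotiate KVM_GET_VMX_STATE. The returned buffer can later be
 * handed back via KVM_SET_VMX_STATE, e.g. when a migration is aborted
 * and the guest has to keep running on the source.
 */
static struct kvm_state *save_vmx_state(int vcpu_fd)
{
        struct kvm_state *state = calloc(1, sizeof(*state));

        state->size = sizeof(*state);
        if (ioctl(vcpu_fd, KVM_GET_VMX_STATE, state) < 0) {
                if (errno != E2BIG)
                        goto err;
                /* the kernel wrote the required size before -E2BIG */
                state = realloc(state, state->size);
                if (ioctl(vcpu_fd, KVM_GET_VMX_STATE, state) < 0)
                        goto err;
        }
        return state;
err:
        free(state);
        return NULL;
}

/*
 * On a failed migration the same buffer is simply pushed back:
 *
 *         ioctl(vcpu_fd, KVM_SET_VMX_STATE, state);
 */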
> >
> > + * One example where that would lead to an issue is the TSC DEADLINE
> > + * MSR vs the guest TSC. If the L2 guest is running, the guest TSC will
> > + * be the L2 TSC while the TSC deadline MSR will contain the L1 TSC
> > + * deadline MSR. That would lead to a very large (and wrong) "expire"
> > + * diff when LAPIC is initialized during instance restore (i.e. the
> > + * instance will appear to have hung!).
> > + */
> > + if (is_guest_mode(vcpu))
> > + nested_vmx_vmexit(vcpu, -1, 0, 0);
> > +
> > + return 0;
> > +}
> > +
> > +static int get_vmx_state(struct kvm_vcpu *vcpu,
> > + struct kvm_state __user *user_kvm_state)
> > +{
> > + u32 user_data_size;
> > + struct vcpu_vmx *vmx = to_vmx(vcpu);
> > + struct kvm_state kvm_state = {
> > + .flags = 0,
> > + .format = 0,
> > + .size = sizeof(kvm_state),
> > + .vmx.vmxon_pa = -1ull,
> > + .vmx.vmcs_pa = -1ull,
> > + };
> > +
> > + if (copy_from_user(&user_data_size, &user_kvm_state->size,
> > + sizeof(user_data_size)))
> > + return -EFAULT;
> > +
> > + if (nested_vmx_allowed(vcpu) && vmx->nested.vmxon) {
> > + kvm_state.vmx.vmxon_pa = vmx->nested.vmxon_ptr;
> > + kvm_state.vmx.vmcs_pa = vmx->nested.current_vmptr;
> > +
> > + if (vmx->nested.current_vmptr != -1ull)
> > + kvm_state.size += VMCS12_SIZE;
> > +
> > + if (is_guest_mode(vcpu)) {
> > + kvm_state.flags |= KVM_STATE_GUEST_MODE;
> > +
> > + if (vmx->nested.nested_run_pending)
> > + kvm_state.flags |= KVM_STATE_RUN_PENDING;
> > + }
> > + }
> > +
> > + if (user_data_size < kvm_state.size) {
> > + if (copy_to_user(&user_kvm_state->size, &kvm_state.size,
> > + sizeof(kvm_state.size)))
> > + return -EFAULT;
> > + return -E2BIG;
> > + }
> > +
> > + if (copy_to_user(user_kvm_state, &kvm_state, sizeof(kvm_state)))
> > + return -EFAULT;
> > +
> > + if (vmx->nested.current_vmptr == -1ull)
> > + return 0;
> > +
> > + return get_vmcs_cache(vcpu, user_kvm_state);
> > +}
> > +
> > +static int set_vmcs_cache(struct kvm_vcpu *vcpu,
> > + struct kvm_state __user *user_kvm_state,
> > + struct kvm_state *kvm_state)
> > +
> > +{
> > + struct vcpu_vmx *vmx = to_vmx(vcpu);
> > + struct vmcs12 *vmcs12 = get_vmcs12(vcpu);
> > + u32 exit_qual;
> > + int ret;
> > +
> > + if ((kvm_state->size < (sizeof(*vmcs12) + sizeof(*kvm_state))) ||
> > + kvm_state->vmx.vmcs_pa == kvm_state->vmx.vmxon_pa ||
> > + !page_address_valid(vcpu, kvm_state->vmx.vmcs_pa))
> > + return -EINVAL;
> > +
> > + if (copy_from_user(vmcs12, user_kvm_state->data, sizeof(*vmcs12)))
> > + return -EFAULT;
> > +
> > + if (vmcs12->revision_id != VMCS12_REVISION)
> > + return -EINVAL;
> > +
> > + set_current_vmptr(vmx, kvm_state->vmx.vmcs_pa);
> > +
> > + if (!(kvm_state->flags & KVM_STATE_GUEST_MODE))
> > + return 0;
> > +
> > + if (check_vmentry_prereqs(vcpu, vmcs12) ||
> > + check_vmentry_postreqs(vcpu, vmcs12, &exit_qual))
> > + return -EINVAL;
> > +
> > + ret = enter_vmx_non_root_mode(vcpu, true);
> > + if (ret)
> > + return ret;
> > +
> > + /*
> > + * This request will result in a call to
> > + * nested_get_vmcs12_pages before the next VM-entry.
> > + */
> > + kvm_make_request(KVM_REQ_GET_VMCS12_PAGES, vcpu);
>
> Can you elaborate (+document) why this is needed instead of trying to
> get the page right away?
Because at this point the MMU is not yet initialized to point at the
right entities, and "get pages" would need to read data from the
guest (i.e. we would need to perform a gpa to hpa translation).
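
In other words, the pages are only resolved on the next vcpu_run, once
the MMU is in a usable state again. Conceptually something like this
(a sketch only; the helper name is made up, and the real handling has
to be routed through vmx rather than called like this):

/*
 * Consume the deferred request right before the next VM-entry, at a
 * point where gpa -> hpa translations through the guest are possible.
 */
static void service_get_vmcs12_pages(struct kvm_vcpu *vcpu)
{
        if (kvm_check_request(KVM_REQ_GET_VMCS12_PAGES, vcpu))
                nested_get_vmcs12_pages(vcpu, get_vmcs12(vcpu));
}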
Ack! will update the comment in v3.
Thank you!
>
> Thanks!
>