Re: [PATCH v5 3/3] KVM: nVMX: Fix mmu context after VMLAUNCH/VMRESUME failure

From: Wanpeng Li
Date: Wed Nov 08 2017 - 19:37:20 EST


2017-11-09 5:47 GMT+08:00 Jim Mattson <jmattson@xxxxxxxxxx>:
> I realize now that there are actually many other problems with
> deferring some control field checks to the hardware VM-entry of
> vmcs02. When there is an invalid control field, the vCPU should just
> fall through to the next instruction, without any state modifiation
> other than the ALU flags and the VM-instruction error field of the
> current VMCS. However, in preparation for the hardware VM-entry of
> vmcs02, we have already changed quite a bit of the vCPU state: the
> MSRs on the VM-entry MSR-load list, DR7, IA32_DEBUGCTL, the entire
> FLAGS register, etc. All of these changes should be undone, and we're
> not prepared to do that. (For instance, what was the old DR7 value
> that needs to be restored?)

I didn't observe real issue currently, and I hope this patchset can
catch the upcoming merge window. Then we can dig more into your
concern.

Regards,
Wanpeng Li

>
> On Fri, Nov 3, 2017 at 5:07 PM, Krish Sadhukhan
> <krish.sadhukhan@xxxxxxxxxx> wrote:
>> On 11/02/2017 05:50 PM, Wanpeng Li wrote:
>>
>>> From: Wanpeng Li <wanpeng.li@xxxxxxxxxxx>
>>>
>>> Commit 4f350c6dbcb (kvm: nVMX: Handle deferred early VMLAUNCH/VMRESUME
>>> failure
>>> properly) can result in L1(run kvm-unit-tests/run_tests.sh vmx_controls in
>>> L1)
>>> null pointer deference and also L0 calltrace when EPT=0 on both L0 and L1.
>>>
>>> In L1:
>>>
>>> BUG: unable to handle kernel paging request at ffffffffc015bf8f
>>> IP: vmx_vcpu_run+0x202/0x510 [kvm_intel]
>>> PGD 146e13067 P4D 146e13067 PUD 146e15067 PMD 3d2686067 PTE 3d4af9161
>>> Oops: 0003 [#1] PREEMPT SMP
>>> CPU: 2 PID: 1798 Comm: qemu-system-x86 Not tainted 4.14.0-rc4+ #6
>>> RIP: 0010:vmx_vcpu_run+0x202/0x510 [kvm_intel]
>>> Call Trace:
>>> WARNING: kernel stack frame pointer at ffffb86f4988bc18 in
>>> qemu-system-x86:1798 has bad value 0000000000000002
>>>
>>> In L0:
>>>
>>> -----------[ cut here ]------------
>>> WARNING: CPU: 6 PID: 4460 at /home/kernel/linux/arch/x86/kvm//vmx.c:9845
>>> vmx_inject_page_fault_nested+0x130/0x140 [kvm_intel]
>>> CPU: 6 PID: 4460 Comm: qemu-system-x86 Tainted: G OE
>>> 4.14.0-rc7+ #25
>>> RIP: 0010:vmx_inject_page_fault_nested+0x130/0x140 [kvm_intel]
>>> Call Trace:
>>> paging64_page_fault+0x500/0xde0 [kvm]
>>> ? paging32_gva_to_gpa_nested+0x120/0x120 [kvm]
>>> ? nonpaging_page_fault+0x3b0/0x3b0 [kvm]
>>> ? __asan_storeN+0x12/0x20
>>> ? paging64_gva_to_gpa+0xb0/0x120 [kvm]
>>> ? paging64_walk_addr_generic+0x11a0/0x11a0 [kvm]
>>> ? lock_acquire+0x2c0/0x2c0
>>> ? vmx_read_guest_seg_ar+0x97/0x100 [kvm_intel]
>>> ? vmx_get_segment+0x2a6/0x310 [kvm_intel]
>>> ? sched_clock+0x1f/0x30
>>> ? check_chain_key+0x137/0x1e0
>>> ? __lock_acquire+0x83c/0x2420
>>> ? kvm_multiple_exception+0xf2/0x220 [kvm]
>>> ? debug_check_no_locks_freed+0x240/0x240
>>> ? debug_smp_processor_id+0x17/0x20
>>> ? __lock_is_held+0x9e/0x100
>>> kvm_mmu_page_fault+0x90/0x180 [kvm]
>>> kvm_handle_page_fault+0x15c/0x310 [kvm]
>>> ? __lock_is_held+0x9e/0x100
>>> handle_exception+0x3c7/0x4d0 [kvm_intel]
>>> vmx_handle_exit+0x103/0x1010 [kvm_intel]
>>> ? kvm_arch_vcpu_ioctl_run+0x1628/0x2e20 [kvm]
>>>
>>> The commit avoids to load host state of vmcs12 as vmcs01's guest state
>>> since vmcs12 is not modified (except for the VM-instruction error field)
>>> if the checking of vmcs control area fails. However, the mmu context is
>>> switched to nested mmu in prepare_vmcs02() and it will not be reloaded
>>> since load_vmcs12_host_state() is skipped when nested VMLAUNCH/VMRESUME
>>> fails. This patch fixes it by reloading mmu context when nested
>>> VMLAUNCH/VMRESUME fails.
>>>
>>> Cc: Paolo Bonzini <pbonzini@xxxxxxxxxx>
>>> Cc: Radim KrÄmÃÅ <rkrcmar@xxxxxxxxxx>
>>> Cc: Jim Mattson <jmattson@xxxxxxxxxx>
>>> Signed-off-by: Wanpeng Li <wanpeng.li@xxxxxxxxxxx>
>>> ---
>>> v3 -> v4:
>>> * move it to a new function load_vmcs12_mmu_host_state
>>>
>>> arch/x86/kvm/vmx.c | 34 ++++++++++++++++++++++------------
>>> 1 file changed, 22 insertions(+), 12 deletions(-)
>>>
>>> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
>>> index 6cf3972..8aefb91 100644
>>> --- a/arch/x86/kvm/vmx.c
>>> +++ b/arch/x86/kvm/vmx.c
>>> @@ -11259,6 +11259,24 @@ static void prepare_vmcs12(struct kvm_vcpu *vcpu,
>>> struct vmcs12 *vmcs12,
>>> kvm_clear_interrupt_queue(vcpu);
>>> }
>>> +static void load_vmcs12_mmu_host_state(struct kvm_vcpu *vcpu,
>>> + struct vmcs12 *vmcs12)
>>> +{
>>> + u32 entry_failure_code;
>>> +
>>> + nested_ept_uninit_mmu_context(vcpu);
>>> +
>>> + /*
>>> + * Only PDPTE load can fail as the value of cr3 was checked on
>>> entry and
>>> + * couldn't have changed.
>>> + */
>>> + if (nested_vmx_load_cr3(vcpu, vmcs12->host_cr3, false,
>>> &entry_failure_code))
>>> + nested_vmx_abort(vcpu, VMX_ABORT_LOAD_HOST_PDPTE_FAIL);
>>> +
>>> + if (!enable_ept)
>>> + vcpu->arch.walk_mmu->inject_page_fault =
>>> kvm_inject_page_fault;
>>> +}
>>> +
>>> /*
>>> * A part of what we need to when the nested L2 guest exits and we want
>>> to
>>> * run its L1 parent, is to reset L1's guest state to the host state
>>> specified
>>> @@ -11272,7 +11290,6 @@ static void load_vmcs12_host_state(struct kvm_vcpu
>>> *vcpu,
>>> struct vmcs12 *vmcs12)
>>> {
>>> struct kvm_segment seg;
>>> - u32 entry_failure_code;
>>> if (vmcs12->vm_exit_controls & VM_EXIT_LOAD_IA32_EFER)
>>> vcpu->arch.efer = vmcs12->host_ia32_efer;
>>> @@ -11299,17 +11316,7 @@ static void load_vmcs12_host_state(struct
>>> kvm_vcpu *vcpu,
>>> vcpu->arch.cr4_guest_owned_bits =
>>> ~vmcs_readl(CR4_GUEST_HOST_MASK);
>>> vmx_set_cr4(vcpu, vmcs12->host_cr4);
>>> - nested_ept_uninit_mmu_context(vcpu);
>>> -
>>> - /*
>>> - * Only PDPTE load can fail as the value of cr3 was checked on
>>> entry and
>>> - * couldn't have changed.
>>> - */
>>> - if (nested_vmx_load_cr3(vcpu, vmcs12->host_cr3, false,
>>> &entry_failure_code))
>>> - nested_vmx_abort(vcpu, VMX_ABORT_LOAD_HOST_PDPTE_FAIL);
>>> -
>>> - if (!enable_ept)
>>> - vcpu->arch.walk_mmu->inject_page_fault =
>>> kvm_inject_page_fault;
>>> + load_vmcs12_mmu_host_state(vcpu, vmcs12);
>>> if (enable_vpid) {
>>> /*
>>> @@ -11539,6 +11546,9 @@ static void nested_vmx_vmexit(struct kvm_vcpu
>>> *vcpu, u32 exit_reason,
>>> * accordingly.
>>> */
>>> nested_vmx_failValid(vcpu, VMXERR_ENTRY_INVALID_CONTROL_FIELD);
>>> +
>>> + load_vmcs12_mmu_host_state(vcpu, vmcs12);
>>> +
>>> /*
>>> * The emulated instruction was already skipped in
>>> * nested_vmx_run, but the updated RIP was never
>>
>> Reviewed-by: Krish Sadhukhan <krish.sadhukhan@xxxxxxxxxx>