Re: [PATCH v2 0/3] KVM: VMX: Fix for kexec VMCLEAR and VMXON cleanup

From: Baoquan He
Date: Tue Apr 07 2020 - 07:01:34 EST


On 03/21/20 at 12:37pm, Sean Christopherson wrote:
> Patch 1 fixes a a theoretical bug where a crashdump NMI that arrives
> while KVM is messing with the percpu VMCS list would result in one or more
> VMCSes not being cleared, potentially causing memory corruption in the new
> kexec'd kernel.

I am wondering if this theoretical bug really exists. Now CKI of Redhat
reported crash dumping failed on below commit. I reserved a
intel machine and can reproduce it always.

Commit: 5364abc57993 - Merge tag 'arc-5.7-rc1' of

>From failure trace, it's the kvm vmx which caused this. I have reverted
them and the failure disappeared.

4f6ea0a87608 KVM: VMX: Gracefully handle faults on VMXON
d260f9ef50c7 KVM: VMX: Fold loaded_vmcs_init() into alloc_loaded_vmcs()
31603d4fc2bb KVM: VMX: Always VMCLEAR in-use VMCSes during crash with kexec support

The trace is here.

[ 132.476387] sysrq: Trigger a crash
[ 132.479829] Kernel panic - not syncing: sysrq triggered crash
[ 132.480817] CPU: 4 PID: 1975 Comm: runtest.sh Kdump: loaded Not tainted 5.6.0-5364abc.cki #1
[ 132.480817] Hardware name: Dell Inc. Precision R7610/, BIOS A08 11/21/2014
[ 132.480817] Call Trace:
[ 132.480817] dump_stack+0x66/0x90
[ 132.480817] panic+0x101/0x2e3
[ 132.480817] ? printk+0x58/0x6f
[ 132.480817] sysrq_handle_crash+0x11/0x20
[ 132.480817] __handle_sysrq.cold+0x48/0x104
[ 132.480817] write_sysrq_trigger+0x24/0x40
[ 132.480817] proc_reg_write+0x3c/0x60
[ 132.480817] vfs_write+0xb6/0x1a0
[ 132.480817] ksys_write+0x5f/0xe0
[ 132.480817] do_syscall_64+0x5b/0x1c0
[ 132.480817] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 132.480817] RIP: 0033:0x7f205575d4b7
[ 132.480817] Code: 64 89 02 48 c7 c0 ff ff ff ff eb bb 0f 1f 80 00 00 00 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89 74 24
[ 132.480817] RSP: 002b:00007ffe6ab44a38 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[ 132.480817] RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007f205575d4b7
[ 132.480817] RDX: 0000000000000002 RSI: 000055e2876e1da0 RDI: 0000000000000001
[ 132.480817] RBP: 000055e2876e1da0 R08: 000000000000000a R09: 0000000000000001
[ 132.480817] R10: 000055e2879bf930 R11: 0000000000000246 R12: 0000000000000002
[ 132.480817] R13: 00007f205582e500 R14: 0000000000000002 R15: 00007f205582e700
[ 132.480817] BUG: unable to handle page fault for address: ffffffffffffffc8
[ 132.480817] #PF: supervisor read access in kernel mode
[ 132.480817] #PF: error_code(0x0000) - not-present page
[ 132.480817] PGD 14460f067 P4D 14460f067 PUD 144611067 PMD 0
[ 132.480817] Oops: 0000 [#12] SMP PTI
[ 132.480817] CPU: 4 PID: 1975 Comm: runtest.sh Kdump: loaded Tainted: G D 5.6.0-5364abc.cki #1
[ 132.480817] Hardware name: Dell Inc. Precision R7610/, BIOS A08 11/21/2014
[ 132.480817] RIP: 0010:crash_vmclear_local_loaded_vmcss+0x57/0xd0 [kvm_intel]
[ 132.480817] Code: c7 c5 40 e0 02 00 4a 8b 04 f5 80 69 42 85 48 8b 14 28 48 01 e8 48 39 c2 74 4d 4c 8d 6a c8 bb 00 00 00 80 49 c7 c4 00 00 00 80 <49> 8b 7d 00 48 89 fe 48 01 de 72 5c 4c 89 e0 48 2b 05 13 a8 88 c4
[ 132.480817] RSP: 0018:ffffa85d435dfcb0 EFLAGS: 00010007
[ 132.480817] RAX: ffff9c82ebb2e040 RBX: 0000000080000000 RCX: 000000000000080f
[ 132.480817] RDX: 0000000000000000 RSI: 00000000000000ff RDI: 000000000000080f
[ 132.480817] RBP: 000000000002e040 R08: 000000717702c90c R09: 0000000000000004
[ 132.480817] R10: 0000000000000009 R11: 0000000000000000 R12: ffffffff80000000
[ 132.480817] R13: ffffffffffffffc8 R14: 0000000000000004 R15: 0000000000000000
[ 132.480817] FS: 00007f2055668740(0000) GS:ffff9c82ebb00000(0000) knlGS:0000000000000000
[ 132.480817] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 132.480817] CR2: ffffffffffffffc8 CR3: 0000000151904003 CR4: 00000000000606e0
[ 132.480817] Call Trace:
[ 132.480817] native_machine_crash_shutdown+0x45/0x190
[ 132.480817] __crash_kexec+0x61/0x130
[ 132.480817] ? __crash_kexec+0x9a/0x130
[ 132.480817] ? panic+0x11d/0x2e3
[ 132.480817] ? printk+0x58/0x6f
[ 132.480817] ? sysrq_handle_crash+0x11/0x20
[ 132.480817] ? __handle_sysrq.cold+0x48/0x104
[ 132.480817] ? write_sysrq_trigger+0x24/0x40
[ 132.480817] ? proc_reg_write+0x3c/0x60
[ 132.480817] ? vfs_write+0xb6/0x1a0
[ 132.480817] ? ksys_write+0x5f/0xe0
[ 132.480817] ? do_syscall_64+0x5b/0x1c0
[ 132.480817] ? entry_SYSCALL_64_after_hwframe+0x44/0xa9

>
> Patch 2 is cleanup that's made possible by patch 1.
>
> Patch 3 isn't directly related, but it conflicts with the crash cleanup
> changes, both from a code and a semantics perspective. Without the crash
> cleanup, IMO hardware_enable() should do crash_disable_local_vmclear()
> if VMXON fails, i.e. clean up after itself. But hardware_disable()
> doesn't even do crash_disable_local_vmclear() (which is what got me
> looking at that code in the first place). Basing the VMXON change on top
> of the crash cleanup avoids the debate entirely.
>
> v2:
> - Inverted the code flow, i.e. move code from loaded_vmcs_init() to
> __loaded_vmcs_clear(). Trying to share loaded_vmcs_init() with
> alloc_loaded_vmcs() was taking more code than it saved. [Paolo]
>
>
> Gory details on the crashdump bug:
>
> I verified my analysis of the NMI bug by simulating what would happen if
> an NMI arrived in the middle of list_add() and list_del(). The below
> output matches expectations, e.g. nothing hangs, the entry being added
> doesn't show up, and the entry being deleted _does_ show up.
>
> [ 8.205898] KVM: testing NMI in list_add()
> [ 8.205898] KVM: testing NMI in list_del()
> [ 8.205899] KVM: found e3
> [ 8.205899] KVM: found e2
> [ 8.205899] KVM: found e1
> [ 8.205900] KVM: found e3
> [ 8.205900] KVM: found e1
>
> static void vmx_test_list(struct list_head *list, struct list_head *e1,
> struct list_head *e2, struct list_head *e3)
> {
> struct list_head *tmp;
>
> list_for_each(tmp, list) {
> if (tmp == e1)
> pr_warn("KVM: found e1\n");
> else if (tmp == e2)
> pr_warn("KVM: found e2\n");
> else if (tmp == e3)
> pr_warn("KVM: found e3\n");
> else
> pr_warn("KVM: kaboom\n");
> }
> }
>
> static int __init vmx_init(void)
> {
> LIST_HEAD(list);
> LIST_HEAD(e1);
> LIST_HEAD(e2);
> LIST_HEAD(e3);
>
> pr_warn("KVM: testing NMI in list_add()\n");
>
> list.next->prev = &e1;
> vmx_test_list(&list, &e1, &e2, &e3);
>
> e1.next = list.next;
> vmx_test_list(&list, &e1, &e2, &e3);
>
> e1.prev = &list;
> vmx_test_list(&list, &e1, &e2, &e3);
>
> INIT_LIST_HEAD(&list);
> INIT_LIST_HEAD(&e1);
>
> list_add(&e1, &list);
> list_add(&e2, &list);
> list_add(&e3, &list);
>
> pr_warn("KVM: testing NMI in list_del()\n");
>
> e3.prev = &e1;
> vmx_test_list(&list, &e1, &e2, &e3);
>
> list_del(&e2);
> list.prev = &e1;
> vmx_test_list(&list, &e1, &e2, &e3);
> }
>
> Sean Christopherson (3):
> KVM: VMX: Always VMCLEAR in-use VMCSes during crash with kexec support
> KVM: VMX: Fold loaded_vmcs_init() into alloc_loaded_vmcs()
> KVM: VMX: Gracefully handle faults on VMXON
>
> arch/x86/kvm/vmx/vmx.c | 103 ++++++++++++++++-------------------------
> arch/x86/kvm/vmx/vmx.h | 1 -
> 2 files changed, 40 insertions(+), 64 deletions(-)
>
> --
> 2.24.1
>