Re: [RFC PATCH v4 0/2] Deprecate BUG() in pte_list_remove() in shadow mmu

From: Mingwei Zhang
Date: Sun Dec 11 2022 - 23:36:23 EST


On Fri, Dec 09, 2022, David Matlack wrote:
> On Tue, Nov 29, 2022 at 07:12:35PM +0000, Mingwei Zhang wrote:
> > Deprecate BUG() in pte_list_remove() in shadow mmu to avoid crashing a
> > physical machine. There are several reasons and motivations to do so:
> >
> > MMU bug is difficult to discover due to various racing conditions and
> > corner cases and thus it extremely hard to debug. The situation gets much
> > worse when it triggers the shutdown of a host. Host machine crash might
> > eliminates everything including the potential clues for debugging.
> >
> > From cloud computing service perspective, BUG() or BUG_ON() is probably no
> > longer appropriate as the host reliability is top priority. Crashing the
> > physical machine is almost never a good option as it eliminates innocent
> > VMs and cause service outage in a larger scope. Even worse, if attacker can
> > reliably triggers this code by diverting the control flow or corrupting the
> > memory, then this becomes vm-of-death attack. This is a huge attack vector
> > to cloud providers, as the death of one single host machine is not the end
> > of the story. Without manual interferences, a failed cloud job may be
> > dispatched to other hosts and continue host crashes until all of them are
> > dead.
>
> My only concern with using KVM_BUG() is whether the machine can keep
> running correctly after this warning has been hit. In other words, are
> we sure the damage is contained to just this VM?
>
> If, for example, the KVM_BUG() was triggered by a use-after-free, then
> there might be corrupted memory floating around in the machine.
>

David,

Your concern is quite reasonable. But given that both rmap and spte are
pointers/data structures managed by individual VMs, i.e., none of them
are global pointers, use-after-free is unlikely happening on cross-VM
cases. Even if there is, then shuting down those corrupted VMs is feasible
here, since pte_list_remove() basically does the checking.
> What are some instances where we've seen these BUG_ON()s get triggered?
> For those instances, would it actually be safe to just kill the current
> VM and keep the rest of the machine running?
>
> >
> > For the above reason, we propose the replacement of BUG() in
> > pte_list_remove() with KVM_BUG() to crash just the VM itself.
>
> How did you test this series?

I used a simple test case to test the series:

diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
index 0f6455072055..d4b993b26b96 100644
--- a/arch/x86/kvm/mmu/paging_tmpl.h
+++ b/arch/x86/kvm/mmu/paging_tmpl.h
@@ -701,7 +701,7 @@ static int FNAME(fetch)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
if (fault->nx_huge_page_workaround_enabled)
disallowed_hugepage_adjust(fault, *it.sptep, it.level);

- base_gfn = fault->gfn & ~(KVM_PAGES_PER_HPAGE(it.level) - 1);
+ base_gfn = fault->gfn & ~(KVM_PAGES_PER_HPAGE(it.level) - 1) - 1;
if (it.level == fault->goal_level)
break;

On the testing machine, I launched a L1 VM and a L2 VM within it. The L2
will trigger the above bug in shadow MMU and I got the following error
in L0 kernel dmesg as shown below. L1 and L2 hangs with high CPU usage
for a while and after a couple of seconds, the L1 VM dies properly. The
machine is still alive and subsequent VM operations are all good
(launch/kill).

[ 1678.043378] ------------[ cut here ]------------
[ 1678.043381] gfn mismatch under direct page 1041bf (expected 10437e, got 1043be)
[ 1678.043386] WARNING: CPU: 4 PID: 23430 at arch/x86/kvm/mmu/mmu.c:737 kvm_mmu_page_set_translation+0x131/0x140
[ 1678.043395] Modules linked in: kvm_intel vfat fat i2c_mux_pca954x i2c_mux spidev cdc_acm xhci_pci xhci_hcd sha3_generic gq(O)
[ 1678.043404] CPU: 4 PID: 23430 Comm: VCPU-7 Tainted: G S O 6.1.0-smp-DEV #5
[ 1678.043406] Hardware name: Google LLC Indus/Indus_QC_02, BIOS 30.12.6 02/14/2022
[ 1678.043407] RIP: 0010:kvm_mmu_page_set_translation+0x131/0x140
[ 1678.043411] Code: 0f 44 e0 4c 8b 6b 28 48 89 df 44 89 f6 e8 b7 fb ff ff 48 c7 c7 1b 5a 2f 82 4c 89 e6 4c 89 ea 48 89 c1 4d 89 f8 e8 9f 39 0c 00 <0f> 0b eb ac 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 48
[ 1678.043413] RSP: 0018:ffff88811ba87918 EFLAGS: 00010246
[ 1678.043415] RAX: 1bdd851636664d00 RBX: ffff888118602e60 RCX: 0000000000000027
[ 1678.043416] RDX: 0000000000000002 RSI: c0000000ffff7fff RDI: ffff8897e0320488
[ 1678.043417] RBP: ffff88811ba87940 R08: 0000000000000000 R09: ffffffff82b2e6f0
[ 1678.043418] R10: 00000000ffff7fff R11: 0000000000000000 R12: ffffffff822e89da
[ 1678.043419] R13: 00000000001041bf R14: 00000000000001bf R15: 00000000001043be
[ 1678.043421] FS: 00007fee198ec700(0000) GS:ffff8897e0300000(0000) knlGS:0000000000000000
[ 1678.043422] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1678.043424] CR2: 0000000000000000 CR3: 0000001857c34005 CR4: 00000000003726e0
[ 1678.043425] Call Trace:
[ 1678.043426] <TASK>
[ 1678.043428] __rmap_add+0x8a/0x270
[ 1678.043432] mmu_set_spte+0x250/0x340
[ 1678.043435] ept_fetch+0x8ad/0xc00
[ 1678.043437] ept_page_fault+0x265/0x2f0
[ 1678.043440] kvm_mmu_page_fault+0xfa/0x2d0
[ 1678.043443] handle_ept_violation+0x135/0x2e0 [kvm_intel]
[ 1678.043455] ? handle_desc+0x20/0x20 [kvm_intel]
[ 1678.043462] __vmx_handle_exit+0x1c3/0x480 [kvm_intel]
[ 1678.043468] vmx_handle_exit+0x12/0x40 [kvm_intel]
[ 1678.043474] vcpu_enter_guest+0xbb3/0xf80
[ 1678.043477] ? complete_fast_pio_in+0xcc/0x160
[ 1678.043480] kvm_arch_vcpu_ioctl_run+0x3b0/0x770
[ 1678.043481] kvm_vcpu_ioctl+0x52d/0x610
[ 1678.043486] ? kvm_on_user_return+0x46/0xd0
[ 1678.043489] __se_sys_ioctl+0x77/0xc0
[ 1678.043492] __x64_sys_ioctl+0x1d/0x20
[ 1678.043493] do_syscall_64+0x3d/0x80
[ 1678.043497] ? sysvec_apic_timer_interrupt+0x49/0x90
[ 1678.043499] entry_SYSCALL_64_after_hwframe+0x63/0xcd
[ 1678.043501] RIP: 0033:0x7fee3ebf0347
[ 1678.043503] Code: 5d c3 cc 48 8b 05 f9 2f 07 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 cc cc cc cc cc cc cc cc cc cc b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d c9 2f 07 00 f7 d8 64 89 01 48
[ 1678.043505] RSP: 002b:00007fee198e8998 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[ 1678.043507] RAX: ffffffffffffffda RBX: 0000555308e7e4d0 RCX: 00007fee3ebf0347
[ 1678.043507] RDX: 0000000000000000 RSI: 000000000000ae80 RDI: 00000000000000b0
[ 1678.043508] RBP: 00007fee198e89c0 R08: 000055530943d920 R09: 00000000000003fa
[ 1678.043509] R10: 0000555307349b00 R11: 0000000000000246 R12: 00000000000000b0
[ 1678.043510] R13: 00005574c1a1de88 R14: 00007fee198e8a27 R15: 0000000000000000
[ 1678.043511] </TASK>
[ 1678.043512] ---[ end trace 0000000000000000 ]---
[ 5313.657064] ------------[ cut here ]------------
[ 5313.657067] no rmap for 0000000071a2f138 (many->many)
[ 5313.657071] WARNING: CPU: 43 PID: 23398 at arch/x86/kvm/mmu/mmu.c:983 pte_list_remove+0x17a/0x190
[ 5313.657080] Modules linked in: kvm_intel vfat fat i2c_mux_pca954x i2c_mux spidev cdc_acm xhci_pci xhci_hcd sha3_generic gq(O)
[ 5313.657088] CPU: 43 PID: 23398 Comm: kvm-nx-lpage-re Tainted: G S W O 6.1.0-smp-DEV #5
[ 5313.657090] Hardware name: Google LLC Indus/Indus_QC_02, BIOS 30.12.6 02/14/2022
[ 5313.657092] RIP: 0010:pte_list_remove+0x17a/0x190
[ 5313.657095] Code: cf e4 01 01 48 c7 c7 4d 3c 32 82 e8 70 5e 0c 00 0f 0b e9 0a ff ff ff c6 05 d4 cf e4 01 01 48 c7 c7 9e de 33 82 e8 56 5e 0c 00 <0f> 0b 84 db 75 c8 e9 ec fe ff ff 66 66 2e 0f 1f 84 00 00 00 00 00
[ 5313.657097] RSP: 0018:ffff88986d5d3c30 EFLAGS: 00010246
[ 5313.657099] RAX: 1ebf71ba511d3100 RBX: 0000000000000000 RCX: 0000000000000027
[ 5313.657101] RDX: 0000000000000002 RSI: c0000000ffff7fff RDI: ffff88afdf3e0488
[ 5313.657102] RBP: ffff88986d5d3c40 R08: 0000000000000000 R09: ffffffff82b2e6f0
[ 5313.657104] R10: 00000000ffff7fff R11: 40000000ffff8a28 R12: 0000000000000000
[ 5313.657105] R13: ffff888118602000 R14: ffffc90020e1e000 R15: ffff88815df33030
[ 5313.657106] FS: 0000000000000000(0000) GS:ffff88afdf3c0000(0000) knlGS:0000000000000000
[ 5313.657107] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 5313.657109] CR2: 000017c92b50f1b8 CR3: 000000006f40a001 CR4: 00000000003726e0
[ 5313.657110] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 5313.657111] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 5313.657112] Call Trace:
[ 5313.657113] <TASK>
[ 5313.657114] drop_spte+0x175/0x180
[ 5313.657117] mmu_page_zap_pte+0xfd/0x130
[ 5313.657119] __kvm_mmu_prepare_zap_page+0x290/0x6e0
[ 5313.657122] ? newidle_balance+0x228/0x3b0
[ 5313.657126] kvm_nx_huge_page_recovery_worker+0x266/0x360
[ 5313.657129] kvm_vm_worker_thread+0x93/0x150
[ 5313.657134] ? kvm_mmu_post_init_vm+0x40/0x40
[ 5313.657136] ? kvm_vm_create_worker_thread+0x120/0x120
[ 5313.657139] kthread+0x10d/0x120
[ 5313.657141] ? kthread_blkcg+0x30/0x30
[ 5313.657142] ret_from_fork+0x1f/0x30
[ 5313.657156] </TASK>
[ 5313.657156] ---[ end trace 0000000000000000 ]---