Re: [RFC PATCH v4 0/2] Deprecate BUG() in pte_list_remove() in shadow mmu

From: Sean Christopherson
Date: Mon Dec 12 2022 - 11:46:34 EST


On Mon, Dec 12, 2022, Mingwei Zhang wrote:
> On Fri, Dec 09, 2022, David Matlack wrote:
> > On Tue, Nov 29, 2022 at 07:12:35PM +0000, Mingwei Zhang wrote:
> > > Deprecate BUG() in pte_list_remove() in shadow mmu to avoid crashing a
> > > physical machine. There are several reasons and motivations to do so:
> > >
> > > An MMU bug is difficult to discover due to various race conditions and
> > > corner cases, and is thus extremely hard to debug. The situation gets
> > > much worse when it triggers the shutdown of a host: a host machine crash
> > > may eliminate everything, including the potential clues for debugging.
> > >
> > > From a cloud computing service perspective, BUG() or BUG_ON() is probably
> > > no longer appropriate, as host reliability is the top priority. Crashing
> > > the physical machine is almost never a good option: it takes down innocent
> > > VMs and causes a service outage in a larger scope. Even worse, if an
> > > attacker can reliably trigger this code by diverting control flow or
> > > corrupting memory, this becomes a VM-of-death attack. That is a huge
> > > attack vector for cloud providers, as the death of a single host machine
> > > is not the end of the story: without manual intervention, a failed cloud
> > > job may be dispatched to other hosts and continue crashing them until all
> > > of them are dead.
> >
> > My only concern with using KVM_BUG() is whether the machine can keep
> > running correctly after this warning has been hit. In other words, are
> > we sure the damage is contained to just this VM?

Hmm, good point. The counter-argument is that KVM doesn't BUG() in get_mmio_spte()
when a non-MMIO SPTE has reserved bits set, and as we've seen internally in multiple
splats where the reserved bits appear to have been set by a stack overflow, that
scenario has a much, much higher probability of being a symptom of data corruption.

That said, that's more of a reason to change get_mmio_spte() than it is to ignore
potential data corruption in this case. KVM arguably should kill the VM if
get_mmio_spte() fails too.

What about explicitly treating both get_mmio_spte() and this as potential data
corruption? E.g. something like this, and then use the DATA_CORRUPTION variant
in pte_list_remove()?

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 2055e04b8f89..1cb69c6d186b 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4075,6 +4075,7 @@ static bool get_mmio_spte(struct kvm_vcpu *vcpu, u64 addr, u64 *sptep)
 			pr_err("------ spte = 0x%llx level = %d, rsvd bits = 0x%llx",
 			       sptes[level], level,
 			       get_rsvd_bits(rsvd_check, sptes[level], level));
+		KVM_BUG_ON_DATA_CORRUPTION(reserved, vcpu->kvm);
 	}
 
 	return reserved;
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index f16c4689322b..5c4a06f66f46 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -863,6 +863,17 @@ static inline void kvm_vm_bugged(struct kvm *kvm)
 	unlikely(__ret);					\
 })
 
+#define KVM_BUG_ON_DATA_CORRUPTION(cond, kvm)			\
+({								\
+	bool __ret = !!(cond);					\
+								\
+	if (IS_ENABLED(CONFIG_BUG_ON_DATA_CORRUPTION))		\
+		BUG_ON(__ret);					\
+	else if (WARN_ON_ONCE(__ret && !(kvm)->vm_bugged))	\
+		kvm_vm_bugged(kvm);				\
+	unlikely(__ret);					\
+})
+
 static inline void kvm_vcpu_srcu_read_lock(struct kvm_vcpu *vcpu)
 {
 #ifdef CONFIG_PROVE_RCU
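
And then pte_list_remove() would boil down to something like the below.  Note,
this is just a sketch of the idea, not the actual patch: the pte_list_desc walk
is elided, and the exact signature (passing in @kvm) is an assumption.

static void pte_list_remove(struct kvm *kvm, u64 *spte,
			    struct kvm_rmap_head *rmap_head)
{
	/*
	 * No rmap entry for this SPTE, i.e. KVM either never tracked the
	 * SPTE or already freed its rmap, and so may be about to chase a
	 * stale pointer.  Bail instead.
	 */
	if (KVM_BUG_ON_DATA_CORRUPTION(!rmap_head->val, kvm))
		return;

	if (!(rmap_head->val & 1)) {
		/* Single entry: it must be the SPTE being removed. */
		if (KVM_BUG_ON_DATA_CORRUPTION((u64 *)rmap_head->val != spte, kvm))
			return;
		rmap_head->val = 0;
		return;
	}

	/*
	 * Multiple entries: walk the pte_list_desc chain (elided).  Failing
	 * to find the SPTE anywhere in the chain gets flagged the same way:
	 *
	 *	KVM_BUG_ON_DATA_CORRUPTION(true, kvm);
	 */
}

The difference from the current BUG()s is that each check bails out of the rmap
update instead of bringing down the host (unless CONFIG_BUG_ON_DATA_CORRUPTION=y,
in which case the old behavior is preserved).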


> > If, for example, the KVM_BUG() was triggered by a use-after-free, then
> > there might be corrupted memory floating around in the machine.
> >
>
> David,
>
> Your concern is quite reasonable. But given that both the rmap and the
> spte are pointers/data structures managed by individual VMs, i.e., none
> of them are global pointers, a use-after-free is unlikely to happen
> across VMs.

Being per-VM allocations doesn't change the behavior/impact of use-after-free.
E.g. if there is no rmap found for a SPTE then there's a non-zero chance KVM has
previously zapped the SPTE and freed the memory the SPTE pointed at, and thus KVM
might be reading/writing memory that is now owned by something else in the kernel.

> Even if there is, shutting down those corrupted VMs is feasible here,
> since pte_list_remove() basically does the checking.

But the damage may already be done. And the KVM_REQ_VM_DEAD request won't be
recognized until the next vcpu_enter_guest(), e.g. it won't prevent vCPUs
(or even this vCPU) from processing more page faults before the request is
handled.
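
For reference, "killing" the VM is asynchronous by design; paraphrasing the
helpers in include/linux/kvm_host.h:

static inline void kvm_vm_dead(struct kvm *kvm)
{
	kvm->vm_dead = true;
	kvm_make_all_cpus_request(kvm, KVM_REQ_VM_DEAD);
}

static inline void kvm_vm_bugged(struct kvm *kvm)
{
	kvm->vm_bugged = true;
	kvm_vm_dead(kvm);
}

I.e. the VM is only flagged and its vCPUs kicked; each vCPU acts on
KVM_REQ_VM_DEAD at its next request-processing point, so anything a vCPU does
between hitting the corruption and reaching that point still happens.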