Re: [RFC PATCH v4 0/2] Deprecate BUG() in pte_list_remove() in shadow mmu

From: David Matlack
Date: Mon Dec 12 2022 - 19:10:07 EST


On Mon, Dec 12, 2022 at 8:46 AM Sean Christopherson <seanjc@xxxxxxxxxx> wrote:
>
> On Mon, Dec 12, 2022, Mingwei Zhang wrote:
> > On Fri, Dec 09, 2022, David Matlack wrote:
> > > On Tue, Nov 29, 2022 at 07:12:35PM +0000, Mingwei Zhang wrote:
> > > > Deprecate BUG() in pte_list_remove() in shadow mmu to avoid crashing a
> > > > physical machine. There are several reasons and motivations to do so:
> > > >
> > > > MMU bug is difficult to discover due to various racing conditions and
> > > > corner cases and thus it extremely hard to debug. The situation gets much
> > > > worse when it triggers the shutdown of a host. Host machine crash might
> > > > eliminates everything including the potential clues for debugging.
> > > >
> > > > From cloud computing service perspective, BUG() or BUG_ON() is probably no
> > > > longer appropriate as the host reliability is top priority. Crashing the
> > > > physical machine is almost never a good option as it eliminates innocent
> > > > VMs and cause service outage in a larger scope. Even worse, if attacker can
> > > > reliably triggers this code by diverting the control flow or corrupting the
> > > > memory, then this becomes vm-of-death attack. This is a huge attack vector
> > > > to cloud providers, as the death of one single host machine is not the end
> > > > of the story. Without manual interferences, a failed cloud job may be
> > > > dispatched to other hosts and continue host crashes until all of them are
> > > > dead.
> > >
> > > My only concern with using KVM_BUG() is whether the machine can keep
> > > running correctly after this warning has been hit. In other words, are
> > > we sure the damage is contained to just this VM?
>
> Hmm, good point. The counter-argument is that KVM doesn't BUG() in get_mmio_spte()
> when a non-MMIO SPTE has reserved bits set, and as we've seen internally in multiple
> splats where the reserved bits appear to be set by stack overflow, that has a much,
> much higher probability of being a symptom of data corruption.
>
> That said, that's more of a reason to change get_mmio_spte() than it is to ignore
> potential data corruption in this case. KVM arguably should kill the VM if
> get_mmio_spte() fails too.
>
> What about explicitly treating both get_mmio_spte() and this as potential data
> corruption? E.g. something like this, and then use the DATA_CORRUPTION variant
> in pte_list_remove()?
>
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 2055e04b8f89..1cb69c6d186b 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -4075,6 +4075,7 @@ static bool get_mmio_spte(struct kvm_vcpu *vcpu, u64 addr, u64 *sptep)
> pr_err("------ spte = 0x%llx level = %d, rsvd bits = 0x%llx",
> sptes[level], level,
> get_rsvd_bits(rsvd_check, sptes[level], level));
> + KVM_BUG_ON_DATA_CORRUPTION(reserved, vcpu->kvm);
> }
>
> return reserved;
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index f16c4689322b..5c4a06f66f46 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -863,6 +863,17 @@ static inline void kvm_vm_bugged(struct kvm *kvm)
> unlikely(__ret); \
> })
>
> +#define KVM_BUG_ON_DATA_CORRUPTION(cond, kvm) \
> +({ \
> + int __ret = (cond); \
> + \
> + if (IS_ENABLED(CONFIG_BUG_ON_DATA_CORRUPTION)) \
> + BUG_ON(__ret); \
> + else if (WARN_ON_ONCE(__ret && !(kvm)->vm_bugged)) \
> + kvm_vm_bugged(kvm); \
> + unlikely(__ret); \
> +})
> +
> static inline void kvm_vcpu_srcu_read_lock(struct kvm_vcpu *vcpu)
> {
> #ifdef CONFIG_PROVE_RCU

That sounds reasonable to me.