Re: [RFC PATCH v4 0/2] Deprecate BUG() in pte_list_remove() in shadow mmu

From: Mingwei Zhang
Date: Mon Dec 12 2022 - 20:40:04 EST

Next message: pr-tracker-bot: "Re: [GIT PULL] non-MM commits for 6.2-rc1"
Previous message: David Sterba: "Re: [GIT PULL] Btrfs updates for 6.2 (updated merge log)"
In reply to: David Matlack: "Re: [RFC PATCH v4 0/2] Deprecate BUG() in pte_list_remove() in shadow mmu"
Next in thread: Sean Christopherson: "Re: [RFC PATCH v4 0/2] Deprecate BUG() in pte_list_remove() in shadow mmu"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On Mon, Dec 12, 2022, David Matlack wrote:
> On Mon, Dec 12, 2022 at 8:46 AM Sean Christopherson <seanjc@xxxxxxxxxx> wrote:
> >
> > On Mon, Dec 12, 2022, Mingwei Zhang wrote:
> > > On Fri, Dec 09, 2022, David Matlack wrote:
> > > > On Tue, Nov 29, 2022 at 07:12:35PM +0000, Mingwei Zhang wrote:
> > > > > Deprecate BUG() in pte_list_remove() in shadow mmu to avoid crashing a
> > > > > physical machine. There are several reasons and motivations to do so:
> > > > >
> > > > > MMU bug is difficult to discover due to various racing conditions and
> > > > > corner cases and thus it extremely hard to debug. The situation gets much
> > > > > worse when it triggers the shutdown of a host. Host machine crash might
> > > > > eliminates everything including the potential clues for debugging.
> > > > >
> > > > > From cloud computing service perspective, BUG() or BUG_ON() is probably no
> > > > > longer appropriate as the host reliability is top priority. Crashing the
> > > > > physical machine is almost never a good option as it eliminates innocent
> > > > > VMs and cause service outage in a larger scope. Even worse, if attacker can
> > > > > reliably triggers this code by diverting the control flow or corrupting the
> > > > > memory, then this becomes vm-of-death attack. This is a huge attack vector
> > > > > to cloud providers, as the death of one single host machine is not the end
> > > > > of the story. Without manual interferences, a failed cloud job may be
> > > > > dispatched to other hosts and continue host crashes until all of them are
> > > > > dead.
> > > >
> > > > My only concern with using KVM_BUG() is whether the machine can keep
> > > > running correctly after this warning has been hit. In other words, are
> > > > we sure the damage is contained to just this VM?
> >
> > Hmm, good point. The counter-argument is that KVM doesn't BUG() in get_mmio_spte()
> > when a non-MMIO SPTE has reserved bits set, and as we've seen internally in multiple
> > splats where the reserved bits appear to be set by stack overflow, that has a much,
> > much higher probability of being a symptom of data corruption.
> >
> > That said, that's more of a reason to change get_mmio_spte() than it is to ignore
> > potential data corruption in this case. KVM arguably should kill the VM if
> > get_mmio_spte() fails too.
> >
> > What about explicitly treating both get_mmio_spte() and this as potential data
> > corruption? E.g. something like this, and then use the DATA_CORRUPTION variant
> > in pte_list_remove()?
> >
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index 2055e04b8f89..1cb69c6d186b 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -4075,6 +4075,7 @@ static bool get_mmio_spte(struct kvm_vcpu *vcpu, u64 addr, u64 *sptep)
> > pr_err("------ spte = 0x%llx level = %d, rsvd bits = 0x%llx",
> > sptes[level], level,
> > get_rsvd_bits(rsvd_check, sptes[level], level));
> > + KVM_BUG_ON_DATA_CORRUPTION(reserved, vcpu->kvm);
> > }
> >
> > return reserved;
> > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > index f16c4689322b..5c4a06f66f46 100644
> > --- a/include/linux/kvm_host.h
> > +++ b/include/linux/kvm_host.h
> > @@ -863,6 +863,17 @@ static inline void kvm_vm_bugged(struct kvm *kvm)
> > unlikely(__ret); \
> > })
> >
> > +#define KVM_BUG_ON_DATA_CORRUPTION(cond, kvm) \
> > +({ \
> > + int __ret = (cond); \
> > + \
> > + if (IS_ENABLED(CONFIG_BUG_ON_DATA_CORRUPTION)) \
> > + BUG_ON(__ret); \
> > + else if (WARN_ON_ONCE(__ret && !(kvm)->vm_bugged)) \
> > + kvm_vm_bugged(kvm); \
> > + unlikely(__ret); \
> > +})
> > +
> > static inline void kvm_vcpu_srcu_read_lock(struct kvm_vcpu *vcpu)
> > {
> > #ifdef CONFIG_PROVE_RCU
>
> That sounds reasonable to me.

Actually, I disagree after thinking about it for a while. Since
Google turns on CONFIG_BUG_ON_DATA_CORRUPTION on default, this
KVM_BUG_ON_DATA_CORRUPTION() is literally BUG_ON(). Then there is no
point. The purpose of the series is to save the host from crash.

If we follow the upper changes on get_mmio_spte() as well, we are
introducing more BUG()!, ie., more chances to crash the host machine. So
I cannot stay with this idea.

In fact, there could many reasons why RMAPs and SPTEs are messed. In
many times, as illustrated by my example, they are not data corruptions.
They are just errors due to code refactoring or some miscalculations
under certain tricky corner situations, e.g., offset by 1. So, treating
them blindly as data corruptions is an overkill. This is a comparison to
the list traversal bug, which clearly shows data corruptions.

So, alternatively, I think it should be reasonable to shutdown the KVM
instance if we see things messed up in MMIO SPTEs as well, but please
don not the host for that.

Next message: pr-tracker-bot: "Re: [GIT PULL] non-MM commits for 6.2-rc1"
Previous message: David Sterba: "Re: [GIT PULL] Btrfs updates for 6.2 (updated merge log)"
In reply to: David Matlack: "Re: [RFC PATCH v4 0/2] Deprecate BUG() in pte_list_remove() in shadow mmu"
Next in thread: Sean Christopherson: "Re: [RFC PATCH v4 0/2] Deprecate BUG() in pte_list_remove() in shadow mmu"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]