Re: [RFC PATCH v4 0/2] Deprecate BUG() in pte_list_remove() in shadow mmu

From: Sean Christopherson
Date: Mon Dec 12 2022 - 23:12:40 EST


On Tue, Dec 13, 2022, Mingwei Zhang wrote:
> On Mon, Dec 12, 2022, David Matlack wrote:
> > > What about explicitly treating both get_mmio_spte() and this as potential data
> > > corruption? E.g. something like this, and then use the DATA_CORRUPTION variant
> > > in pte_list_remove()?
> > >
> > > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > > index 2055e04b8f89..1cb69c6d186b 100644
> > > --- a/arch/x86/kvm/mmu/mmu.c
> > > +++ b/arch/x86/kvm/mmu/mmu.c
> > > @@ -4075,6 +4075,7 @@ static bool get_mmio_spte(struct kvm_vcpu *vcpu, u64 addr, u64 *sptep)
> > >  			pr_err("------ spte = 0x%llx level = %d, rsvd bits = 0x%llx",
> > >  			       sptes[level], level,
> > >  			       get_rsvd_bits(rsvd_check, sptes[level], level));
> > > +		KVM_BUG_ON_DATA_CORRUPTION(reserved, vcpu->kvm);

Random aside, this would be better:

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 99c40617d325..d9c46f2a6748 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4090,7 +4090,7 @@ static int handle_mmio_page_fault(struct kvm_vcpu *vcpu, u64 addr, bool direct)
 		return RET_PF_EMULATE;
 
 	reserved = get_mmio_spte(vcpu, addr, &spte);
-	if (WARN_ON(reserved))
+	if (KVM_BUG_ON_DATA_CORRUPTION(reserved, vcpu->kvm))
 		return -EINVAL;
 
 	if (is_mmio_spte(spte)) {

> > >  	}
> > > 
> > >  	return reserved;
> > > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > > index f16c4689322b..5c4a06f66f46 100644
> > > --- a/include/linux/kvm_host.h
> > > +++ b/include/linux/kvm_host.h
> > > @@ -863,6 +863,17 @@ static inline void kvm_vm_bugged(struct kvm *kvm)
> > >  	unlikely(__ret);					\
> > >  })
> > > 
> > > +#define KVM_BUG_ON_DATA_CORRUPTION(cond, kvm)			\
> > > +({								\
> > > +	int __ret = (cond);					\
> > > +								\
> > > +	if (IS_ENABLED(CONFIG_BUG_ON_DATA_CORRUPTION))		\
> > > +		BUG_ON(__ret);					\
> > > +	else if (WARN_ON_ONCE(__ret && !(kvm)->vm_bugged))	\
> > > +		kvm_vm_bugged(kvm);				\
> > > +	unlikely(__ret);					\
> > > +})
> > > +
> > >  static inline void kvm_vcpu_srcu_read_lock(struct kvm_vcpu *vcpu)
> > >  {
> > >  #ifdef CONFIG_PROVE_RCU
> >
> > That sounds reasonable to me.
>
> Actually, I disagree after thinking about it for a while. Since
> Google turns on CONFIG_BUG_ON_DATA_CORRUPTION by default, this
> KVM_BUG_ON_DATA_CORRUPTION() is literally a BUG_ON(). Then there is no
> point. The purpose of the series is to save the host from crashing.

Sure, but Google is not the only consumer of KVM. E.g. I'm guessing many distros
use CONFIG_BUG_ON_DATA_CORRUPTION=n. Any user that doesn't have the infrastructure
to capture crash dumps is likely better served overall by the WARN-and-continue
behavior.
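
To be clear, the WARN-and-continue path doesn't touch the host, it only takes
down the offending VM.  kvm_vm_bugged() is roughly the below (paraphrased from
include/linux/kvm_host.h, see the header for the exact form):

	/*
	 * Mark the VM as bugged and kick all vCPUs out of the guest so the VM
	 * can't make further progress; the host itself keeps running.
	 */
	static inline void kvm_vm_bugged(struct kvm *kvm)
	{
		kvm->vm_bugged = true;
		kvm_make_all_cpus_request(kvm, KVM_REQ_VM_DEAD);
	}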

> If we follow the above changes to get_mmio_spte() as well, we are
> introducing more BUG()s, i.e., more chances to crash the host machine. So
> I cannot go along with this idea.

For get_mmio_spte() specifically, I am quite confident that in production kernels,
the benefits of a BUG() far outweigh the risk of unnecessarily panicking the host.
There have been recent-ish bugs that escaped into kernel releases, but the first
one (wrong root level) escaped only because it affected an esoteric, rarely used
configuration (shadow paging with a 32-bit host kernel), and the second one was
a fix for a purely theoretical bug, also limited to 32-bit kernels.

39b4d43e6003 ("KVM: x86/mmu: Get root level from walkers when retrieving MMIO SPTE")
2aa078932ff6 ("KVM: x86/mmu: Use -1 to flag an undefined spte in get_mmio_spte()")

> In fact, there could be many reasons why RMAPs and SPTEs get messed up.
> Many times, as illustrated by my example, they are not data corruption.
> They are just errors due to code refactoring or miscalculations in
> certain tricky corner cases, e.g., an off-by-one. So, blindly treating
> them as data corruption is overkill.

For SPTEs in prod, not really. Something has gone really wrong if KVM messes up a
SPTE to the point where _hardware_ unexpectedly observes a reserved bit in a
production build. We've had bugs where we've botched the software sanity checks,
e.g. see commit

6c6ab524cfae ("KVM: x86/mmu: Treat NX as a valid SPTE bit for NPT")

but except for rarely used setups (see above), unexpected reserved SPTE faults
in production pretty much won't happen unless there's memory corruption or a CPU
error.

For RMAPs, I agree it's less clear cut, but at the same time the status quo is to
always BUG(), so from an upstream perspective allowing the admin to avoid the
BUG() is an unqualified improvement.
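
E.g. the pte_list_remove() side of the conversion could look something like the
below.  Purely illustrative, not the actual patch; it assumes the series plumbs
@kvm into the rmap helpers so the macro has a VM to mark as bugged:

	static void pte_list_remove(struct kvm *kvm,
				    struct kvm_rmap_head *rmap_head, u64 *sptep)
	{
		/*
		 * An empty rmap for a supposedly present SPTE means the MMU
		 * metadata is out of sync.  With the proposed macro this either
		 * BUG()s (CONFIG_BUG_ON_DATA_CORRUPTION=y) or kills just the VM,
		 * instead of unconditionally bringing down the host.
		 */
		if (KVM_BUG_ON_DATA_CORRUPTION(!rmap_head->val, kvm))
			return;

		/* ... rest of the removal logic unchanged ... */
	}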

> This is in comparison to the list traversal bug, which clearly shows data
> corruption.

That depends on what is considered "data corruption". In this case,
CONFIG_BUG_ON_DATA_CORRUPTION is about kernel memory structures being
corrupted; it's not referring to generic corruption of user data.

Select this option if the kernel should BUG when it encounters
data corruption in kernel memory structures when they get checked
for validity.

And IMO that definition fits KVM's behavior perfectly. get_mmio_spte() detects
corrupted SPTEs, pte_list_remove() detects corrupted RMAP lists.
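
FWIW, this config knob already drives the same BUG-vs-WARN choice elsewhere in
the kernel.  CHECK_DATA_CORRUPTION() in include/linux/bug.h is roughly the
below (paraphrased from memory, see the header for the exact form):

	/* __must_check helper so callers can't ignore the result. */
	static inline __must_check bool check_data_corruption(bool v) { return v; }

	#define CHECK_DATA_CORRUPTION(condition, fmt, ...)			 \
		check_data_corruption(({					 \
			bool corruption = unlikely(condition);			 \
			if (corruption) {					 \
				if (IS_ENABLED(CONFIG_BUG_ON_DATA_CORRUPTION)) { \
					pr_err(fmt, ##__VA_ARGS__);		 \
					BUG();					 \
				} else						 \
					WARN(1, fmt, ##__VA_ARGS__);		 \
			}							 \
			corruption;						 \
		}))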

If "data corruption" is interpeted to be corruption of user data than the list
behavior is wildly overstepping. E.g. if a subsystem has a bug where an object
is added to two different lists, list_add() will yell about the issue but because
all writes to the list_head are aborted with CONFIG_DEBUG_LIST=y, it's entirely
possible for the bug to be benign or limited to the subsystem.

A somewhat contrived example would be if a (complex) allocator and its caller both
add the object to the same list; the second list_add() is problematic and would
trigger BUG() with CONFIG_BUG_ON_DATA_CORRUPTION=y, but with CONFIG_DEBUG_LIST=y
the bug would be completely benign.
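
A sketch of that scenario (the names are made up purely for illustration):

	#include <linux/list.h>
	#include <linux/slab.h>

	struct foo {
		struct list_head link;
	};

	static LIST_HEAD(foo_list);

	static struct foo *foo_alloc(void)
	{
		struct foo *f = kzalloc(sizeof(*f), GFP_KERNEL);

		if (f)
			list_add(&f->link, &foo_list);	/* allocator links the object */
		return f;
	}

	static struct foo *foo_create(void)
	{
		struct foo *f = foo_alloc();

		/*
		 * Buggy double add.  With CONFIG_DEBUG_LIST=y the "double add"
		 * check fires, the list_head writes are skipped, and the object
		 * simply stays on the list once, i.e. the bug is benign.  With
		 * CONFIG_BUG_ON_DATA_CORRUPTION=y the very same check BUG()s.
		 */
		if (f)
			list_add(&f->link, &foo_list);
		return f;
	}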

> So, alternatively, I think it would be reasonable to shut down the KVM
> instance if we see things messed up in MMIO SPTEs as well, but please
> do not crash the host for that.

Again, this doesn't unconditionally bring down the host. Google is _choosing_
to BUG() on _potential_ data corruption. I'm open to not tying KVM's behavior to
CONFIG_BUG_ON_DATA_CORRUPTION if there's a good argument for doing so, but if we
(Google) plan on turning on the BUG() behavior for KVM then I don't see the point.

In other words, we should discuss internally whether or not we want/need to push
for a separate control, but from an upstream perspective I think the proposed
behavior is entirely reasonable.