Re: [PATCH 1/2] KVM: Add fault injection for some MMU operations
From: Sean Christopherson
Date: Wed Mar 04 2026 - 10:52:28 EST
On Wed, Aug 06, 2025, James Houghton wrote:
> Provide fault injection hooks for three operations:
> 1. For all architectures, retries due to invalidation notifiers.
> 2. For x86, TDP MMU cmpxchg updates for SPTEs.
> 3. For x86, TDP MMU SPTE iteration rescheduling.
>
> For all of these, fault injection can induce the uncommon cases: (1)
> that an invalidation occurred, (2) a cmpxchg failed, and (3) that the
> MMU lock is contended.
...
> @@ -689,7 +691,8 @@ static inline int __must_check __tdp_mmu_set_spte_atomic(struct kvm *kvm,
> * operates on fresh data, e.g. if it retries
> * tdp_mmu_set_spte_atomic()
> */
> - if (!try_cmpxchg64(sptep, &iter->old_spte, new_spte))
> + if (tdp_mmu_cmpxchg_should_fail() ||
> + !try_cmpxchg64(sptep, &iter->old_spte, new_spte))
As discovered internally, this can cause the WARN_ON_ONCE() at the end of
kvm_tdp_mmu_zap_possible_nx_huge_page() to fire, because the flow *guarantees*
success.
Thinking about this all a bit more, while I *really* like the idea of triggering
uncommon paths in theory, I'm having strong reservations about enabling this in
upstream, as I'm worried the signal:noise ratio could be abysmal.
For many configurations and setups, mmu_notifier invalidations and MMU lock
contention are actually quite common, i.e. in the aggregate, KVM already gets
good coverage of those paths. Giving userspace a way to deliberately induce
retry for those cases doesn't seem like it will add much value, while at the
same time it could lead to a rash of "bugs" due to e.g. syzkaller setting
extreme retry percentages and manufacturing scenarios, like stuck tasks, that
can't happen in practice.
The CMPXCHG thing definitely has value, but as shown above, even that is
error-prone to some degree.
So if we want to take this forward, I think we should limit it to CMPXCHG, figure
out a clean way for callers to prevent failure injection, and set a fairly high
bar for extending failure injection to other areas. E.g. require that injection
exposes a real KVM bug that is extremely rare in practice, but relatively easy to
trigger with artificial failure, as was the case with the CMPXCHG injection.
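
To make the "callers can prevent failure injection" idea a bit more concrete,
here is a rough userspace model, not KVM code. Everything in it is hypothetical
and for illustration only: the names set_spte_atomic(), cmpxchg_should_fail(),
inject_interval and the allow_inject parameter are made up, the "fail every Nth
attempt" policy is a stand-in for whatever the fault injection framework would
decide, and C11 atomics stand in for try_cmpxchg64().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical knob: fail every Nth injectable update, 0 disables injection.
 * A real implementation would defer to the fault injection framework instead.
 */
static unsigned int inject_interval;
static unsigned int inject_count;

static bool cmpxchg_should_fail(void)
{
	if (!inject_interval)
		return false;

	return (++inject_count % inject_interval) == 0;
}

/*
 * Userspace stand-in for an atomic SPTE update.  Returns true on success.
 * Callers whose flow guarantees success pass allow_inject=false so that an
 * artificial failure can never be reported to them.
 */
static bool set_spte_atomic(_Atomic uint64_t *sptep, uint64_t *old_spte,
			    uint64_t new_spte, bool allow_inject)
{
	/* Injected failure: memory and *old_spte are left untouched. */
	if (allow_inject && cmpxchg_should_fail())
		return false;

	/* On a genuine failure, *old_spte is refreshed with the current value. */
	return atomic_compare_exchange_strong(sptep, old_spte, new_spte);
}
```

With inject_interval set to 1, a caller passing allow_inject=true sees every
update fail and has to retry, while a caller passing allow_inject=false (the
analogue of a flow that *guarantees* success) still succeeds as long as its
snapshot of the old value is current. Whether an explicit parameter, a separate
helper, or something else is the cleanest way to express the opt-out is an open
question; this only shows the shape of the idea.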