Re: [PATCH] KVM: mmu_notifier: make mn_invalidate_lock non-sleeping for non-blocking invalidations

From: Sean Christopherson

Date: Fri Mar 06 2026 - 11:44:24 EST

On Wed, Mar 04, 2026, shaikh kamaluddin wrote:
> On Wed, Feb 11, 2026 at 07:34:22AM -0800, Sean Christopherson wrote:
> > On Wed, Feb 11, 2026, Sebastian Andrzej Siewior wrote:
> > It's not at all clear to me that switching mmu_lock to a raw lock would be a net
> > positive for PREEMPT_RT. OOM-killing a KVM guest on a PREEMPT_RT kernel seems
> > like a comically rare scenario. Whereas contending mmu_lock in normal operation
> > is relatively common (assuming there are even use cases for running VMs with a
> > PREEMPT_RT host kernel).
> >
> > In fact, the only reason the splat happens is because mmu_notifiers somewhat
> > artificially force an atomic context via non_block_start() since commit
> >
> > ba170f76b69d ("mm, notifier: Catch sleeping/blocking for !blockable")
> >
> > Given the massive amount of churn in KVM that would be required to fully eliminate
> > the splat, and that it's not at all obvious that it would be a good change overall,
> > at least for now:
> >
> > NAK
> >
> > I'm not fundamentally opposed to such a change, but there needs to be a _lot_
> > more analysis and justification beyond "fix CONFIG_DEBUG_ATOMIC_SLEEP=y".
> >
> Hi Sean,
> Thanks for the detailed explanation and for spelling out the broader
> issue.
> Understood on both points:
> 1. The changelog wording was too strong; PREEMPT_RT changes
> spin_lock() semantics, and the splat is fundamentally due to
> spinlocks becoming sleepable there.
> 2. Converting only mn_invalidate_lock to raw is insufficient
> since KVM can still take the mmu_lock (and other locks that sleep
> on RT) in invalidate_range_start() when the invalidation hits a
> memslot.
> Given the above, it sounds like "convert locks to raw" is not the right
> direction without significant rework and justification.
> Would an acceptable direction be to handle the !blockable notifier case
> by deferring the heavyweight invalidation work (anything that takes
> mmu_lock or may sleep on RT) to a context that may block (e.g. queued work),
> while keeping start()/end() accounting consistent with memslot changes?

No, because the _only_ case where the invalidation is non-blockable is when the
kernel is OOM-killing. Deferring the invalidations when we're OOM is likely to
make the problem *worse*.

That's the crux of my NAK. We'd be making KVM and kernel behavior worse to "fix"
a largely hypothetical issue (OOM-killing a KVM guest on an RT kernel).
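
To make the mechanism discussed in this thread concrete, here is a minimal
userspace sketch, not kernel code. It only models the idea that a non-blockable
invalidation is wrapped in non_block_start()/non_block_end(), and that a lock
which may sleep (spin_lock() on PREEMPT_RT) trips the debug check inside that
section. Every toy_* name, the counters, and the boolean parameters are
hypothetical and exist purely for illustration; the real checks live in the
scheduler and mmu_notifier code and are more involved.

```c
#include <stdbool.h>
#include <stdio.h>

/* Toy stand-in for the per-task non-block counter (assumption, not the real field). */
static int toy_non_block_count;
/* Number of simulated "sleeping in non-blocking section" splats. */
static int toy_splat_count;

static void toy_non_block_start(void) { toy_non_block_count++; }
static void toy_non_block_end(void)   { toy_non_block_count--; }

/* Toy might_sleep(): records a splat if called inside a non-block section. */
static void toy_might_sleep(const char *what)
{
    if (toy_non_block_count > 0) {
        toy_splat_count++;
        fprintf(stderr, "splat: %s may sleep in a non-blocking section\n", what);
    }
}

/* Toy spin_lock(): only in the modeled PREEMPT_RT case can it sleep. */
static void toy_spin_lock(bool preempt_rt)
{
    if (preempt_rt)
        toy_might_sleep("spin_lock");
}

/*
 * Toy invalidate_range_start(): per the thread, the non-blockable case
 * corresponds to OOM-killing. Returns how many splats this call produced.
 */
int toy_invalidate_range_start(bool blockable, bool preempt_rt)
{
    int before = toy_splat_count;

    if (!blockable)
        toy_non_block_start();

    toy_spin_lock(preempt_rt);  /* stands in for taking mmu_lock */

    if (!blockable)
        toy_non_block_end();

    return toy_splat_count - before;
}
```

In this model only the combination of a non-blockable invalidation and a
PREEMPT_RT-style sleeping spinlock reports a splat; a blockable invalidation,
or a non-RT spinlock, reports none. That matches the point above that the
warning is confined to the OOM path on RT kernels.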