Re: [PATCH] KVM: mmu_notifier: make mn_invalidate_lock non-sleeping for non-blocking invalidations
From: Paolo Bonzini
Date: Fri Mar 06 2026 - 13:14:55 EST
On 3/3/26 19:49, shaikh kamaluddin wrote:
On Wed, Feb 11, 2026 at 07:34:22AM -0800, Sean Christopherson wrote:
On Wed, Feb 11, 2026, Sebastian Andrzej Siewior wrote:
Hi Sean,
On 2026-02-09 21:45:27 [+0530], shaikh.kamal wrote:
mmu_notifier_invalidate_range_start() may be invoked via
mmu_notifier_invalidate_range_start_nonblock(), e.g. from oom_reaper(),
where sleeping is explicitly forbidden.
KVM's mmu_notifier invalidate_range_start currently takes
mn_invalidate_lock using spin_lock(). On PREEMPT_RT, spin_lock() maps
to rt_mutex and may sleep, triggering:
BUG: sleeping function called from invalid context
This violates the MMU notifier contract regardless of PREEMPT_RT;
I highly doubt that. kvm.mmu_lock is also a spinlock, and KVM has been taking
that in invalidate_range_start() since
e930bffe95e1 ("KVM: Synchronize guest physical memory map to host virtual memory map")
which was a full decade before mmu_notifiers even added the blockable concept in
93065ac753e4 ("mm, oom: distinguish blockable mode for mmu notifiers")
and even predates the current concept of a "raw" spinlock introduced by
c2f21ce2e312 ("locking: Implement new raw_spinlock")
RT kernels merely make the issue deterministic.
No, RT kernels change the rules, because suddenly a non-sleeping lock becomes
sleepable.
Fix by converting mn_invalidate_lock to a raw spinlock so that
invalidate_range_start() remains non-sleeping while preserving the
existing serialization between invalidate_range_start() and
invalidate_range_end().
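The conversion described by the changelog would presumably look something like the following untested sketch (the struct field and the mn_active_invalidate_count call site are assumed from current kvm_host.h/kvm_main.c and may not match the actual patch):

```diff
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ struct kvm {
-	spinlock_t mn_invalidate_lock;
+	raw_spinlock_t mn_invalidate_lock;
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
-	spin_lock(&kvm->mn_invalidate_lock);
+	raw_spin_lock(&kvm->mn_invalidate_lock);
 	kvm->mn_active_invalidate_count++;
-	spin_unlock(&kvm->mn_invalidate_lock);
+	raw_spin_unlock(&kvm->mn_invalidate_lock);
```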
This is insufficient. To actually "fix" this in KVM, mmu_lock would need to be
turned into a raw lock on all KVM architectures. I suspect the only reason there
haven't been bug reports is because no one trips an OOM kill on a VM while running
with CONFIG_DEBUG_ATOMIC_SLEEP=y.
That combination is required because since commit
8931a454aea0 ("KVM: Take mmu_lock when handling MMU notifier iff the hva hits a memslot")
KVM only acquires mmu_lock if the to-be-invalidated range overlaps a memslot,
i.e. affects memory that may be mapped into the guest.
E.g. this hack to simulate a non-blockable invalidation
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 7015edce5bd8..7a35a83420ec 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -739,7 +739,7 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
.handler = kvm_mmu_unmap_gfn_range,
.on_lock = kvm_mmu_invalidate_begin,
.flush_on_ret = true,
- .may_block = mmu_notifier_range_blockable(range),
+ .may_block = false,//mmu_notifier_range_blockable(range),
};
trace_kvm_unmap_hva_range(range->start, range->end);
@@ -768,6 +768,7 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
*/
gfn_to_pfn_cache_invalidate_start(kvm, range->start, range->end);
+ non_block_start();
/*
* If one or more memslots were found and thus zapped, notify arch code
* that guest memory has been reclaimed. This needs to be done *after*
@@ -775,6 +776,7 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
*/
if (kvm_handle_hva_range(kvm, &hva_range).found_memslot)
kvm_arch_guest_memory_reclaimed(kvm);
+ non_block_end();
return 0;
}
immediately triggers
BUG: sleeping function called from invalid context at kernel/locking/spinlock_rt.c:241
in_atomic(): 0, irqs_disabled(): 0, non_block: 1, pid: 4992, name: qemu
preempt_count: 0, expected: 0
RCU nest depth: 0, expected: 0
CPU: 6 UID: 1000 PID: 4992 Comm: qemu Not tainted 6.19.0-rc6-4d0917ffc392-x86_enter_mmio_stack_uaf_no_null-rt #1 PREEMPT_RT
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
Call Trace:
<TASK>
dump_stack_lvl+0x51/0x60
__might_resched+0x10e/0x160
rt_write_lock+0x49/0x310
kvm_mmu_notifier_invalidate_range_start+0x10b/0x390 [kvm]
__mmu_notifier_invalidate_range_start+0x9b/0x230
do_wp_page+0xce1/0xf30
__handle_mm_fault+0x380/0x3a0
handle_mm_fault+0xde/0x290
__get_user_pages+0x20d/0xbe0
get_user_pages_unlocked+0xf6/0x340
hva_to_pfn+0x295/0x420 [kvm]
__kvm_faultin_pfn+0x5d/0x90 [kvm]
kvm_mmu_faultin_pfn+0x31b/0x6e0 [kvm]
kvm_tdp_page_fault+0xb6/0x160 [kvm]
kvm_mmu_do_page_fault+0xee/0x1f0 [kvm]
kvm_mmu_page_fault+0x8d/0x600 [kvm]
vmx_handle_exit+0x18c/0x5a0 [kvm_intel]
kvm_arch_vcpu_ioctl_run+0xc70/0x1c90 [kvm]
kvm_vcpu_ioctl+0x2d7/0x9a0 [kvm]
__x64_sys_ioctl+0x8a/0xd0
do_syscall_64+0x5e/0x11b0
entry_SYSCALL_64_after_hwframe+0x4b/0x53
</TASK>
kvm: emulating exchange as write
It's not at all clear to me that switching mmu_lock to a raw lock would be a net
positive for PREEMPT_RT. OOM-killing a KVM guest on a PREEMPT_RT kernel seems like
a comically rare scenario, whereas contending mmu_lock in normal operation is
relatively common (assuming there are even use cases for running VMs with a
PREEMPT_RT host kernel).
In fact, the only reason the splat happens is because mmu_notifiers somewhat
artificially forces an atomic context via non_block_start() since commit
ba170f76b69d ("mm, notifier: Catch sleeping/blocking for !blockable")
Given the massive amount of churn in KVM that would be required to fully eliminate
the splat, and that it's not at all obvious that it would be a good change overall,
at least for now:
NAK
I'm not fundamentally opposed to such a change, but there needs to be a _lot_
more analysis and justification beyond "fix CONFIG_DEBUG_ATOMIC_SLEEP=y".
Thanks for the detailed explanation and for spelling out the broader
issue.
Understood on both points:
1. The changelog wording was too strong; PREEMPT_RT changes
spin_lock() semantics, and the splat is fundamentally due to
spinlocks becoming sleepable there.
2. Converting only mn_invalidate_lock to raw is insufficient,
since KVM can still take mmu_lock (and other locks that sleep
on RT) in invalidate_range_start() when the invalidation hits a
memslot.
Given the above, it sounds like "convert locks to raw" is not the right
direction without significant rework and justification.
Would an acceptable direction be to handle the !blockable notifier case
by deferring the heavyweight invalidation work (anything that takes
mmu_lock or may otherwise sleep on RT) to a context that may block
(e.g. queued work), while keeping the start()/end() accounting
consistent with memslot changes? If so, I can prototype a patch along
those lines and share it for feedback.
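The shape of that split can be illustrated with a small userspace analogy (an assumption-laden sketch, not kernel code: pthread_create() stands in for queue_work(), and a pthread mutex stands in for mmu_lock):

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/*
 * Userspace analogy of the proposed split: the !blockable notifier path
 * does only non-sleeping bookkeeping and defers the heavy work (anything
 * that takes a sleeping lock on RT) to a context that may block.
 */
static pthread_mutex_t heavy_lock = PTHREAD_MUTEX_INITIALIZER; /* "mmu_lock" */
static int invalidations_done;
static bool deferred_pending;

static void do_heavy_invalidate(void)
{
	pthread_mutex_lock(&heavy_lock);	/* may sleep on RT */
	invalidations_done++;
	pthread_mutex_unlock(&heavy_lock);
}

static void *deferred_worker(void *arg)
{
	(void)arg;
	do_heavy_invalidate();
	deferred_pending = false;
	return NULL;
}

/*
 * invalidate_start(): if the caller can block, do the work inline as
 * today; otherwise only record that an invalidation is in flight (a
 * non-sleeping operation) and hand the sleeping part to a worker.
 * Returns the worker so the caller can wait for it.
 */
static pthread_t invalidate_start(bool blockable)
{
	pthread_t worker = 0;

	if (blockable) {
		do_heavy_invalidate();
	} else {
		deferred_pending = true;	/* non-sleeping bookkeeping */
		pthread_create(&worker, NULL, deferred_worker, NULL);
	}
	return worker;
}
```

Waiting on the worker right away would of course defeat the point; in a real patch the pairing would have to come from the existing start()/end() accounting, which is exactly the part that must stay consistent with memslot changes.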
Alternatively, if you think this needs to be addressed in mmu_notifiers
(e.g. in how non_block_start() is applied), I'm happy to redirect my
efforts there. Please advise.
Have you considered an "OOM entered" callback for MMU notifiers? KVM's MMU notifier could just remove itself, for example; in fact there is code in kvm_destroy_vm() to handle that even if invalidations are unbalanced.
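The shape of such a hook can be sketched in userspace C (all names here, notifier, oom_enter, kvm_oom_enter, are invented for illustration; this is not the real mmu_notifier API):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Userspace sketch of an "OOM entered" hook: when the OOM killer engages,
 * each subscriber gets a chance to drop out of the chain entirely, so
 * later non-blocking invalidations never reach it at all.
 */
struct notifier {
	void (*oom_enter)(struct notifier *n);	/* hypothetical new op */
	struct notifier *next;
	bool registered;
};

static struct notifier *chain;

static void notifier_register(struct notifier *n)
{
	n->next = chain;
	n->registered = true;
	chain = n;
}

static void notifier_unregister(struct notifier *n)
{
	struct notifier **p = &chain;

	while (*p && *p != n)
		p = &(*p)->next;
	if (*p)
		*p = n->next;
	n->registered = false;
}

/*
 * "KVM's" op: simply remove itself, relying on teardown code that can
 * already cope with unbalanced invalidations.
 */
static void kvm_oom_enter(struct notifier *n)
{
	notifier_unregister(n);
}

static void oom_enter_all(void)
{
	struct notifier *n = chain, *next;

	for (; n; n = next) {
		next = n->next;		/* the op may unlink n */
		if (n->oom_enter)
			n->oom_enter(n);
	}
}
```

The appeal is that the non-blocking invalidate path then needs no new locking rules at all for a deregistered subscriber; the open question is what the guest-facing semantics of "KVM stops seeing invalidations" should be.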
Paolo