Re: [PATCH] KVM: mmu_notifier: make mn_invalidate_lock non-sleeping for non-blocking invalidations

From: shaikh kamaluddin

Date: Wed Mar 25 2026 - 01:19:55 EST


On Sat, Mar 14, 2026 at 08:47:40AM +0100, Paolo Bonzini wrote:
> On 3/12/26 20:24, shaikh kamaluddin wrote:
> > > > Alternatively, if you think this needs to be addressed in
> > > > mmu_notifiers(eg. how non_block_start() is applied), I'm happy to
> > > > redirect my efforts there-Please advise.
> > >
> > > Have you considered a "OOM entered" callback for MMU notifiers? KVM's MMU
> > > notifier can just remove itself for example, in fact there is code in
> > > kvm_destroy_vm() to do that even if invalidations are unbalanced.
> > >
> > > Paolo
> > >
> > Thanks for the suggestion! That's a much cleaner approach than what I was considering.
> >
> > If I understand correctly, the idea would be:
> > 1. Add a new MMU notifier callback (e.g., .oom_entered or .release_on_oom)
> > 2. Have KVM implement it to unregister the notifier when OOM reaper starts
> > 3. Leverage the existing kvm_destroy_vm() logic that already handles unbalanced invalidations
>
> Yes pretty much. Essentially, move the existing logic to the new callback
> and invoke it from kvm_destroy_vm().
>

Hi Paolo,
Thank you for the suggestion to use an oom_enter callback approach. I've implemented v2 based on your guidance and have successfully validated it.

Implementation Summary:
-------------------------------------
Following your recommendation, I've added a new oom_enter callback to the mmu_notifier_ops structure. The implementation:

1. Added oom_enter callback to struct mmu_notifier_ops in include/linux/mmu_notifier.h
2. Implemented __mmu_notifier_oom_enter() in mm/mmu_notifier.c to invoke registered callbacks
3. Called mmu_notifier_oom_enter(mm) from __oom_kill_process in mm/oom_kill.c before any invalidations
4. As per your suggestion, moved the existing kvm_destroy_vm() logic that handles unbalanced invalidations into a new helper, kvm_mmu_notifier_detach(), and invoked it from kvm_destroy_vm()
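
The plumbing for items 1-3 looks roughly as follows (a sketch only; apart from the names listed above, field and function shapes are my guesses, not from a merged patch):

    /* include/linux/mmu_notifier.h */
    struct mmu_notifier_ops {
            ...
            /* Called from the OOM path before any invalidations. */
            void (*oom_enter)(struct mmu_notifier *subscription,
                              struct mm_struct *mm);
    };

    /* mm/mmu_notifier.c */
    void __mmu_notifier_oom_enter(struct mm_struct *mm)
    {
            struct mmu_notifier *subscription;
            int id;

            id = srcu_read_lock(&srcu);
            hlist_for_each_entry_rcu(subscription,
                                     &mm->notifier_subscriptions->list, hlist)
                    if (subscription->ops->oom_enter)
                            subscription->ops->oom_enter(subscription, mm);
            srcu_read_unlock(&srcu, id);
    }

    /* mm/oom_kill.c, in __oom_kill_process(), before any invalidations */
    mmu_notifier_oom_enter(mm);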

Key Design Decision:
------------------------------
While testing implementation point 4, I hit a recursive locking problem on the mmu_notifier SRCU lock: it is acquired twice in the same context, once in __mmu_notifier_oom_enter() and again via __synchronize_srcu(), leading to a potential deadlock.
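The recursion comes from the notifier list being walked inside an SRCU read-side critical section on mmu_notifier.c's global 'srcu', while mmu_notifier_unregister() waits on that same srcu_struct. Call-chain sketch (kvm_mmu_notifier_oom_enter() is the callback name from my v2):

    __mmu_notifier_oom_enter(mm)
      srcu_read_lock(&srcu)               <- read-side critical section entered
        ops->oom_enter == kvm_mmu_notifier_oom_enter()
          kvm_mmu_notifier_detach()
            mmu_notifier_unregister()
              synchronize_srcu(&srcu)     <- waits for all readers, including us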
Please find below a log snippet captured while launching the guest VM:
------------------------------------------------------------------------------------------------
[ 399.841599][T10882] OOM_REAPER: START reaping:func:__mmu_notifier_oom_enter
[ 399.841608][T10882] KVM: oom_enter callback invoked for VM:kvm_mmu_notifier_oom_enter
[ 399.841961][T10882]
[ 399.841962][T10882] ============================================
[ 399.841964][T10882] WARNING: possible recursive locking detected
[ 399.841966][T10882] 7.0.0-rc2-00467-g4ae12d8bd9a8-dirty #12 Not tainted
[ 399.841969][T10882] --------------------------------------------
[ 399.841971][T10882] qemu-system-x86/10882 is trying to acquire lock:
[ 399.841974][T10882] ffffffff8db05598 (srcu){.+.+}-{0:0}, at: __synchronize_srcu+0x83/0x380
[ 399.841991][T10882]
[ 399.841991][T10882] but task is already holding lock:
[ 399.841992][T10882] ffffffff8db05598 (srcu){.+.+}-{0:0}, at: __mmu_notifier_oom_enter+0x93/0x1f0
[ 399.842005][T10882]
[ 399.842005][T10882] other info that might help us debug this:
[ 399.842006][T10882] Possible unsafe locking scenario:
[ 399.842006][T10882]
[ 399.842008][T10882] CPU0
[ 399.842009][T10882] ----
[ 399.842010][T10882] lock(srcu);
[ 399.842014][T10882] lock(srcu);
[ 399.842017][T10882]
[ 399.842017][T10882] *** DEADLOCK ***
[ 399.842017][T10882]
[ 399.842018][T10882] May be due to missing lock nesting notation

-------------------------------------------------------------------------------------------------------------------
Deferring kvm_mmu_notifier_detach() to a workqueue resolved the lockdep splat above.
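
A rough sketch of the deferral (the work item and function names are mine and may differ in the actual v2):

    /* virt/kvm/kvm_main.c */
    static void kvm_oom_detach_work_fn(struct work_struct *work)
    {
            struct kvm *kvm = container_of(work, struct kvm, oom_detach_work);

            /* Runs in process context, outside the SRCU read-side
             * section, so the synchronize_srcu() inside
             * mmu_notifier_unregister() can make progress. */
            kvm_mmu_notifier_detach(kvm);
    }

    static void kvm_mmu_notifier_oom_enter(struct mmu_notifier *mn,
                                           struct mm_struct *mm)
    {
            struct kvm *kvm = mmu_notifier_to_kvm(mn);

            /* We are inside srcu_read_lock(&srcu) taken by
             * __mmu_notifier_oom_enter(); unregistering here would
             * deadlock, so punt to a workqueue. */
            schedule_work(&kvm->oom_detach_work);
    }

with INIT_WORK(&kvm->oom_detach_work, kvm_oom_detach_work_fn) done at VM creation.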


Testing:
-------------
I've validated the v2 approach with:

Kernel: v7.0-rc2 with PREEMPT_RT and DEBUG_ATOMIC_SLEEP enabled
Test: Triggered OOM conditions that killed a QEMU process with an active KVM VM
Commands used to generate the scenario:
1. Boot the host VM with virtme-ng:
   vng -v -r ./arch/x86/boot/bzImage --qemu-opts='-m 2G -cpu EPYC,+svm,+npt,+tsc,+invtsc -s'
2. chmod 666 /dev/kvm
3. dmesg -c > /dev/null
4. Launch the guest VM:
   qemu-system-x86_64 -enable-kvm -m 1000M -mem-prealloc \
     -monitor none -serial none -display none -nographic & sleep 10

Results:
-------------------
1. oom_enter callback was successfully invoked
2. No SRCU deadlock warnings
3. No "sleeping function called from invalid context" warnings
4. OOM reaper completed successfully
5. Process was reaped without errors



Question:
-------------
Before I send the v2 patch series, I want to confirm this approach aligns with your expectations. Specifically:
1. Is it a good design to defer the common helper kvm_mmu_notifier_detach(), which performs mmu_notifier_unregister() and handles the unbalanced invalidation, to a workqueue?
2. Are there any specific test cases or scenarios you'd like me to validate?

I can send the complete v2 patch series once you confirm this approach is on the right track.

Thanks again for the guidance!

Shaikh Kamal

> > This avoids the whole "convert locks to raw" problem and the complexity of deferring work.
> >
> > I have questions on Testing part:
> > ------------------------------------
> > I tried to reproduce the bug scenario using the virtme-ng then running
> > the stress-ng putting memory pressure on VM, but not able to reproduce
> > the scenario.
> > I tried this way ..
> > vng -v -r ./arch/x86/boot/bzImage
> > VM is up, then running the stress-ng as below
> > stress-ng --vm 2 --vm-bytes 95% --timeout 20s & sleep 5 & dmesg | tail -30 | grep "sleeping function"
> > OOM Killer is triggered, but exact bug not able to reproduce, Please
> > suggest how to reproduce this bug, even we need to verify after code
> > changes which you have suggested.
>
> I don't know, sorry. But with this new approach there will always be a call
> to the new callback from the OOM killer, so it's easier to test.
>
> Thanks,
>
> Paolo
>