Re: [PATCH v2 01/20] locking/rt: Use raw_spin_lock_irqsave() in __rwbase_read_unlock()

From: David Woodhouse

Date: Mon Jun 01 2026 - 09:06:41 EST


On Mon, 2026-06-01 at 11:52 +0100, David Woodhouse wrote:
> On Sat, 2026-05-30 at 16:40 +0200, Paolo Bonzini wrote:
> > On Sat, May 30, 2026 at 3:04 PM David Woodhouse <dwmw2@xxxxxxxxxxxxx> wrote:
> > >
> > > On Sat, 2026-05-30 at 12:26 +0200, Paolo Bonzini wrote:
> > > >
> > > > Yeah, I think so.
> > > >
> > > > The write side needs kvm->srcu so it would have to be yet another SRCU.
> > > > I initially thought that sucks for the code that calls kvm_gpc_check(),
> > > > but maybe not because it simply replaces read_lock/read_unlock.
> > > >
> > > > By using a seqcount for the data, SRCU only needs to be synchronized in
> > > > gpc_unmap().  So, something like this:
> > >
> > > It isn't just gpc_unmap() which does the invalidation. We also
> > > invalidate from the MMU notifier in gfn_to_pfn_cache_invalidate_start()
> > > which would also have to synchronize, wouldn't it?
> >
> > You're right, the write_lock_irq() there drains the readers and that
> > is needed because khva is not pinned, only kmap()-ed.
> >
> > That is already broken for the OOM case under PREEMPT_RT, where
> > rwlock_t becomes sleepable. But using SRCU would break it on
> > !PREEMPT_RT as well.
>
> I don't think 'sleepable' is the problem per se, is it? *Why* does the
> OOM killer use mmu_notifier_invalidate_range_start_nonblock()?
>
> Commit 93065ac753e4 ("mm, oom: distinguish blockable mode for mmu
> notifiers") did say:
>
>     There are several blockable mmu notifiers which might sleep in
>     mmu_notifier_invalidate_range_start and that is a problem for the
>     oom_reaper because it needs to guarantee a forward progress so it cannot
>     depend on any sleepable locks.
>
> But that was in 2018, when mmap_lock was an rw_semaphore.
>
> Is "sleepable" still a problem even under PREEMPT_RT, where almost
> *everything* is now strictly sleepable? Wouldn't that mean drivers
> aren't even allowed to take their own spinlocks?

Yeah, this is *already* hosed by PREEMPT_RT's "haha let's make things
sleepable that nobody ever expected to be" approach.

It's hard to trigger as not only do you have to get the KVM process to
OOM, it also has to be *slow* to die. I ended up doing this:

--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1278,6 +1278,10 @@ void exit_mmap(struct mm_struct *mm)
 	VMA_ITERATOR(vmi, mm, 0);
 	struct unmap_desc unmap;
 
+	if (!strcmp(current->comm, "kvm_oom_test")) {
+		pr_info("exit_mmap: delaying before mmu_notifier_release for kvm_oom_test\n");
+		schedule_timeout_uninterruptible(3*HZ);
+	}
 	/* mm's last user has gone, and its about to be pulled down */
 	mmu_notifier_release(mm);

And then we see it even when taking kvm->mn_invalidate_lock:

kvm_mmu_notifier_invalidate_range_start+0xac
0xffffffff8132732c is in kvm_mmu_notifier_invalidate_range_start (arch/x86/kvm/../../../virt/kvm/kvm_main.c:745).
740		 * adjustments will be imbalanced.
741		 *
742		 * Pairs with the decrement in range_end().
743		 */
744		spin_lock(&kvm->mn_invalidate_lock);
745		kvm->mn_active_invalidate_count++;
746		if (!mmu_notifier_range_blockable(range))
747			pr_info("KVM: non-blockable invalidate_range_start, non_block_count=%d\n", current->non_block_count);
748		spin_unlock(&kvm->mn_invalidate_lock);
749


[ 427.919969] mmap: exit_mmap: delaying before mmu_notifier_release for kvm_oom_test
[ 429.926972] BUG: sleeping function called from invalid context at kernel/locking/spinlock_rt.c:48
[ 429.926978] in_atomic(): 0, irqs_disabled(): 0, non_block: 1, pid: 280, name: oom_reaper
[ 429.926982] preempt_count: 0, expected: 0
[ 429.926984] RCU nest depth: 0, expected: 0
[ 429.926986] 4 locks held by oom_reaper/280:
[ 429.926989] #0: ffff8a61da779cb0 (&mm->mmap_lock){....}-{3:3}, at: oom_reaper+0x150/0x520
[ 429.927006] #1: ffffffffa0934f20 (mmu_notifier_invalidate_range_start){....}-{0:0}, at: zap_vma_for_reaping+0xb7/0x1d0
[ 429.927019] #2: ffffffffa0934f78 (srcu){....}-{0:0}, at: __mmu_notifier_invalidate_range_start+0xae/0x340
[ 429.927029] #3: ffff8a6240295360 (&kvm->mn_invalidate_lock){....}-{2:2}, at: kvm_mmu_notifier_invalidate_range_start+0xac/0x4b0
[ 429.927044] CPU: 26 UID: 0 PID: 280 Comm: oom_reaper Tainted: G S I 7.1.0-rc2+ #2460 PREEMPT_{RT,(lazy)}
[ 429.927051] Tainted: [S]=CPU_OUT_OF_SPEC, [I]=FIRMWARE_WORKAROUND
[ 429.927053] Hardware name: Intel Corporation S2600CW/S2600CW, BIOS SE5C610.86B.01.01.0008.021120151325 02/11/2015
[ 429.927055] Call Trace:
[ 429.927058] <TASK>
[ 429.927062] dump_stack_lvl+0x6e/0xa0
[ 429.927074] __might_resched.cold+0xeb/0x100
[ 429.927084] rt_spin_lock+0x6c/0x1a0
[ 429.927092] ? kvm_mmu_notifier_invalidate_range_start+0xac/0x4b0
[ 429.927102] kvm_mmu_notifier_invalidate_range_start+0xac/0x4b0
[ 429.927110] ? sched_update_numa+0xa0/0x270
[ 429.927129] __mmu_notifier_invalidate_range_start+0x129/0x340
[ 429.927138] ? __pfx_oom_reaper+0x10/0x10
[ 429.927144] zap_vma_for_reaping+0x186/0x1d0
[ 429.927150] ? zap_vma_for_reaping+0xb7/0x1d0
[ 429.927155] ? zap_vma_for_reaping+0xb7/0x1d0
[ 429.927176] __oom_reap_task_mm+0xbf/0x100
[ 429.927191] oom_reaper+0xeb/0x520
[ 429.927199] ? __pfx_autoremove_wake_function+0x10/0x10
[ 429.927212] kthread+0xf5/0x130
[ 429.927217] ? __pfx_kthread+0x10/0x10
[ 429.927224] ret_from_fork+0x286/0x310
[ 429.927232] ? __pfx_kthread+0x10/0x10
[ 429.927236] ret_from_fork_asm+0x1a/0x30
[ 429.927257] </TASK>
[ 429.927260] KVM: non-blockable invalidate_range_start, non_block_count=1
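
For anyone following along, the reason the splat fires even though
in_atomic() and irqs_disabled() are both 0 can be modelled in userspace.
This is only a sketch, not kernel code: every model_* name is made up for
illustration. It assumes that the might_sleep() check also trips on a
non-zero non_block_count (which is what the "non_block: 1" in the report
suggests), and that only spinlock_t, not raw_spinlock_t, becomes sleepable
under PREEMPT_RT.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the per-task state the debug check looks at. */
struct model_task {
	int preempt_count;
	int non_block_count;	/* models current->non_block_count */
	bool irqs_disabled;
};

/* Hypothetical lock kinds: raw_spinlock_t vs. spinlock_t. */
enum model_lock_kind { MODEL_RAW_SPINLOCK, MODEL_SPINLOCK };

/* Entering a non-blockable section, as the notifier core does for the reaper. */
static inline void model_non_block_start(struct model_task *t)
{
	t->non_block_count++;
}

/* Leaving the non-blockable section; must pair with a start. */
static inline void model_non_block_end(struct model_task *t)
{
	assert(t->non_block_count > 0);
	t->non_block_count--;
}

/* True if a might_sleep()-style check would complain in this context. */
static inline bool model_might_sleep_warns(const struct model_task *t)
{
	return t->preempt_count != 0 || t->irqs_disabled ||
	       t->non_block_count != 0;
}

/*
 * True if taking the lock would produce the "sleeping function called
 * from invalid context" report. Assumption: raw spinlocks never sleep,
 * and a spinlock_t only runs the sleep check when preempt_rt is set.
 */
static inline bool model_lock_would_splat(const struct model_task *t,
					  enum model_lock_kind kind,
					  bool preempt_rt)
{
	if (kind == MODEL_RAW_SPINLOCK || !preempt_rt)
		return false;
	return model_might_sleep_warns(t);
}
```

With non_block_count at 1, model_lock_would_splat() returns true only for
MODEL_SPINLOCK with preempt_rt set. The same acquisition on !PREEMPT_RT,
or a raw spinlock on either configuration, passes.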
