Re: sched_ext/lavd hard lockup in old call_rcu_tasks_generic needadjust path

From: Matt Fleming

Date: Fri Jun 12 2026 - 06:56:12 EST


On Thu, Jun 11, 2026 at 06:45:14AM -0700, Paul E. McKenney wrote:
> On Thu, Jun 11, 2026 at 02:02:58PM +0100, Matt Fleming wrote:
> > On Tue, Jun 09, 2026 at 04:23:23AM -0700, Paul E. McKenney wrote:
> > >
> > > Does commenting out the 'call_rcu_tasks_generic+547' printk() avoid the
> > > issue? If so, that printk() might be deferred or some such.
> > >
> > > "But if you cannot trust printk(), what *can* you trust?" ;-)
> >
> > I tried this and it still crashes so apparently not! I'll keep digging
> > to find the real cause (it's somewhat cumbersome to reproduce this hard
> > lockup).
>
> If this code path is nevertheless involved, one thing that might speed
> things up would be to do bursts of call_rcu_tasks() from lots of CPUs,
> then avoid doing any call_rcu_tasks() for some time, then do a single
> isolated call_rcu_tasks(). Or maybe you are already doing this.

Thanks, I managed to shrink the time to reproduce the lockup and it's
now clear that the bug is an ABBA deadlock on cbs_gbl_lock.

CPU #1
==========
sched_ext_free()
  task_rq_lock()                            // acquires rq->lock
  scx_exit_task()
    SCX_CALL_OP_TASK(.exit_task)
      bpf_task_storage_delete()
        bpf_selem_unlink()
          bpf_selem_unlink_storage()
            bpf_selem_free()
              call_rcu_tasks_trace()
                call_rcu_tasks_generic()
                  raw_spin_lock(cbs_gbl_lock)   // BLOCKS: CPU #2 owns it

CPU #2
==========
rcu_tasks_kthread()
  rcu_tasks_one_gp()
    raw_spin_lock(cbs_gbl_lock)             // acquired
    pr_info("Starting switch ...")          // still under cbs_gbl_lock
      console_unlock()
        wake_up_q()
          try_to_wake_up(repro)
            raw_spin_lock(rq->lock)         // BLOCKS: CPU #1 owns it

Given this, I can see why removing the single printk() didn't fix
anything: any code path that wakes a task while holding cbs_gbl_lock
can trigger this hard lockup. Right now in v6.18 that's a few
printk()s and a WARN_ON_ONCE().

This issue doesn't exist in v7.0-rc1 because of commit c27cea4416a3
("rcu: Re-implement RCU Tasks Trace in terms of SRCU-fast").

Is there an easy way to defer call_rcu_tasks_generic() work so it gets
executed without rq->lock being held? I'm assuming backporting the SRCU
patches would be too invasive for LTS?
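
To illustrate what I mean by deferring, here is a rough userspace sketch
of the general shape, not a proposed patch. Everything in it is
hypothetical: the pthread mutexes stand in for rq->lock and cbs_gbl_lock,
enqueue_cb() stands in for call_rcu_tasks_generic(), and the pending list
is a made-up placeholder for whatever deferral mechanism would actually
be appropriate in the kernel. The only point is that the callback is
recorded while the first lock is held and the second lock is taken only
after the first has been dropped.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical stand-ins for rq->lock and cbs_gbl_lock. */
static pthread_mutex_t rq_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t cbs_lock = PTHREAD_MUTEX_INITIALIZER;

struct deferred {
	struct deferred *next;
	void (*func)(struct deferred *);
};

static struct deferred *pending;  /* protected by rq_lock in this sketch */
static int enqueued;              /* protected by cbs_lock */

/* Stand-in for call_rcu_tasks_generic(): needs cbs_lock. */
static void enqueue_cb(struct deferred *d)
{
	(void)d;
	pthread_mutex_lock(&cbs_lock);
	enqueued++;
	pthread_mutex_unlock(&cbs_lock);
}

/* Record work instead of running it. Caller must hold rq_lock. */
static void defer(struct deferred *d, void (*func)(struct deferred *))
{
	d->func = func;
	d->next = pending;
	pending = d;
}

/* Stand-in for the exit-task path that runs under rq_lock. */
static void exit_task_path(struct deferred *d)
{
	struct deferred *list;

	pthread_mutex_lock(&rq_lock);
	defer(d, enqueue_cb);       /* no cbs_lock taken under rq_lock */
	list = pending;             /* detach the list before unlocking */
	pending = NULL;
	pthread_mutex_unlock(&rq_lock);

	/* rq_lock is no longer held, so taking cbs_lock is safe here. */
	while (list) {
		struct deferred *next = list->next;

		list->func(list);
		list = next;
	}
}
```

With this ordering cbs_lock is never acquired while rq_lock is held, so
the inversion from the trace above cannot occur in the sketch. Whether
something equivalent is practical on the real exit-task path is exactly
what I'm unsure about.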

Thanks,
Matt