[PATCH 0/9] sched: make WARN_ON under rq->lock deadlock-safe (SCHED_WARN_ON)

From: Rik van Riel

Date: Wed Jun 10 2026 - 22:23:51 EST


A plain WARN_ON()/WARN_ON_ONCE()/WARN()/WARN_ONCE() emits at KERN_WARNING.
On a machine with a legacy or boot console registered (e.g. console=ttyS0),
vprintk_emit() takes the synchronous "legacy_direct" path:

    console_trylock_spinning() + console_unlock()
      -> up(&console_sem) -> wake_up_process() -> try_to_wake_up()

which grabs the woken task's ->pi_lock and its rq->lock. Almost every WARN
in the scheduling-class hot paths fires while the current CPU already holds
an rq->lock or a ->pi_lock, so this re-enters the scheduler and can deadlock
(recursively on rq->lock, or by inverting the ->pi_lock -> rq->lock order).
The nbcon and klogd wakeups are deferred via irq_work and are safe; only the
legacy console path is synchronous.

This deadlock bit us when the WARN_ON_ONCE in __sum_w_vruntime_add()
fired, but many other WARN instances in the scheduler code appear to
be vulnerable to the exact same deadlock.

The scheduler already works around this in a handful of spots by using
printk_deferred() instead of printk(), but WARN_ON() has no such variant: it
emits at KERN_WARNING, not LOGLEVEL_SCHED, so it is not deferred.

Patch 1 adds SCHED_WARN_ON()/SCHED_WARN_ON_ONCE() and the SCHED_WARN()/
SCHED_WARN_ONCE() message-carrying forms, which behave exactly like their
WARN*() counterparts but bracket the report in a printk_deferred section so
the console output is handed to irq_work instead of being emitted
synchronously. The bracket is entered only on the (cold) firing path, so the
hot path cost is unchanged -- just the condition test. (printk_deferred
toggles a per-CPU counter that must be balanced on one CPU, which is
guaranteed because rq->lock/->pi_lock are raw_spinlock_t and disable
preemption; a lockdep_assert_preemption_disabled() catches misuse and
compiles away without CONFIG_PROVE_LOCKING.)
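
As a rough illustration of the bracketing idea only, here is a userspace
model, not the code from patch 1. Every name in it (deferred_depth,
model_report(), MODEL_SCHED_WARN_ON(), ...) is hypothetical; a plain global
stands in for the per-CPU counter, a tally stands in for the irq_work
hand-off, and an assert() stands in for the balance requirement. It only
shows that the condition test is all the hot path pays, and that the
enter/exit pair wraps nothing but the cold report.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-ins: the kernel keeps this nesting count per CPU. */
static int deferred_depth;   /* > 0 means "inside a deferred section" */
static int sync_reports;     /* reports emitted on the synchronous path */
static int deferred_reports; /* reports handed off for later emission */

static void model_deferred_enter(void)
{
	deferred_depth++;
}

static void model_deferred_exit(void)
{
	/* Enter/exit must balance; an unbalanced exit is a bug. */
	assert(deferred_depth > 0);
	deferred_depth--;
}

/* Models the report itself; it only records which path was taken. */
static void model_report(const char *cond, const char *file, int line)
{
	bool deferred = deferred_depth > 0;

	if (deferred)
		deferred_reports++;
	else
		sync_reports++;
	fprintf(stderr, "WARNING: %s at %s:%d (%s)\n", cond, file, line,
		deferred ? "deferred" : "synchronous");
}

/* Cold path: bracket the report so its output is deferred. */
static bool model_sched_warn_slow(const char *cond, const char *file, int line)
{
	model_deferred_enter();
	model_report(cond, file, line);
	model_deferred_exit();
	return true;
}

/* Plain variant: reports directly, i.e. on the synchronous path. */
#define MODEL_WARN_ON(cond) \
	((cond) ? (model_report(#cond, __FILE__, __LINE__), true) : false)

/* Deferred variant: the hot path is still only the condition test. */
#define MODEL_SCHED_WARN_ON(cond) \
	((cond) ? model_sched_warn_slow(#cond, __FILE__, __LINE__) : false)
```

Both model macros evaluate to the truth value of the condition, so they can
sit inside an if () in the same way WARN_ON() is commonly used.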

Patches 2-9 convert, one file at a time (for bisectability), the WARN*()
calls that execute under rq->lock or ->pi_lock. WARN sites in
setup/teardown/sysfs/preemptible paths are left alone (they are not in the
hazard class and SCHED_WARN_ON would trip the lockdep assert there).

  1  sched: add SCHED_WARN_ON()/.../SCHED_WARN_ONCE()
  2  sched/core (34 sites)
  3  sched/fair (33 sites)
  4  sched/deadline (36 sites)
  5  sched/rt (17 sites)
  6  sched_ext (49 sites)
  7  sched/core_sched (3 sites)
  8  sched/deadline (cpudeadline.c) (3 sites)
  9  sched/rt (cpupri.c) (1 site)

Not converted: two sites with mixed preemptible/locked callers
(nohz_balance_exit_idle() in fair.c, next_task_group() in rt.c) are left as
plain WARN_ON_ONCE() pending per-site analysis -- converting them blindly
would risk a false lockdep_assert_preemption_disabled() on the preemptible
caller path.

Built (full bzImage, x86_64) on tip sched/core. Based on sched/core plus the
EEVDF reweight vlag-clamp fix; the series itself is independent of that fix.

Split up into one patch per .c file in kernel/sched to make things a
little less unwieldy.

Rik van Riel (9):
  sched: add SCHED_WARN_ON()/SCHED_WARN_ON_ONCE()/SCHED_WARN()/SCHED_WARN_ONCE()
  sched/core: defer WARN console output under rq->lock
  sched/fair: defer WARN console output under rq->lock
  sched/deadline: defer WARN console output under rq->lock
  sched/rt: defer WARN console output under rq->lock
  sched_ext: defer WARN console output under rq->lock
  sched/core_sched: defer WARN console output under rq->lock
  sched/deadline: defer WARN console output under rq->lock
  sched/rt: defer WARN console output under rq->lock

 kernel/sched/core.c        | 68 +++++++++++++-------------
 kernel/sched/core_sched.c  |  6 +--
 kernel/sched/cpudeadline.c |  6 +--
 kernel/sched/cpupri.c      |  2 +-
 kernel/sched/deadline.c    | 72 ++++++++++++++--------------
 kernel/sched/ext.c         | 98 +++++++++++++++++++-------------------
 kernel/sched/fair.c        | 66 ++++++++++++-------------
 kernel/sched/rt.c          | 34 ++++++-------
 kernel/sched/sched.h       | 52 ++++++++++++++++++++
 9 files changed, 228 insertions(+), 176 deletions(-)


base-commit: eaf710f74e602ea2fd517f798066d6988072f3ae
--
2.53.0-Meta